
Inspur AIStation Empowers Efficient GPU Resource Sharing

By Rongguo Zhang, AI R&D Engineer, Inspur Information

GPUs offer a huge advantage in large-scale parallel computing, providing ideal acceleration for big data, AI training and inference, image rendering, and more. However, GPU processing often faces real-world challenges such as poor resource management and low utilization. AIStation is an Inspur-developed AI development platform specifically designed to address these issues with an easy-to-set-up, fine-grained GPU resource scheduling system.

Pain points for maintaining GPU computing resources

AI developers, AI system researchers, and enterprises in the midst of digital transformation may experience the following issues when utilizing GPU computing resources:

  • Poor management of GPU resources, since GPUs are often shared among multiple processes, users, and tasks.
  • Low utilization of GPU resources, because AI services with low computing demands fail to make full use of a whole GPU.
  • Difficulty in quickly requesting and recycling GPU resources; online AI services need auto-scaling according to queries per second (QPS) to meet demand.

To address these issues, the AIStation inference platform provides stable allocation, scheduling, and management of fine-grained GPU resources, making it an optimal solution for enterprise users seeking to use GPUs efficiently.

Overview of AIStation's GPU sharing

The AIStation inference platform provides a GPU sharing system that lets a single GPU accelerator card be shared among multiple containers (or services) for any application that uses GPUs as computing resources. AIStation supports fine-grained allocation and scheduling of both GPU memory and GPU kernels, so users can deploy different types of services on the same GPU and push its utilization toward 100%. Moreover, AIStation ensures memory isolation among the services sharing a card. By computing an optimal scheduling strategy, AIStation minimizes surplus resources while keeping pre-deployed services secure. When services are scheduled across different GPUs, idle GPU resources remain available to other services; this resource scheduling also applies to GPU resources across nodes.
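To make the "minimize surplus resources" idea concrete, here is a hypothetical best-fit placement sketch: each service requests a slice of GPU memory, and the scheduler picks the GPU whose remaining free memory leaves the smallest surplus. The function and numbers are illustrative assumptions, not AIStation's actual API or algorithm.

```python
def place_service(free_mem_gb, request_gb):
    """Return the index of the GPU that fits request_gb with minimal
    leftover memory, or None if no GPU has enough free memory."""
    best, best_surplus = None, None
    for i, free in enumerate(free_mem_gb):
        surplus = free - request_gb
        if surplus >= 0 and (best_surplus is None or surplus < best_surplus):
            best, best_surplus = i, surplus
    if best is not None:
        free_mem_gb[best] -= request_gb  # reserve the memory on that GPU
    return best

# Three GPUs with 10, 32, and 6 GB free: an 8 GB request goes to GPU 0
# (surplus 2 GB) rather than GPU 1 (surplus 24 GB).
free_mem = [10, 32, 6]
print(place_service(free_mem, 8))  # → 0
print(free_mem)                    # → [2, 32, 6]
```

Best-fit packing like this tends to leave larger contiguous free blocks on other GPUs, which is what keeps idle capacity available for later services.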

AIStation can also offer fine-grained GPU resource scaling based on HPA (Horizontal Pod Autoscaler) and QPS, meaning that the number of service replicas can be scaled according to metrics such as CPU utilization, average memory utilization, and QPS.
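The QPS-driven scaling described above can be sketched with the standard HPA-style formula, desired = ceil(current replicas × current metric / target metric). The per-replica QPS target and the replica bounds below are assumptions for illustration.

```python
import math

def desired_replicas(current_replicas, current_qps, target_qps_per_replica,
                     min_replicas=1, max_replicas=10):
    """HPA-style calculation: scale so that each replica serves roughly
    target_qps_per_replica, clamped to [min_replicas, max_replicas]."""
    per_replica_qps = current_qps / current_replicas
    desired = math.ceil(current_replicas * per_replica_qps / target_qps_per_replica)
    return max(min_replicas, min(max_replicas, desired))

# 2 replicas serving 900 QPS in total, targeting 300 QPS per replica:
print(desired_replicas(2, 900, 300))  # → 3
```

The same shape works for CPU or memory utilization; only the metric and target change.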

Extremely low loss in computing performance. AIStation's GPU sharing system has an average performance loss of 1.3%, which has minimal impact on application performance.

AIStation's scenario-based design

Non-invasive architectural design. AIStation can be easily integrated into other platforms and deployed with only YAML files and Docker images, making it available out of the box.

High availability (HA). In the GPU sharing system, each control component is designed for high availability. At any time, only one instance of each module acts as the leader and carries out the module's work; if the leader crashes, a new leader is immediately elected so the control component stays available.
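The leader-election idea above can be sketched minimally: among the live instances of a module, exactly one is leader, and when it disappears the survivors immediately agree on a replacement. Production systems delegate this to a coordination service (e.g. etcd leases); this toy version, an assumption for illustration, simply picks the lowest surviving instance ID.

```python
def elect_leader(live_instances):
    """Deterministic toy election: the lowest-named live instance leads."""
    return min(live_instances) if live_instances else None

instances = {"ctrl-0", "ctrl-1", "ctrl-2"}
leader = elect_leader(instances)   # "ctrl-0" is the leader
instances.discard(leader)          # simulate the leader crashing
print(elect_leader(instances))     # → ctrl-1 takes over immediately
```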

Refined monitoring. AIStation monitors the GPU utilization of each user's applications in real time, and calculates and stores the relevant data, enabling fine-grained, real-time visibility into GPU usage.

Typical cases

Financial industry

A company in the financial industry needed a unified algorithm application platform for its insurance business to centrally manage algorithm applications from different ISVs and improve resource utilization. The reuse rate of its GPU resources was severely constrained, and the massive volume of inference computation required human intervention. If peak-load adjustment was not executed in time, problems such as slow responses to requests, high computation latency, and computation interruptions would emerge.

After introducing the AIStation inference platform, resource management improved greatly across its large-scale inference scenarios. Most notably, the reuse rate of GPU resources increased by 300%. This allowed the customer to flexibly handle different types of online inference services and greatly enhanced the stability of its business system.

Energy industry

A company in the energy industry has two 8-GPU servers equipped with NVIDIA V100 GPUs (32 GB of memory each), shared among a 28-person development team. The company needed to properly allocate the 16 available GPUs for its developers' inference tests. With fewer GPUs than developers, efficiently allocating and utilizing GPU resources was a major problem.

With Inspur AIStation, each GPU was divided into 8 instances, each allocated 4 GB of memory. In this way, the 16 GPU cards provided 128 instances for developers, with 4 to 5 instances available to each. The utilization rate of each GPU became 8 times higher than before.
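Working through the numbers in this case confirms how the figures fit together: 16 V100 cards with 32 GB each, split 8 ways, shared by 28 developers.

```python
gpus, mem_per_gpu_gb, instances_per_gpu, developers = 16, 32, 8, 28

mem_per_instance = mem_per_gpu_gb // instances_per_gpu  # 32 / 8 = 4 GB
total_instances = gpus * instances_per_gpu              # 16 * 8 = 128 instances
per_developer = total_instances / developers            # 128 / 28 ≈ 4.6

print(mem_per_instance, total_instances, round(per_developer, 1))  # → 4 128 4.6
```

The roughly 4.6 instances per developer matches the "4 to 5 instances" figure, and dividing each card 8 ways is what yields the 8x utilization gain.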
