By Baihong Li, Senior Architect
MLPerf is a suite of industry-standard benchmarks for measuring machine learning performance. It was established in 2018 by several organizations including Google, Harvard University, Stanford University, and Baidu. The benchmarks measure a system's training or inference time for a given target metric, and the results are released every year. MLPerf training tasks include image classification (ResNet50), object detection (SSD and Mask R-CNN), recommendation (DLRM), natural language processing (BERT), and reinforcement learning (MiniGo). Two new benchmarks were introduced in the latest Training v1.0 suite: speech recognition (RNN-T) and medical image segmentation (3D U-Net). This document focuses on the image classification model ResNet50.
ResNet, short for residual network, is a convolutional neural network widely used for image classification. It is also a classic backbone for many computer vision tasks. The architecture is shown below. A convolution operation is first performed on the input, followed by 4 stages of residual blocks, and a fully connected layer at the end performs the classification. ResNet50 has 50 weighted layers: 49 convolution layers and the final fully connected layer.
Figure 1. ResNet architecture (top) and ResNet34 architecture (bottom)
(Source: Deep Residual Learning for Image Recognition by He Kaiming, et al.)
The ResNet50 training task has been included since the earliest MLPerf v0.5 benchmarks. The figure below shows the best single server performance results of ResNet50 in the previous MLPerf training benchmarks. In the MLPerf Training v0.7 benchmarks, Inspur AI server NF5488A5 completed the ResNet50 training task in 33.37 minutes, 16.1% faster than other servers with the same configuration, and ranked first in single server performance. In the latest MLPerf Training v1.0 benchmarks, Inspur AI server NF5688M6 completed the ResNet50 training task in 27.38 minutes, 17.95% faster than the v0.7 result.
Figure 2. Optimal single server performance results of ResNet50 in previous MLPerf training benchmarks
The development and optimization of hardware and software contribute to performance breakthroughs. Let's drill down on how Inspur got this result, the computing platform requirements for ResNet50, and how to improve training speed.
Introduction to ResNet50 Training
In the MLPerf Training v1.0 benchmarks, the dataset used for ResNet50 is the ImageNet 2012 dataset containing 1.28 million images (note: registration is required to download the data). The quality target is 75.9% classification accuracy and the minimum number of runs is 5. The result submitted is the training time (in minutes) needed to reach the target accuracy. Smaller values indicate better performance. The final score is the mean of the 3 runs that remain after the highest and lowest of the 5 results are dropped.
Let's take a look at the ResNet50 training process. First, the server reads the training set from the hard drive, decodes it, and preprocesses the images. Then the training framework trains the model on the preprocessed data until it reaches the target accuracy after a finite number of epochs.
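The scoring rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the official MLPerf results tooling: the function name `mlperf_score` and the run times are made up for the example.

```python
from statistics import mean


def mlperf_score(run_minutes):
    """Mean of the 3 runs left after dropping the highest and lowest of 5."""
    if len(run_minutes) != 5:
        raise ValueError("this sketch expects exactly 5 runs")
    kept = sorted(run_minutes)[1:-1]  # drop the fastest and the slowest run
    return mean(kept)


# Made-up run times in minutes, for illustration only.
example_runs = [27.9, 27.2, 27.4, 28.6, 27.3]
score = mlperf_score(example_runs)
print(f"score: {score:.2f} minutes")
```

Dropping the extremes keeps one unusually lucky or unlucky run from dominating the reported time.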
Figure 3. ResNet50 training process
Hardware Platform Selection
Hardware platform selection is important for ResNet50 training performance. Drive read performance, CPU computing performance, CPU-GPU transmission performance, and GPU computing performance all have a significant impact on training speed. Drive read performance determines how fast training data is served. With NVIDIA DALI, CPU performance, CPU-GPU transmission bandwidth, and GPU performance jointly determine how fast data is preprocessed. The speed of forward and backward propagation is determined by GPU performance and GPU-GPU transmission bandwidth. As a metaphor, any worker who fails to keep up with the others on an assembly line causes a product pile-up. Likewise, in a server, no single component should be markedly slower than the rest, or it becomes a bottleneck that limits the final result.
For this MLPerf benchmark, Inspur used NF5688M6 and NF5488A5 servers as the ResNet50 training platform. The servers can complete training tasks quickly by combining the robust key components mentioned above to improve their performance, meeting the hardware performance requirements for training.
Within a 6U chassis, NF5688M6 accommodates 2 of the latest Intel Ice Lake CPUs and 8 of the latest NVIDIA 500W A100 GPUs interconnected via NVSwitch. It enables high-speed CPU-GPU data transmission over PCIe 4.0, and its air cooling design uses an independent air duct to prevent backflow, providing a stable environment for the 8 500W A100 GPUs even at high ambient temperatures. In the MLPerf Training v1.0 benchmarks, NF5688M6 ranked first in single server performance for ResNet50, DLRM, and SSD.
Within a 4U chassis, NF5488A5 accommodates 8 high-performance NVIDIA A100 GPUs with liquid cooling and 2 AMD EPYC 7742 CPUs with PCIe 4.0, delivering powerful single server training performance and data throughput for AI applications. NF5488A5 set the record for single server training performance of ResNet50 in the MLPerf Training v0.7 benchmarks, and ranked first for single server performance of BERT in MLPerf Training v1.0.
Figure 4. Inspur server NF5688M6 (left) and NF5488A5 (right)
Training Tuning Method
The training time for ResNet50 models is mostly affected by two factors. One is the number of epochs required to reach the target accuracy. With everything else equal, the fewer epochs required, the shorter the training time. So we need to find hyperparameters that decrease the number of epochs. The other is the speed of each stage shown in Figure 3, including reading data, preprocessing data, and training. The ResNet50 training data is the ImageNet 2012 dataset containing 1.28 million images, which places high demands on transmission bandwidth and computing power. According to the bucket theory, model training speed is determined by the slowest part of the pipeline. So we need to analyze every stage in the pipeline, look for bottlenecks, and optimize accordingly.
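To make the bucket theory concrete, the sketch below models the pipeline as stages that run concurrently, so the slowest stage sets the overall rate. The stage names and the images-per-second figures are hypothetical and are not measurements from the Inspur systems.

```python
def pipeline_throughput(stage_rates):
    """Return (bottleneck stage, images/s) for stages that run concurrently."""
    name = min(stage_rates, key=stage_rates.get)
    return name, stage_rates[name]


def epoch_seconds(num_images, stage_rates):
    """Time for one pass over the dataset, limited by the slowest stage."""
    _, rate = pipeline_throughput(stage_rates)
    return num_images / rate


# Hypothetical per-stage rates in images per second.
rates = {"read": 30000.0, "preprocess": 22000.0, "train": 25000.0}
bottleneck, rate = pipeline_throughput(rates)
seconds = epoch_seconds(1_280_000, rates)
print(f"bottleneck: {bottleneck}, {seconds:.1f} s per epoch")
```

In this made-up case, speeding up the drives or the GPUs would not help until preprocessing catches up, which is why each stage has to be measured separately.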
In response to these two factors, Inspur uses the following tuning methods:
1. Tune hyperparameters including learning rate, batch size, and optimizer to reduce the number of epochs for the ResNet50 model from 41 to 35, achieving a performance increase of about 15%.
2. Optimize DALI to accelerate decoding and data processing with GPU resources, achieving a performance increase of about 1%.
3. Improve GPU-GPU communication efficiency with NCCL to accelerate training, achieving a performance increase of about 0.1%.
Detailed tuning processes are described below.
Reading Training Set
The training set is specified by the MLPerf organization. As mentioned above, the cost of reading the images depends on the drive reading speed and transmission bandwidth. Better drives can offer faster speeds. Using RAID 0 can also boost reading speed. We have tested the training speed on two different drives with the same RAID 0 configuration, and the results differ by about 5‰. So it is important to choose the right drive.
Decoding and Processing Data
Decoding and data processing are usually performed concurrently after data reading is complete. Decoding images is time-consuming and often becomes the bottleneck. CPUs are generally used to decode images, but with limited CPU resources the performance is lackluster. We use the NVIDIA Data Loading Library (DALI), a highly optimized execution engine for accelerating computer vision deep learning applications, to decode images and preprocess data with GPU resources. This yields a fourfold performance increase over the original framework, which makes DALI a good choice for data preprocessing.
After the preprocessing method is selected, we need to optimize it to get the most out of the system and the data. First, find the upper limit for data preprocessing: use synthetic data as the training data to eliminate the cost of data reading and preprocessing, and measure only the training throughput. Then adjust the DALI parameters to bring the throughput for real data as close as possible to that of the synthetic data.
We can do so by adjusting the following parameters:
DALI compute allocation: DALI can allocate preprocessing operations to CPUs and GPUs in a specified ratio. If the GPU ratio is too low, the GPUs are underused. If the GPU ratio is too high, preprocessing takes resources away from training.
DALI threads: If the number of threads is too large, the threads take up resources and some of them end up waiting. If the number is too small, the available resources are underused.
DALI data prefetching: If too little data is prefetched, subsequent operations spend time waiting for data. If too much data is prefetched, it takes up too much GPU memory and compute and can even cause the GPU to run out of memory.
Combined function: Decoding and randomly cropping the image with the ImageDecoderRandomCrop function is usually much faster than executing two separate functions.
The first 3 parameters need to be adjusted and tested based on hardware and model to find the optimal combination. A performance increase of about 7‰ can be achieved by tuning the first 3 parameters. Leveraging combined functions can usually lift performance by 1%.
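The search over the first 3 parameters can be sketched as a small grid search that compares each configuration with the throughput ceiling measured on simulated data. Everything named here is hypothetical: `gpu_ratio`, `threads`, and `prefetch` are placeholder keys rather than DALI argument names, and `toy_measure` stands in for a real benchmark run on your hardware.

```python
from itertools import product


def tune(measure, grid, ceiling):
    """Try every combination and report how close the best one gets to the ceiling."""
    keys = list(grid)
    best_cfg, best_rate = None, float("-inf")
    for values in product(*(grid[k] for k in keys)):
        cfg = dict(zip(keys, values))
        rate = measure(cfg)  # images/s for this configuration
        if rate > best_rate:
            best_cfg, best_rate = cfg, rate
    return best_cfg, best_rate, best_rate / ceiling


def toy_measure(cfg):
    # Invented stand-in for a short training run; replace with a real measurement.
    penalty = (abs(cfg["gpu_ratio"] - 0.5) * 4000
               + abs(cfg["threads"] - 6) * 300
               + abs(cfg["prefetch"] - 3) * 200)
    return 24000.0 - penalty


grid = {"gpu_ratio": [0.25, 0.5, 0.75], "threads": [4, 6, 8], "prefetch": [2, 3, 4]}
best_cfg, best_rate, fraction = tune(toy_measure, grid, ceiling=25000.0)
print(best_cfg, f"{fraction:.0%} of the ceiling")
```

An exhaustive grid is only practical when each measurement is short; with long runs it may be better to tune one parameter at a time.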
The key to the above optimizations is to implement your own Pipeline class. Example of key code for ResNet50 data preprocessing:
Figure 5. Key code for ResNet50 data preprocessing
Training Framework Selection
There are numerous training frameworks, such as TensorFlow, PyTorch, and MXNet. The performance results of different models vary with different frameworks. After doing many comparisons, we found that MXNet offers the best performance for ResNet50 training.
In addition, training on multiple GPUs requires a large amount of data transmission, so distributed training either uses Horovod with these frameworks or uses NCCL directly. Horovod itself relies on NCCL for data transmission. Some frameworks in the sample code provided by MLPerf ship with default NCCL parameter settings, but the best values may differ between hardware platforms. For example, MAXCHANNEL is 32 for the latest NVSwitch architecture, whereas its optimal value for the previous NVLink architecture is 16. In most cases the NCCL defaults are adequate, but because they affect transmission speed they are worth checking. Our tests also showed that the latest version of NCCL is not necessarily the fastest for every hardware configuration. You can measure the performance with the NCCL tests (nccl-tests).
One of the key elements in training is hyperparameter tuning. The right set of hyperparameters can reduce the number of epochs required, which improves performance. Suppose 2 runners go down a hill at the same speed, and one finds a 10 km path while the other finds an 8 km path. The runner on the 8 km path clearly has the advantage and reaches the finish line first. Likewise, if two submitters have the same training throughput, but the first submitter's model needs 10 epochs to reach the target accuracy and the second's needs only 8, the second submitter gets the faster result. In fact, to avoid unfairness caused by taking the "wrong path", MLPerf Training has a Hyperparameter Borrowing rule, which allows submitters to use hyperparameters from another submitter's implementation.
Finding the "right path" is not easy. The following are some hyperparameter tuning tips:
Learning rate: The learning rate affects both convergence speed and accuracy. Tuning it can be tedious, because training may never converge if the learning rate is too high or too low. Generally, hyperparameters such as the learning rate are tuned coarsely first and then finely. First, adjust the learning rate by factors of 10, such as 0.01, 0.1, and 1. Once the best range is determined, fine-tune the learning rate in steps of 10% of the base value.
Batch size: Increasing the batch size usually improves training speed and the utilization of AI accelerators. But a batch size that is too large can cause the system to run out of memory and the accuracy to decrease. A small batch size is not the answer either: in our tests it also resulted in lower accuracy. So batch size needs tuning as well. In addition, batch size and learning rate affect each other. Generally, if you increase the batch size, you should also increase the learning rate.
Optimizer: The most common optimizer for classification models is stochastic gradient descent (SGD). Using optimizers such as Adam can achieve a faster speed, but usually at the cost of lower accuracy. Layer-wise adaptive rate scaling (LARS, https://arxiv.org/abs/1708.03888) is also a popular optimizer used by MLPerf submitters. The formula of LARS is as follows:
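For reference, the update rule can be written as follows, using the notation of the LARS paper linked above. For layer $l$ at step $t$, $w_t^l$ are the weights, $\nabla L(w_t^l)$ is the gradient, $\gamma_t$ is the global learning rate, $\eta$ is the trust coefficient, $\beta$ is the weight decay, and $m$ is the momentum. The exact variant used in a given MLPerf submission may differ in details.

```latex
\begin{aligned}
\lambda^l &= \eta \, \frac{\lVert w_t^l \rVert}{\lVert \nabla L(w_t^l) \rVert + \beta \lVert w_t^l \rVert} \\
v_{t+1}^l &= m \, v_t^l + \gamma_t \, \lambda^l \left( \nabla L(w_t^l) + \beta \, w_t^l \right) \\
w_{t+1}^l &= w_t^l - v_{t+1}^l
\end{aligned}
```

The local rate $\lambda^l$ scales each layer's step by the ratio of its weight norm to its gradient norm, so layers with small weights or large gradients take proportionally smaller steps.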
LARS is an extension of SGD with momentum, which adapts a learning rate per layer. It allows each layer of the network to dynamically adjust the learning rate based on its conditions. This solves the instability issue caused by a high learning rate at early stages of training with a large batch size.
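The coarse-then-fine learning rate search from the tips above can be sketched as follows. It assumes a function that returns the number of epochs needed to converge for a given learning rate, where lower is better. In practice each call is a full or shortened training run, so `toy_epochs` here is an invented stand-in.

```python
def coarse_to_fine(evaluate, coarse=(0.01, 0.1, 1.0), steps=5):
    """Pick the best coarse value, then probe around it in 10% increments."""
    base = min(coarse, key=evaluate)
    fine = [base * (1 + 0.1 * k) for k in range(-steps, steps + 1)]
    best = min(fine, key=evaluate)
    return base, best


def toy_epochs(lr):
    # Invented stand-in that pretends 0.13 is the ideal learning rate.
    return 35 + 100 * abs(lr - 0.13)


base, best = coarse_to_fine(toy_epochs)
print(f"coarse pick: {base}, fine pick: {best:.2f}")
```

The coarse pass needs only a handful of runs to find the right order of magnitude, which keeps the more expensive fine pass confined to a narrow range.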
With the above hyperparameter tuning methods, we have reduced the number of epochs required for the ResNet50 model from 41 to 35, achieving a performance increase of about 15%. Choosing the "right path" can improve performance significantly.
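The "about 15%" figure follows from simple arithmetic if the time per epoch is assumed to stay constant, which is a simplification. A quick check:

```python
def epoch_gain(old_epochs, new_epochs):
    """Return (fraction of training time saved, speedup factor)."""
    time_saved = (old_epochs - new_epochs) / old_epochs
    speedup = old_epochs / new_epochs
    return time_saved, speedup


time_saved, speedup = epoch_gain(41, 35)
print(f"time saved: {time_saved:.1%}, speedup: {speedup:.2f}x")
```

Going from 41 to 35 epochs removes roughly 14.6% of the training time, which is the same as running about 1.17 times faster.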
In summary, lots of factors can influence the training performance. In this article, we use the ResNet50 model from MLPerf Training v1.0 benchmarks as an example to describe how to improve training speed with hardware and software optimization in terms of data processing, training frameworks, and hyperparameters. The optimization code from Inspur has been shared to GitHub (see Appendix 1). Feel free to try it yourself. We hope this article can help you improve your model performance.
After over 3 years of development, the MLPerf benchmark has become a mature standard for evaluating the performance of various AI computing platforms in real-life scenarios with up-to-date models. MLPerf has an open community that drives the development of AI technologies thanks to submitters' contribution of their optimization methods. Inspur has shared the ResNet convergence optimization method used in MLPerf v0.7 in the community, and the method has been widely adopted in the v1.0 benchmarks. In the future, mainstream chip and system providers including Google, NVIDIA, Intel, Inspur, and Dell will continue contributing to MLPerf with their software and hardware optimization methods, improving the performance of AI computing platforms, and laying a solid foundation for the application of AI technologies in more scenarios.
Code from Inspur: https://github.com/mlcommons/training_results_v1.0/tree/master/Inspur/benchmarks/resnet/implementations/mxnet
Follow the steps below to build the environment:
- Download the above code.
- Download the required data specified in README.MD, and preprocess the data according to Appendix 2 to generate an MXNet dataset.
- Go to the mxnet directory and use docker to build the required image. For example:
docker build --pull -t image_name:image_version .
- After the image is built, modify the parameters for your system in the configuration file config_NF5688M6.sh (enter the optimal values obtained by following the tuning methods).
- The software environment has been built. You can start training. For example, our system is "NF5688M6":
DGXSYSTEM="NF5688M6" CONT=image_name:image_version DATADIR=/path/to/preprocessed/data LOGDIR=/path/to/logfile ./run_with_docker.sh
After the training is completed, use the time values of "run_stop" and "run_start" in the log to calculate the training time (in seconds).
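The calculation can be scripted. The sketch below assumes that each log event is a line containing the `:::MLLOG` prefix followed by a JSON object with `key` and `time_ms` fields, which is the layout produced by the MLPerf logging library. Verify this against your own log before relying on it. The sample lines and timestamps are made up.

```python
import json

PREFIX = ":::MLLOG"


def training_seconds(log_lines):
    """Return the seconds elapsed between the run_start and run_stop events."""
    times = {}
    for line in log_lines:
        if PREFIX not in line:
            continue  # skip ordinary framework output
        record = json.loads(line.split(PREFIX, 1)[1])
        if record.get("key") in ("run_start", "run_stop"):
            times[record["key"]] = record["time_ms"]
    return (times["run_stop"] - times["run_start"]) / 1000.0


# Made-up log excerpt, for illustration only.
sample = [
    ':::MLLOG {"key": "run_start", "time_ms": 1620000000000, "value": null}',
    "epoch 1 finished",
    ':::MLLOG {"key": "run_stop", "time_ms": 1620001800000, "value": null}',
]
elapsed = training_seconds(sample)
print(f"{elapsed:.0f} s = {elapsed / 60:.2f} min")
```

Dividing the result by 60 gives the training time in minutes, the unit used for the submitted result.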