Training Yuan 1.0 – a Massive Chinese Language Model with 245.7 Billion Parameters

By Inspur Artificial Intelligence Research Institute
Language models have grown exponentially in recent years. BERT, introduced in 2018, had 340 million parameters; just two years later, in 2020, GPT-3 arrived with 175 billion. The parameter count jumped again in September 2021, when the Inspur Artificial Intelligence Research Institute (Inspur AI Research) unveiled Yuan 1.0. It was released as the largest Chinese language AI model in the world, with 245.7 billion parameters trained on a 5TB dataset. Yuan 1.0 showed impressive performance in language processing, ranking first in the zero-shot and few-shot CLUE (Chinese Language Understanding Evaluation) benchmarks, ZeroCLUE and FewCLUE. Most impressively, Yuan 1.0 can generate written content that human testers distinguish from human-written text less than 50% of the time.
The advanced language capabilities of Yuan 1.0 necessitate a huge number of parameters, which brings many challenges in model training and deployment. This article focuses on the computing challenges of Yuan 1.0 and the training methods used to address them.
The basic architecture of Yuan 1.0 is a language model. We therefore consider two model architectures: the Language Model (LM) and the Prefix Language Model (PLM). In the pre-train and fine-tune paradigm, an LM performs better in natural language generation (NLG) tasks but comparatively worse in natural language understanding (NLU) tasks. In contrast, a PLM performs well in both NLU and NLG tasks. The structures of the LM and PLM are presented in Fig. 1.
Figure 1. Model architecture (LM shown at left, PLM shown at right)
When comparing the downstream task performance of Yuan LM-13B and Yuan PLM-13B, we found that Yuan LM-13B performs better in zero-shot learning, while Yuan PLM-13B performs better after fine-tuning. Fine-tuning generally brings higher accuracy in most tasks, but it would cost tremendous computational resources for the Yuan 245B model, making fine-tuning uneconomical. We therefore chose the LM as the basic architecture of the Yuan 245B model.
2.1. Challenges
The first problem to solve in training Yuan 1.0 is the huge amount of data and compute required for model training.
Without sufficient training data, the model will not perform well. We built a Chinese corpus containing 5TB of high-quality text, which is sufficient to train the Yuan 245B model without sampling the dataset twice.
Figure 2. Data pre-processing procedure
For computing power, estimates of the compute required to train Yuan 1.0 were based on OpenAI's petaflop/s-day (pfs-day) standard. According to Wikipedia (https://en.wikipedia.org/wiki/OpenAI), training GPT-3 required 3640 pfs-days, which would translate to about one year of training on 64 A100 GPUs. Yuan 1.0 would require up to 4095 pfs-days.
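These pfs-day figures can be roughly reproduced with the common ≈6·N·D FLOPs rule of thumb (N parameters, D training tokens). The sketch below is an approximation for illustration, not the accounting used by OpenAI or in the Yuan paper:

```python
# Rough training-compute estimate using the common ~6 * N * D FLOPs
# rule of thumb (N = parameters, D = training tokens).
# Illustrative only; the paper's exact accounting may differ.

PFS_DAY_FLOPS = 1e15 * 86400  # one petaflop/s-day expressed in FLOPs

def pfs_days(params: float, tokens: float) -> float:
    """Approximate training cost in petaflop/s-days."""
    return 6 * params * tokens / PFS_DAY_FLOPS

# Yuan 1.0: 245.7B parameters, 180B training tokens (figures from this article)
print(round(pfs_days(245.7e9, 180e9)))  # ~3071, same order as the 4095 pfs-day figure
```

The rule of thumb lands in the same ballpark as the quoted 4095 pfs-days; the gap is expected, since the simple 6·N·D count ignores activation recomputation and other overheads.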
Training a model with more than 100B parameters requires a huge amount of computational resources. GPT-3 175B, for example, was trained on a cluster of 10,000 GPUs. Such requirements put training at this scale out of reach for most researchers. To accelerate the training of Yuan 1.0, and thereby reduce energy costs and carbon emissions, we co-designed the model architecture and the large-scale distributed training strategy.
Data parallelism is the usual approach to distributed model training, but on its own it cannot overcome the limited memory capacity of each GPU when training a large model. A specially designed distributed training strategy is required to address the limits of GPU compute and memory.
2.2. Strategies
To address the GPU memory challenge, tensor, pipeline, and data parallelism strategies were combined, and Yuan 1.0 was trained on 180 billion tokens across 2128 GPUs.
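To see how the three strategies multiply out, the sketch below enumerates (tensor, pipeline, data) parallel degrees that exactly fill a 2128-GPU cluster with 8 GPUs per node. The layout (8, 38, 7) is one factorization consistent with the 2128-GPU figure above; treat it as an illustration, not the published configuration search:

```python
# Enumerate hypothetical (t, p, d) parallelism layouts for a 2128-GPU cluster.
# Tensor parallelism (t) is kept within a node (t <= 8 GPUs per node),
# pipeline stages (p) span nodes, and data parallelism (d) replicates pipelines.

GPUS = 2128
NODE_GPUS = 8

def valid_layouts(gpus: int, node_gpus: int):
    """Return all (t, p, d) with t * p * d == gpus and t a power-of-two <= node size."""
    layouts = []
    for t in (1, 2, 4, 8):
        if gpus % t != 0:
            continue
        rest = gpus // t
        for p in range(1, rest + 1):
            if rest % p == 0:
                layouts.append((t, p, rest // p))
    return layouts

# 8-way tensor * 38 pipeline stages * 7 data-parallel replicas = 2128 GPUs
print((8, 38, 7) in valid_layouts(GPUS, NODE_GPUS))  # True
```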
i. Tensor Parallelism
Combining tensor parallelism with data parallelism provides one solution for training models that do not fit in the available memory of a single GPU. For this strategy, GPUs are combined into groups; for example, a group can include the 8 GPUs in one server. Tensor parallelism splits the model within each group, and data parallelism is employed across groups (servers).
NVIDIA has released a transformer-based tensor parallelism solution in Megatron-LM, where the parameters and computation of each block are evenly split over N GPUs to maximize GPU utilization. Figure 3 shows the tensor parallel forward computing in the MLP and self-attention layers of the transformer block.
Figure 3. Tensor parallelism, where s is the sequence length and h is the hidden size
ii. Pipeline Parallelism
Figure 4. Pipeline parallelism
For language models with hundreds of billions of parameters, the parameters can hardly be stored on a single node. Pipeline parallelism, which splits the layers of the LM among multiple nodes, is applied to solve this problem (Figure 4). Each node is one stage in the pipeline; it receives outputs from the previous stage and sends its results to the next. A node is idle whenever the inputs from its previous neighbor are not yet ready, and this idle time is called the pipeline bubble. To increase the performance of pipeline parallelism, we have to decrease the time spent in the pipeline bubble. The fraction of ideal computation time spent in the bubble is, in the standard Megatron-style estimate:

bubble fraction = (p − 1) / m

where p is the number of pipeline stages and m is the number of micro-batches per global batch.
According to the above relation, the time spent in the pipeline bubble grows with the number of pipeline stages p (and hence with the number of layers L that must be distributed), and decreases as the number of micro-batches m increases. Performance is therefore better when m is much larger than p.
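The trend can be checked numerically. A minimal sketch, assuming the standard (p − 1)/m bubble estimate discussed above:

```python
# Pipeline-bubble fraction (p - 1) / m for p pipeline stages and m micro-batches.

def bubble_fraction(p: int, m: int) -> float:
    """Fraction of ideal computation time lost to the pipeline bubble."""
    return (p - 1) / m

# With as many micro-batches as stages, most time is bubble;
# raising m by 10x shrinks the bubble by 10x.
print(bubble_fraction(38, 38))   # ~0.974
print(bubble_fraction(38, 380))  # ~0.097
```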
In pipeline parallelism, efficiency is governed by the ratio of computation time to data communication time per node. Per micro-batch, the computation of a stage scales roughly as O(s·h²) (plus an O(s²·h) attention term), while the activations communicated between stages scale only as O(s·h). The computational efficiency of a pipeline node therefore improves as h and S increase, similar to the situation in tensor parallelism.
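To make the trend concrete, the toy calculation below pits the standard rough forward FLOP count of a transformer layer (≈24·b·s·h² + 4·b·s²·h) against the b·s·h activation values sent to the next stage. The constants are textbook approximations, not figures from the paper:

```python
# Rough per-micro-batch compute-to-communication ratio for one pipeline stage.
# flops / comm simplifies to layers_per_stage * (24*h + 4*s), so the ratio
# grows linearly with hidden size h and sequence length s.

def compute_comm_ratio(b: int, s: int, h: int, layers_per_stage: int = 1) -> float:
    flops = layers_per_stage * (24 * b * s * h**2 + 4 * b * s**2 * h)
    comm = b * s * h  # activation values crossing the stage boundary
    return flops / comm

# ratio grows with hidden size h and with sequence length s
assert compute_comm_ratio(1, 2048, 16384) < compute_comm_ratio(1, 2048, 32768)
assert compute_comm_ratio(1, 1024, 16384) < compute_comm_ratio(1, 4096, 16384)
```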
iii. Data Parallelism
Figure 6. Data parallelism
The global batch size is split among pipeline groups by data parallelism; each pipeline group holds a copy of the model and is fed local batches. In data parallelism, the per-replica computation grows with the local batch size and sequence length, while the gradient all-reduce traffic depends only on the number of parameters, so the ratio of computing time to communication time improves as the global batch size B and sequence length S increase. Because the memory requirement is quadratic in the sequence length S, increasing the global batch size is the more practical lever. However, models may suffer numerical instabilities, or fail to converge, if the global batch size is too large. To ensure training stability, the global batch size was kept below a fixed number of tokens.
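The mechanics of data parallelism can be sketched with gradient averaging over local batches. The NumPy toy below uses linear regression for brevity; it illustrates the all-reduce step, not the actual Yuan training code:

```python
import numpy as np

# Data parallelism in miniature: each replica computes gradients on its local
# shard of the global batch, then an all-reduce averages them. With equal shard
# sizes, the averaged gradient equals the full-batch gradient exactly.

rng = np.random.default_rng(1)
B, n_features, n_replicas = 8, 3, 2
X = rng.standard_normal((B, n_features))
y = rng.standard_normal(B)
w = rng.standard_normal(n_features)

def grad(Xb, yb, w):
    """Gradient of mean((Xb @ w - yb)^2) with respect to w."""
    return 2 * Xb.T @ (Xb @ w - yb) / len(yb)

g_ref = grad(X, y, w)  # what a single device with the full batch would compute

# split the global batch into equal local batches, then "all-reduce" (average)
shards_X = np.split(X, n_replicas)
shards_y = np.split(y, n_replicas)
g_avg = np.mean([grad(Xs, ys, w) for Xs, ys in zip(shards_X, shards_y)], axis=0)

assert np.allclose(g_avg, g_ref)
```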
Based on the above analysis, design rules were derived for Yuan 1.0, and the model architecture, node configurations, and distributed training strategies were chosen accordingly to match the cluster hardware; the detailed configurations are given in the paper referenced below.
Figure: model convergence during Yuan 1.0 training
For more information on Yuan 1.0, view the paper published by Inspur AI Research on arxiv: https://arxiv.org/abs/2110.04725.
A more detailed “White Paper on NLP Model Training Solutions” is also available. Follow the “Inspur AIHPC” Official Account on WeChat and reply with “NLP white paper” to download.