
Training Yuan 1.0 – a Massive Chinese Language Model with 245.7 Billion Parameters

By Inspur Artificial Intelligence Research Institute

  Language models have grown exponentially in recent years. BERT, introduced in 2018, had 340 million parameters; just two years later, in 2020, GPT-3 arrived with 175 billion. The parameter count jumped again in September 2021, when the Inspur Artificial Intelligence Research Institute (Inspur AI Research) unveiled Yuan 1.0. It was released as the largest Chinese language AI model in the world, with 245.7 billion parameters trained on a 5TB dataset. Yuan 1.0 showed impressive performance in language processing and ranked first in the CLUE (Chinese Language Understanding Evaluation) zero-shot and few-shot learning benchmarks, ZeroCLUE and FewCLUE. Most impressively, Yuan 1.0 can generate written content that human testers distinguish from human-written content less than 50% of the time.


  The advanced language capabilities of Yuan 1.0 necessitate a high parameter count, which brings many challenges in model training and deployment. This article focuses on the computing challenges of Yuan 1.0 and the training methods used to address them.


  1. Yuan 1.0 Model Architecture

  The basic architecture for Yuan 1.0 is a language model, so we consider two candidate architectures: the Language Model (LM) and the Prefix Language Model (PLM). In the pre-train-and-fine-tune pattern, an LM performs better on natural language generation (NLG) tasks but comparatively worse on natural language understanding (NLU) tasks. In contrast, a PLM performs well on both NLU and NLG tasks. The structures of the LM and PLM are presented in Fig. 1.


  Figure 1. Model architecture (LM shown at left, PLM shown at right)

  When comparing the downstream task performance of Yuan LM-13B and Yuan PLM-13B, we found that Yuan LM-13B performs better on zero-shot learning, while Yuan PLM-13B outperforms it after fine-tuning. Fine-tuning generally brings better accuracy on most tasks. However, fine-tuning the Yuan 245B model would cost tremendous computational resources, which makes it uneconomical. We therefore chose the LM as the basic architecture of the Yuan 245B model.


  2. Training Yuan 1.0

  2.1. Challenges

  The first problem to be solved in training Yuan 1.0 is the huge amount of data and compute required for model training.


  A model trained on insufficient data will not perform well. We built a Chinese corpus with 5TB of high-quality text, which is sufficient to train the Yuan 245B model without sampling the dataset twice.


  Figure 2. Data pre-processing procedure

  For computing power, estimates of the compute required to train Yuan 1.0 were based on OpenAI's petaflop/s-day (pfs-day) standard. According to Wikipedia, training GPT-3 required 3640 pfs-days, which translates to about one year of training on 64 A100 GPUs. Yuan 1.0 would require up to 4095 pfs-days.
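As a sanity check on these figures, here is a minimal back-of-envelope estimate. The A100 peak of roughly 312 teraflop/s (FP16) and the 50% sustained utilization are assumptions for illustration, not numbers taken from this article:

```python
# Back-of-envelope check of the pfs-day figures above.
# Assumed (not from the article): A100 peak ~312 TFLOPS for FP16
# dense math, and ~50% sustained utilization in practice.

A100_PEAK_PFLOPS = 0.312      # petaflop/s per GPU (FP16, assumed)
UTILIZATION = 0.5             # assumed sustained fraction of peak
N_GPUS = 64

sustained_pfs = A100_PEAK_PFLOPS * UTILIZATION * N_GPUS   # petaflop/s

gpt3_pfs_days = 3640
print(f"GPT-3 on 64 A100s: ~{gpt3_pfs_days / sustained_pfs:.0f} days")
# → ~365 days, i.e. about one year, matching the estimate above

yuan_pfs_days = 4095
print(f"Yuan 1.0 on 64 A100s: ~{yuan_pfs_days / sustained_pfs:.0f} days")
```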


  Training a model with more than 100B parameters requires a huge amount of computational resources. Take GPT-3 175B for example: it was trained on a cluster of 10,000 GPUs. Such a huge requirement makes it difficult for most researchers to train a model in a similar way. To accelerate the training of Yuan 1.0, and thus reduce energy costs and carbon emissions, we co-designed the model architecture and the large-scale distributed training.


  Data parallelism is usually used for distributed model training, but it cannot solve the problem of limited memory capacity when training large models. A specially designed distributed training strategy is required to address the limits of GPU compute and memory.



  2.2. Strategies

  To address the GPU memory challenge, tensor, pipeline, and data parallelism strategies were combined, and Yuan 1.0 was trained on 180 billion tokens across 2128 GPUs.


  i. Tensor Parallelism

  Combining tensor parallelism and data parallelism provides one solution for training models that do not fit in the available memory of a single GPU. For this strategy, GPUs are combined into groups; for example, a group can include the 8 GPUs in a server. Tensor parallelism is used to split the model within each group, and data parallelism is employed among groups (servers).
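The grouping can be sketched as a rank-to-group mapping. The 8-GPU-per-server and 4-server counts below are illustrative values, not Yuan 1.0's actual cluster shape:

```python
# Sketch of the grouping described above (illustrative sizes: 8 GPUs
# per server, 4 servers): tensor parallelism inside each server, data
# parallelism across servers.

GPUS_PER_SERVER = 8   # tensor-parallel group size (one group per server)
N_SERVERS = 4         # number of data-parallel replicas (assumed)

def tensor_group(rank):
    """GPUs in the same server jointly hold one copy of the model."""
    server = rank // GPUS_PER_SERVER
    return [server * GPUS_PER_SERVER + i for i in range(GPUS_PER_SERVER)]

def data_group(rank):
    """GPUs holding the same shard in different servers sync gradients."""
    local = rank % GPUS_PER_SERVER
    return [s * GPUS_PER_SERVER + local for s in range(N_SERVERS)]

print(tensor_group(10))  # server 1: [8, 9, 10, 11, 12, 13, 14, 15]
print(data_group(10))    # GPU 2 of every server: [2, 10, 18, 26]
```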

  NVIDIA has released a transformer-based tensor parallelism solution in Megatron-LM, where the parameters and computation of each block are evenly split over N GPUs to maximize GPU utilization. Figure 3 shows the tensor parallel forward computing in the MLP and self-attention layers of the transformer block.


  Figure 3. Tensor parallelism



  where s is the sequence length and h is the hidden size.
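The column/row split of Figure 3 can be checked numerically. The sketch below follows the Megatron-LM MLP scheme with toy sizes (not Yuan's), splitting the up-projection by columns and the down-projection by rows:

```python
# Numerical sketch of the tensor-parallel MLP split in Figure 3
# (Megatron-LM scheme): Y = GeLU(X A) B, with A split by columns and
# B by rows over N "GPUs". Sizes are toy values.
import numpy as np

rng = np.random.default_rng(0)
s, h, N = 4, 8, 2                 # sequence length, hidden size, "GPUs"
X = rng.normal(size=(s, h))
A = rng.normal(size=(h, 4 * h))   # MLP up-projection
B = rng.normal(size=(4 * h, h))   # MLP down-projection

def gelu(x):
    """tanh approximation of GeLU."""
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

# Single-device reference result.
Y_ref = gelu(X @ A) @ B

# "GPU" i holds a column slice of A and the matching row slice of B.
# GeLU is elementwise, so it commutes with the column split; an
# all-reduce (here: a plain sum) combines the partial outputs.
A_cols = np.split(A, N, axis=1)
B_rows = np.split(B, N, axis=0)
Y_tp = sum(gelu(X @ A_cols[i]) @ B_rows[i] for i in range(N))

print(np.allclose(Y_ref, Y_tp))  # True
```

Splitting A by columns is what makes this work: the nonlinearity can be applied independently to each shard, so only one communication step (the final sum) is needed per MLP block.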

  ii. Pipeline Parallelism


  Figure 4. Pipeline parallelism

  For language models with hundreds of billions of parameters, the parameters can hardly be stored on a single node. Pipeline parallelism, which splits the layers of the LM among multiple nodes, is applied to solve this problem (Figure 4). Each node is one stage in the pipeline: it receives outputs from the previous stage and sends its results to the next one. A node sits idle if the input from its previous neighbor is not ready; the idle time in a pipeline is called the pipeline bubble. To increase the performance of pipeline parallelism, we have to decrease the time spent in the pipeline bubble. With p pipeline stages and m micro-batches, the fraction of ideal time spent in the pipeline bubble is as follows:

  t_bubble / t_ideal = (p - 1) / m


  According to the above equation, the time spent in the pipeline bubble increases with the number of pipeline stages (and therefore with the number of layers L), and decreases as the number of micro-batches m grows. There will be better performance if m is much larger than the number of pipeline stages.
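A quick calculation of the bubble fraction, using the 38-stage pipeline described later in this article and illustrative micro-batch counts:

```python
# Pipeline bubble fraction (p - 1) / m for a 38-stage pipeline, at a
# few micro-batch counts. p = 38 matches the Yuan 1.0 configuration
# described later; the m values are illustrative.

def bubble_fraction(p, m):
    """Time in the pipeline bubble as a fraction of ideal compute time."""
    return (p - 1) / m

p = 38  # pipeline stages (one per server in a group)
for m in (38, 152, 608):
    print(f"m = {m:4d}: bubble fraction {bubble_fraction(p, m):.1%}")
```

Quadrupling m from 152 to 608 cuts the bubble overhead by the same factor, which is why many small micro-batches are preferred over a few large ones.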

  In pipeline parallelism, the ratio of computation time to data communication time per node is approximately:

  t_compute / t_comm ∝ 24h + 4s

  since each transformer layer performs on the order of 24Bsh^2 + 4Bs^2h floating-point operations on a micro-batch of size B, while the activations passed between stages contain Bsh elements. The computational efficiency of a pipeline node therefore improves as the values of h and s increase, which is similar to the situation in tensor parallelism.

  iii. Data Parallelism


  Figure 6. Data parallelism

  The global batch size is split among the pipeline groups by data parallelism. Each pipeline group holds a full copy of the model and is fed a local batch. In data parallelism, the compute per step scales with Bs(24h^2 + 4sh) per layer, where B is the global batch size, while the gradient all-reduce communicates roughly 12h^2 parameters per layer, so the ratio of computing time to communication time is approximately:

  t_compute / t_comm ∝ Bs(24h + 4s) / (12h)


  If h ≫ s, the formula can be simplified to:

  t_compute / t_comm ∝ Bs


  The computing efficiency improves as the global batch size B and the sequence length s increase. Because the memory requirement is quadratic in the sequence length s, increasing the global batch size appears to be the more effective option.


  However, models may fail to converge if the global batch size is too large. To ensure training stability, the global batch size was kept smaller than  tokens.
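The data-parallel step can be sketched with a toy model, using a NumPy mean in place of the all-reduce collective; the quadratic loss, replica count layout, and sizes below are illustrative only:

```python
# Sketch of the data-parallel step: each replica computes gradients on
# its local batch, then an all-reduce averages them. NumPy stands in
# for the collective; 7 replicas echoes Yuan 1.0's group count.
import numpy as np

rng = np.random.default_rng(1)
N_REPLICAS = 7                      # data-parallel groups
params = rng.normal(size=16)        # toy parameter vector

def local_gradient(params, batch):
    """Gradient of a toy squared-error loss on one batch."""
    return 2 * (params - batch.mean(axis=0))

global_batch = rng.normal(size=(7 * 4, 16))        # 7 local batches of 4
local_batches = np.split(global_batch, N_REPLICAS)

grads = [local_gradient(params, b) for b in local_batches]
avg_grad = np.mean(grads, axis=0)                  # the "all-reduce"

# Averaging equal-sized local gradients reproduces the gradient on
# the full global batch.
print(np.allclose(avg_grad, local_gradient(params, global_batch)))  # True
```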


  Based on the above analysis, the following rules were applied for Yuan 1.0:

  •   Increase the sequence length as much as possible, as it benefits tensor parallelism, pipeline parallelism, and data parallelism. Because the memory requirement is quadratic in the sequence length, it is worthwhile to recompute activations during backward propagation to save memory.
  •   Too many layers in the language model hurt performance, because more layers increase the time spent in the pipeline bubble.
  •   Increasing the hidden size improves the performance of both tensor parallelism and pipeline parallelism.
  •   Increasing the number of micro-batches in a node improves the performance of pipeline parallelism, and increasing the global batch size improves the performance of data parallelism.


  With the above design principles, the following architecture and distributed training configurations were used for Yuan 1.0:



 Based on the architecture and cluster hardware features, the following node configurations and distributed training strategies were selected:

  •   Yuan 1.0 was trained across 2128 GPUs.
  •   The cluster was divided into 7 groups, each containing 38 AI servers. Each group has a full copy of the Yuan 1.0 model, and data parallelism was used among the 7 groups.
  •   Pipeline parallelism was employed across nodes, with each server holding 1/38 of the model's layers (2 transformer layers each, 76 layers in total).
  •   Tensor parallelism was applied intra-node, and the workload of each transformer layer was evenly split over intra-node GPUs.
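These numbers compose multiplicatively, which is easy to verify; the 8 GPUs per server is taken from the earlier tensor-parallelism example, so treat it as an assumption here:

```python
# The three parallelism strategies compose multiplicatively; checking
# the Yuan 1.0 cluster numbers given above.

DATA_PARALLEL = 7        # model replicas (groups of servers)
PIPELINE_STAGES = 38     # servers per group, one pipeline stage each
TENSOR_PARALLEL = 8      # GPUs per server (assumed intra-node count)

total_gpus = DATA_PARALLEL * PIPELINE_STAGES * TENSOR_PARALLEL
print(total_gpus)        # 2128, matching the cluster size

LAYERS = 76
print(LAYERS // PIPELINE_STAGES)  # 2 transformer layers per stage
```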


  The following diagram shows the model convergence: 



  For more information on Yuan 1.0, see the paper published by Inspur AI Research on arXiv:

  A more detailed “White Paper on NLP Model Training Solutions” is also available. Follow the “Inspur AIHPC” Official Account on WeChat and reply with “NLP white paper” to download.
