baichuan-7B is an open-source large-scale pretrained language model based on the Transformer architecture. It has 7 billion parameters, was trained on about 1.2 trillion tokens, supports both Chinese and English, and has a context window of 4,096 tokens.
The overall model follows the standard Transformer architecture and uses the same model design as LLaMA.
Position encoding: rotary embedding (RoPE), the position-encoding scheme adopted by most models at this stage, which extrapolates well to longer contexts. Although the maximum sequence length during training is 4,096, in actual tests the model extends well to 5,000 tokens, as shown in the following figure:
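The rotary embedding mentioned above can be sketched as follows. This is a minimal numpy illustration of the general RoPE idea (rotating each pair of dimensions by a position-dependent angle), not baichuan's actual implementation; the function name and shapes are assumptions for illustration.

```python
import numpy as np

def rotary_embedding(x, base=10000):
    """Apply rotary position embedding (RoPE) to a (seq_len, dim) array.

    Each pair of dimensions is rotated by an angle proportional to the
    token position, so attention dot products depend only on relative
    positions. `dim` must be even. Illustrative sketch only.
    """
    seq_len, dim = x.shape
    # one inverse frequency per dimension pair
    inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))
    angles = np.arange(seq_len)[:, None] * inv_freq[None, :]  # (seq, dim/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin  # 2-D rotation of each pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```

Because each step is a pure rotation, vector norms are preserved and the vector at position 0 is unchanged.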
Activation layer: SwiGLU. The feed-forward layer is widened to (8/3) times the hidden size, i.e. 11,008.
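A minimal sketch of a SwiGLU feed-forward block and of how (8/3) × 4096 yields 11,008 when rounded up to a multiple of 128 (LLaMA-style rounding; the rounding multiple is an assumption, as the text only states the final value):

```python
import numpy as np

def swiglu_ffn_dim(hidden_size, multiple_of=128):
    """Feed-forward width for SwiGLU: (8/3) * hidden, rounded up to a
    multiple (assumed 128 here). hidden=4096 -> 11008."""
    d = int(hidden_size * 8 / 3)
    return multiple_of * ((d + multiple_of - 1) // multiple_of)

def silu(x):
    return x / (1.0 + np.exp(-x))

def swiglu_ffn(x, w_gate, w_up, w_down):
    """SwiGLU FFN: SiLU-gated elementwise product, then down-projection.
    Weight names are illustrative, not baichuan's actual parameter names."""
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down
```

The extra (third) weight matrix is why the width is 8/3 × rather than the classic 4 × : total parameter count stays roughly the same.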
Layer normalization: pre-normalization based on RMSNorm.
- The raw data includes open-source Chinese and English corpora, Chinese web data crawled in-house, and some high-quality knowledge-rich data, totaling more than 10 TB.
- Following related work on data processing, frequency and quality are the two dimensions considered. The raw dataset is filtered at document and sentence granularity using heuristic rules and quality-model scoring, and locality-sensitive hashing is then applied over the full corpus to deduplicate at document and sentence granularity.
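The locality-sensitive-hashing deduplication mentioned above is commonly done with MinHash signatures, where the fraction of matching signature positions estimates Jaccard similarity between documents. A small stdlib-only sketch (function names, shingle size, and signature length are illustrative assumptions, not baichuan's actual pipeline):

```python
import hashlib

def shingles(text, k=5):
    """Character k-gram shingles of a document."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def minhash_signature(text, num_hashes=64, k=5):
    """For each seeded hash function, keep the minimum hash over the
    document's shingles; similar texts agree on many positions."""
    sh = shingles(text, k)
    return [
        min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16) for s in sh)
        for seed in range(num_hashes)
    ]

def est_jaccard(a, b):
    """Estimate Jaccard similarity from signature agreement."""
    sa, sb = minhash_signature(a), minhash_signature(b)
    return sum(x == y for x, y in zip(sa, sb)) / len(sa)
```

Documents whose estimated similarity exceeds a threshold can then be bucketed and dropped as near-duplicates.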
The overall process is as follows:
- After continuous adjustment and multiple rounds of testing, a Chinese-to-English data ratio that performs best on downstream tasks was finally determined.
- An automatically learned data-weighting strategy is used to mix the different types of data.
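Mixing data sources by learned weights amounts to weighted sampling over source pools at batch-construction time. A minimal stdlib sketch under that assumption (the source says the weights are learned automatically; how they are learned is not specified, so they appear here as given inputs):

```python
import random

def sample_batch(sources, weights, batch_size, seed=0):
    """Sample documents from multiple source pools in proportion to
    the given mixture weights. Illustrative sketch only.

    sources: dict mapping source name -> list of documents
    weights: dict mapping source name -> nonnegative mixture weight
    """
    rng = random.Random(seed)
    names = list(sources)
    # pick a source per slot according to the mixture weights
    picks = rng.choices(names, weights=[weights[n] for n in names], k=batch_size)
    return [rng.choice(sources[n]) for n in picks]
```

In practice the weights would be tuned (or learned) so that downstream-task performance, not raw corpus size, drives the mixture.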