Pre-Training with Whole Word Masking for Chinese BERT (Chinese BERT-wwm series models)

In natural language processing, pre-trained language models have become an essential piece of basic technology. To further promote research and development in Chinese information processing, the Joint Laboratory of HIT and iFLYTEK Research (HFL) released BERT-wwm, a Chinese pre-trained model based on Whole Word Masking, together with a series of closely related models: BERT-wwm-ext, RoBERTa-wwm-ext, RoBERTa-wwm-ext-large, RBT3, and RBTL3.

Whole Word Masking (wwm), loosely translatable as "whole-word Mask" or "full-word Mask", is an upgraded version of BERT released by Google on May 31, 2019. It mainly changes the strategy for generating training samples in the pre-training stage. In brief, the original WordPiece-based tokenization splits a complete word into several subwords, and when training samples are generated, these subwords are masked independently at random. Under Whole Word Masking, if any WordPiece subword of a complete word is masked, the remaining subwords of that word are masked as well; in other words, the whole word is masked.

Note that "mask" here refers to the generalized masking operation (replaced with [MASK], kept as the original token, or randomly replaced with another word), and is not limited to the case where a token is replaced with the [MASK] label. For more detailed explanations and examples, please refer to #4.
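
To make the generalized mask concrete, the following is a minimal Python sketch of the 80%/10%/10% replacement scheme from the original BERT, applied to one token that has already been selected for prediction. It is an illustration only, not code from this repository; `vocab` is a placeholder vocabulary list.

```python
import random

def apply_generalized_mask(token, vocab, rng=random):
    """Apply BERT's generalized mask to a token already chosen for prediction."""
    p = rng.random()
    if p < 0.8:
        return "[MASK]"           # 80%: replace with the [MASK] label
    elif p < 0.9:
        return token              # 10%: keep the original token unchanged
    else:
        return rng.choice(vocab)  # 10%: replace with a random vocabulary token
```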

Likewise, because Google's official BERT-base, Chinese model tokenizes Chinese at character granularity and does not take Chinese Word Segmentation (CWS), a traditional NLP step, into account, HFL applied the whole-word masking method to Chinese. The models are trained on Chinese Wikipedia (both Simplified and Traditional), with HIT's LTP used as the word segmentation tool, so that all Chinese characters belonging to the same word are masked together.

The following table shows samples generated with whole-word masking. Note: for ease of understanding, the examples below only consider the case where tokens are replaced with the [MASK] label.

| Description | Sample |
| --- | --- |
| Original text | 使用语言模型来预测下一个词的probability。(Use a language model to predict the probability of the next word.) |
| Segmented text | 使用 语言 模型 来 预测 下 一个 词 的 probability 。 |
| Original Mask input | 使 用 语 言 [MASK] 型 来 [MASK] 测 下 一 个 词 的 pro [MASK] ##lity 。 |
| Whole-word Mask input | 使 用 语 言 [MASK] [MASK] 来 [MASK] [MASK] 下 一 个 词 的 [MASK] [MASK] [MASK] 。 |
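
The sketch below illustrates, under simplified assumptions, how whole-word masking groups subwords. The input is already word-segmented (in the real pipeline LTP produces this segmentation), and `subword_map` stands in for a WordPiece tokenizer; words without an entry fall back to per-character splitting, matching the character-level tokenization of Chinese BERT. The function and names are illustrative, not taken from this repository.

```python
import random

def whole_word_mask(segmented_words, subword_map, mask_rate=0.15, rng=random):
    """Whole-word masking: if a word is chosen, mask every one of its subwords."""
    output = []
    for word in segmented_words:
        # WordPiece stand-in: use the map for non-Chinese words, otherwise
        # split into single characters as Chinese BERT does.
        subwords = subword_map.get(word, list(word))
        if rng.random() < mask_rate:
            output.extend(["[MASK]"] * len(subwords))  # mask the whole word
        else:
            output.extend(subwords)
    return output

segmented = ["使用", "语言", "模型", "来", "预测", "下", "一个", "词", "的", "probability", "。"]
subword_map = {"probability": ["pro", "##bab", "##ility"]}
print(" ".join(whole_word_mask(segmented, subword_map)))
```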

Chinese model download

This directory mainly contains base-size models, so "base" is not marked in the model abbreviations. Models of other sizes carry the corresponding mark (e.g., large).

  • BERT-large model: 24-layer, 1024-hidden, 16-heads, 330M parameters
  • BERT-base model: 12-layer, 768-hidden, 12-heads, 110M parameters

| Model | Corpus | Google download | iFLYTEK Cloud download |
| --- | --- | --- | --- |
| RBT6, Chinese | EXT data[1] | - | TensorFlow (password XNMA) |
| RBT4, Chinese | EXT data[1] | - | TensorFlow (password e8dN) |
| RBTL3, Chinese | EXT data[1] | TensorFlow / PyTorch | TensorFlow (password vySW) |
| RBT3, Chinese | EXT data[1] | TensorFlow / PyTorch | TensorFlow (password b9nx) |
| RoBERTa-wwm-ext-large, Chinese | EXT data[1] | TensorFlow / PyTorch | TensorFlow (password u6gC) |
| RoBERTa-wwm-ext, Chinese | EXT data[1] | TensorFlow / PyTorch | TensorFlow (password Xe1p) |
| BERT-wwm-ext, Chinese | EXT data[1] | TensorFlow / PyTorch | TensorFlow (password 4cMG) |
| BERT-wwm, Chinese | Chinese Wiki | TensorFlow / PyTorch | TensorFlow (password 07Xj) |
| BERT-base, Chinese (Google) | Chinese Wiki | Google Cloud | - |
| BERT-base, Multilingual Cased (Google) | Multilingual Wiki | Google Cloud | - |
| BERT-base, Multilingual Uncased (Google) | Multilingual Wiki | Google Cloud | - |

[1] EXT data includes: Chinese Wikipedia, other encyclopedias, news, Q&A and other data, with a total word count of 5.4B.
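
For reference, below is a minimal loading sketch with Hugging Face Transformers, assuming the corresponding checkpoints are also published on the Hugging Face Hub under the `hfl` organization (e.g. `hfl/chinese-roberta-wwm-ext`). The RoBERTa-wwm series is stored in BERT format, so it is typically loaded with the `Bert*` classes rather than the `Roberta*` classes.

```python
from transformers import BertTokenizer, BertModel

# Load a whole-word-masking checkpoint; swap the name for any model in the table above.
tokenizer = BertTokenizer.from_pretrained("hfl/chinese-roberta-wwm-ext")
model = BertModel.from_pretrained("hfl/chinese-roberta-wwm-ext")

inputs = tokenizer("使用语言模型来预测下一个词的probability。", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch_size, sequence_length, hidden_size)
```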

