M3E is the abbreviation of Moka Massive Mixed Embedding

  • Moka, the model is trained, open-sourced, and evaluated by MokaAI; the training script uses uniem, and the evaluation benchmark uses MTEB-zh
  • Massive, the model is trained on a dataset of tens of millions (2200w+, i.e. 22M+) of Chinese sentence pairs
  • Mixed, the model supports Chinese-English bilingual homogeneous text similarity calculation, heterogeneous text retrieval, and other functions; code retrieval will also be supported in the future
  • Embedding, the model is a text embedding model that converts natural language into dense vectors

The datasets used to train the M3E models include a large number of non-commercial datasets, so the M3E models are also non-commercial and are for research use only. The M3E datasets are officially labeled as commercial or non-commercial, so users can train their own models according to their needs.

Model comparison

| Model          | Parameters | Dimension | Chinese | English | s2s | s2p | s2c | Open source | Compatibility | s2s Acc | s2p ndcg@10 |
|----------------|------------|-----------|---------|---------|-----|-----|-----|-------------|---------------|---------|-------------|
| m3e-small      | 24M        | 512       | yes     | no      | yes | no  | no  | yes         | excellent     | 0.5834  | 0.7262      |
| m3e-base       | 110M       | 768       | yes     | yes     | yes | yes | no  | yes         | excellent     | 0.6157  | 0.8004      |
| text2vec       | 110M       | 768       | yes     | no      | yes | no  | no  | yes         | excellent     | 0.5755  | 0.6346      |
| openai-ada-002 | unknown    | 1536      | yes     | yes     | yes | yes | yes | no          | excellent     | 0.5956  | 0.7786      |

Explanation:

  • s2s, i.e. sentence to sentence, represents the embedding ability between homogeneous texts; applicable tasks: text similarity, duplicate question detection, text classification, etc.
  • s2p, i.e. sentence to passage, represents the embedding ability between heterogeneous texts; applicable tasks: text retrieval, GPT memory modules, etc.
  • s2c, i.e. sentence to code, represents the embedding ability between natural language and programming languages; applicable task: code retrieval
  • Compatibility represents how well the model is supported by projects in the open source community. Since both m3e and text2vec can be used directly through sentence-transformers, they are comparable to openai in terms of community support; see the usage sketch after this list
  • Acc & ndcg@10, see the notes below for details
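As a quick illustration of that compatibility, here is a minimal s2s similarity sketch through sentence-transformers. The checkpoint ID moka-ai/m3e-base and the example sentences are assumptions for illustration, not an official snippet.

```python
from sentence_transformers import SentenceTransformer, util

# Load the m3e-base checkpoint (assumed Hugging Face ID: "moka-ai/m3e-base").
model = SentenceTransformer("moka-ai/m3e-base")

# s2s: similarity between two homogeneous sentences (one Chinese, one English).
sentences = [
    "M3E 是一个中文文本嵌入模型",  # "M3E is a Chinese text embedding model"
    "M3E is a text embedding model for Chinese",
]
embeddings = model.encode(sentences, normalize_embeddings=True)

# Cosine similarity between the two sentence vectors.
score = util.cos_sim(embeddings[0], embeddings[1])
print(f"s2s cosine similarity: {score.item():.4f}")
```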

Tips:

  • If your usage scenario is mainly Chinese with a small amount of English, the m3e series of models is recommended
  • If you work with multiple languages and data privacy is not a concern, openai-ada-002 is recommended
  • For code retrieval scenarios, ada-002 is recommended

Training approach

M3E is trained on sentence-pair datasets with contrastive learning using in-batch negative sampling. To make in-batch negatives effective, we used an A100 80G GPU to maximize the batch size and trained for 1 epoch on a total of 2200W+ (22M+) sentence pairs. The training script uses uniem; you can see the details there.
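In-batch negative sampling treats, for each pair in a batch, the other pairs' second sentences as negatives, so a larger batch gives more negatives per example. Below is a minimal sketch of that loss (an InfoNCE-style objective over the in-batch similarity matrix), not the uniem implementation; the embedding dimension, temperature value, and toy tensors are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def in_batch_negative_loss(query_emb: torch.Tensor,
                           passage_emb: torch.Tensor,
                           temperature: float = 0.05) -> torch.Tensor:
    """Contrastive loss with in-batch negatives.

    query_emb, passage_emb: (batch_size, dim) embeddings of the two sides of
    each sentence pair. For row i, passage_emb[i] is the positive and every
    other row in the batch serves as a negative.
    """
    q = F.normalize(query_emb, dim=-1)
    p = F.normalize(passage_emb, dim=-1)
    # (batch_size, batch_size) cosine similarity matrix, scaled by temperature.
    logits = q @ p.T / temperature
    # The matching pair sits on the diagonal.
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)

# Toy usage: a larger batch size yields more in-batch negatives per pair,
# which is why batch size is maximized on an A100 80G.
q = torch.randn(8, 768)
p = torch.randn(8, 768)
print(in_batch_negative_loss(q, p))
```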

Features

  • Chinese training set, M3E is trained on a large-scale sentence-pair dataset covering Chinese encyclopedia, finance, medicine, law, news, academic and other domains, with a total of 22 million sentence-pair samples. For details, see the M3E dataset
  • English training set, M3E is trained on the MEDI dataset of 145W (1.45 million) English triplets, provided by the instructor team; see the MEDI dataset for details
  • Instruction dataset, M3E uses a 300W+ (3M+) instruction fine-tuning dataset, which enables M3E to follow instructions when encoding text. This part of the work is mainly inspired by instructor-embedding
  • Base model, M3E is trained from the hfl lab's RoBERTa series of models; small and base versions are currently provided, choose according to your needs
  • ALL IN ONE, M3E aims to be an ALL IN ONE text embedding model that supports both similarity judgment between homogeneous sentences and heterogeneous text retrieval, so a single model covers all application scenarios; code retrieval will also be supported in the future (a retrieval sketch follows this list)
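To make the heterogeneous (s2p) retrieval use case concrete, here is a minimal retrieval sketch under the same sentence-transformers assumption as above; the checkpoint ID, query, and passages are illustrative, not part of the official documentation.

```python
from sentence_transformers import SentenceTransformer, util

# Assumed checkpoint ID; see the m3e model cards for the exact names.
model = SentenceTransformer("moka-ai/m3e-base")

passages = [
    "M3E 在 2200 万+ 中文句对上训练。",          # "M3E is trained on 22M+ Chinese sentence pairs."
    "The MEDI dataset contains English triplets.",
    "sentence-transformers can load embedding models directly.",
]
query = "M3E 的训练数据规模有多大？"              # "How large is M3E's training data?"

# Encode the heterogeneous texts: a short query vs. longer passages.
passage_emb = model.encode(passages, normalize_embeddings=True)
query_emb = model.encode(query, normalize_embeddings=True)

# Rank passages by cosine similarity to the query and keep the top 2.
hits = util.semantic_search(query_emb, passage_emb, top_k=2)[0]
for hit in hits:
    print(passages[hit["corpus_id"]], round(hit["score"], 4))
```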
