M3E Homepage, Documentation and Download – Open Source Chinese Embedding Model New SOTA – News Fast Delivery

M3E is the abbreviation of Moka Massive Mixed Embedding

Moka, this model is trained by MokaAI, open source and evaluation, used by training script uniem to evaluate BenchMark using MTEB-en

Massive, this model passestens of millions (2200w+) Chinese sentences are trained on the data set

Mixed, this model supports bilingual homogeneous text similarity calculation, heterogeneous text retrieval and other functions, and will also support code retrieval in the future

Embedding, this model is a text embedding model that can convert natural language into a dense vector

The data sets used by M3E models include a large number of non-commercial data sets, so M3E models are also non-commercial and are for research use only. Officially marked commercial and non-commercial data sets on the M3E data set, users can train themselves according to their own needs.

Model comparison

	number of parameters	dimension	Chinese	English	s2s	s2p	s2c	open source	compatibility	s2s Acc	s2p ndcg@10
m3e-small	24M	512	yes	no	yes	no	no	yes	excellent	0.5834	0.7262
m3e-base	110M	768	yes	yes	yes	yes	no	yes	excellent	0.6157	0.8004
text2vec	110M	768	yes	no	yes	no	no	yes	excellent	0.5755	0.6346
openai-ada-002	unknown	1536	yes	yes	yes	yes	yes	no	excellent	0.5956	0.7786

illustrate:

s2s, sentence to sentence, represents the embedding ability between homogeneous texts, applicable tasks: text similarity, duplicate question detection, text classification, etc.

s2p, namely sentence to passage, represents the embedding ability between heterogeneous texts, applicable tasks: text retrieval, GPT memory module, etc.

s2c, namely sentence to code, represents the embedding ability between natural language and programming language, applicable tasks: code retrieval

Compatibility represents the degree to which the model is supported by various projects in the open source community. Since both m3e and text2vec can be used directly through sentence-transformers, it is comparable to openai in terms of community support

ACC & ndcg@10, see the comments below for details

Tips:

The usage scenarios are mainly in Chinese and a small amount of English, it is recommended to use the m3e series of models

If you are using multiple languages and don’t mind data privacy, I recommend using openai-ada-002

For code retrieval scenarios, ada-002 is recommended

training program

M3E uses the contrastive learning method of in-batch negative sampling to train on the sentence pair data set. In order to ensure the effect of in-batch negative sampling, we use A100 80G to maximize the batch-size, and a total of 2200W+ sentence pair data sets Last trained for 1 epoch.The training script uses uniemyou can see the details here.

characteristic

Chinese training set, M3E training on a large-scale sentence pair data set, including Chinese encyclopedia, finance, medical care, law, news, academic and other fields with a total of 22 million sentence pair samples. For details of the data set, see M3E dataset

English training set, M3E uses the MEDI 145W English triplet data set for training, see the data set for details MEDI datasetthis dataset consists of instructor team supply

Instruction data set, M3E uses a 300W+ instruction fine-tuning data set, which enables M3E to follow instructions when encoding text. This part of the work is mainly inspired by instructor-embedding

Base model, M3E using hfl lab’s Roberta A series of models for training, currently provide two versions of small and base, you need to choose

ALL IN ONE, M3E aims to provide an ALL IN ONE text embedding model, which not only supports similarity judgment of homogeneous sentences, but also supports heterogeneous text retrieval. You only need one model to cover all application scenarios, and it will be supported in the future code retrieval

review

#M3E #Homepage #Documentation #Download #Open #Source #Chinese #Embedding #Model #SOTA #News Fast Delivery

Model comparison

training program

characteristic

review

Leave a Comment Cancel Reply