idea

ideaseg is a Chinese tokenizer based on the latest HanLP natural language processing toolkit. It ships with the latest model data and removes the NeuralNetworkParser code and data bundled with HanLP, whose license is not commercial-friendly.

Compared with other tokenizers such as IK and jcseg, HanLP segments words far more accurately, but at the cost of speed. By optimizing the HanLP configuration, ideaseg achieves the best balance between accuracy and segmentation speed.

Compared with other HanLP-based plugins, ideaseg tracks the latest HanLP code and data, removes the parts that cannot be used commercially, configures itself automatically, and bundles the model data so that nothing has to be downloaded separately, which makes it simple and convenient to use.

ideaseg provides three modules:

  1. core ~ Core tokenizer module
  2. elasticsearch ~ ideaseg word segmentation plugin for ElasticSearch (up to version 7.10.2)
  3. opensearch ~ ideaseg word segmentation plugin for OpenSearch (default version 2.4.1)

A note on ElasticSearch versions: starting with version 7.11.1, Elastic changed the ES license and, at the same time, the permission policy for plugins, so plugins are no longer allowed to read and write files. Because the HanLP model data is very large, HanLP speeds things up by generating cache-like files in the plugin's data directory. Therefore, if you use ElasticSearch, please use version 7.10.2 or below; OpenSearch is the recommended choice.

In addition, the data directory contains the HanLP model data.

Because the ElasticSearch plugin mechanism is strictly bound to the version of the engine itself, and there are many versions, this project does not provide precompiled binaries. You need to download the source code and build it yourself.

Build

The following describes how to build the plugin. Before starting, install git, java, maven and the other required tools.

First, determine the exact version of your ElasticSearch or OpenSearch. Assuming you use ElasticSearch 7.10.2, open the ideaseg/elasticsearch/pom.xml file in a text editor and set the value of elasticsearch.version to 7.10.2
(for OpenSearch, modify opensearch/pom.xml instead).

Save the file, open a command-line window, and run the following commands to start the build (the first two fetch the source code if you have not cloned it yet):
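
The same edit could also be scripted. The sketch below assumes the version is declared as an `<elasticsearch.version>` element inside the POM's `<properties>` section, which the description suggests but does not show. The file it edits is a throwaway stand-in written for illustration, not the project's real pom.xml.

```shell
#!/bin/sh
# Stand-in POM fragment (hypothetical content, for illustration only).
pom="$(mktemp)"
cat > "$pom" <<'EOF'
<project>
  <properties>
    <elasticsearch.version>7.9.3</elasticsearch.version>
  </properties>
</project>
EOF

# Replace whatever version is currently set with 7.10.2.
# Writing to a second file first avoids relying on the non-portable `sed -i`.
sed 's|<elasticsearch.version>[^<]*</elasticsearch.version>|<elasticsearch.version>7.10.2</elasticsearch.version>|' "$pom" > "$pom.new"
mv "$pom.new" "$pom"

# Show the resulting property line.
grep 'elasticsearch.version' "$pom"
```

To try this on the real project, point `pom` at ideaseg/elasticsearch/pom.xml and skip the here-document. Check first that the property is actually written this way in that file.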

$ git clone https://gitee.com/indexea/ideaseg
$ cd ideaseg
$ mvn install

After the build completes, a plugin file named ideaseg.zip is generated in each of elasticsearch/target and opensearch/target.

Install

After the build is complete, install the plugin with the plugin management tool provided by ElasticSearch or OpenSearch.

For ElasticSearch the plugin management tool is <elasticsearch>/bin/elasticsearch-plugin, and for OpenSearch it is <opensearch>/bin/opensearch-plugin, where <elasticsearch> and <opensearch> are the directories in which the two services are installed.

Installing the ideaseg plugin in ElasticSearch

$ bin/elasticsearch-plugin install file:///<ideaseg>/elasticsearch/target/ideaseg.zip

Installing the ideaseg plugin in OpenSearch

$ bin/opensearch-plugin install file:///<ideaseg>/opensearch/target/ideaseg.zip

Here <ideaseg> is the path where the ideaseg source code is located. Pay special attention: the path must be prefixed with file:// .

During installation you will be asked to approve the permissions the plugin requires. Confirm at the prompt to complete the installation. After the installation, restart the service for the plugin to take effect.

Next, you can test the plugin with the word segmentation test API (_analyze), as follows:
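
One way to avoid mistakes in the prefix is to build the URL from the source directory. This is a minimal sketch: `plugin_url` is a hypothetical helper name, and the temporary directory only stands in for <ideaseg>. An absolute path already begins with /, so the result has three slashes after `file:`.

```shell
#!/bin/sh
# Build a file:// URL for the plugin archive (hypothetical helper, not part of ideaseg).
plugin_url() {
  # $1: ideaseg source directory, $2: module name (elasticsearch or opensearch)
  abs="$(cd "$1" && pwd)"   # resolve to an absolute path; runs in a subshell
  printf 'file://%s/%s/target/ideaseg.zip\n' "$abs" "$2"
}

# Demonstration with a throwaway directory standing in for <ideaseg>.
src="$(mktemp -d)"
url="$(plugin_url "$src" opensearch)"
echo "$url"
```

The printed value can be passed as the argument to `bin/opensearch-plugin install`. Paths containing spaces or other special characters would need percent-encoding, which this sketch does not handle.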

POST _analyze
{
  "analyzer": "ideaseg",
  "text":     "你好,我用的是 ideaseg 分词插件。"
}
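
The same request could be sent from a terminal with curl. This sketch assumes the service listens on http://localhost:9200 without authentication; a secured installation (for example, OpenSearch with its security plugin enabled) would need https and credentials instead. The `ES_URL` variable is introduced here purely for illustration.

```shell
#!/bin/sh
# Assumption: the node is reachable at this address without authentication.
ES_URL="${ES_URL:-http://localhost:9200}"

# Same request body as the _analyze example above.
payload='{"analyzer": "ideaseg", "text": "你好,我用的是 ideaseg 分词插件。"}'

# -sS: no progress bar, but still print errors.
# --max-time: give up after 5 seconds if the server does not answer.
curl -sS --max-time 5 -X POST "$ES_URL/_analyze" \
  -H 'Content-Type: application/json' \
  -d "$payload" \
  || echo "request failed: is the service running at $ES_URL?" >&2
```

If the plugin is installed and the service has been restarted, the response should be a JSON document listing the tokens produced for the text.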

For details about testing word segmentation, please refer to the official ElasticSearch documentation.

Feedback

If you run into any problems while using ideaseg, please report them through Issues.

Special thanks to

https://github.com/KennFalcon/elasticsearch-analysis-hanlp
