Adding Chinese Word Segmentation to Elasticsearch

Author: 小梦  Source: web  Date: 2024-05-05

Installing the IK analysis plugin

Download the project from GitHub (I downloaded it to /tmp) and unzip it

cd /tmp
wget https://github.com/medcl/elasticsearch-analysis-ik/archive/master.zip
unzip master.zip

Change into the elasticsearch-analysis-ik-master directory created by unzip

cd elasticsearch-analysis-ik-master/

Then build the jar, elasticsearch-analysis-ik-1.4.0.jar, with mvn. The build may need several attempts before it succeeds

mvn package
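
Because the build may need several attempts, a small retry wrapper can save retyping the command. The following is only a sketch: retry_cmd and the RETRY_DELAY variable are names invented for this example and are not part of Maven.

```shell
# Retry a command up to MAX times, pausing RETRY_DELAY seconds (default 5)
# between attempts. retry_cmd and RETRY_DELAY are illustrative names.
retry_cmd() {
    max=$1
    shift
    attempt=1
    until "$@"; do
        if [ "$attempt" -ge "$max" ]; then
            echo "giving up after $attempt attempts: $*" >&2
            return 1
        fi
        attempt=$((attempt + 1))
        sleep "${RETRY_DELAY:-5}"
    done
    return 0
}
```

It could be invoked as: retry_cmd 3 mvn package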

Incidentally, the mvn command requires Maven to be installed. On Ubuntu, it can be installed with the following commands

apt-cache search maven
sudo apt-get install maven
mvn -version

Copy the ik folder under elasticsearch-analysis-ik-master/ to ${ES_HOME}/config/

Copy elasticsearch-analysis-ik-1.4.0.jar from elasticsearch-analysis-ik-master/target to ${ES_HOME}/lib
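
The two copy steps above can be sketched as a single shell function. This is a minimal sketch that assumes the paths are exactly as described in the text and that ${ES_HOME}/config and ${ES_HOME}/lib already exist; install_ik_files is a name made up for this example.

```shell
# Copy the IK dictionary folder and the built jar into an Elasticsearch home.
# $1: the unpacked elasticsearch-analysis-ik-master directory
# $2: ES_HOME (its config/ and lib/ directories are assumed to exist)
# install_ik_files is an illustrative helper, not part of the plugin.
install_ik_files() {
    src=$1
    es_home=$2
    cp -r "$src/ik" "$es_home/config/" &&
        cp "$src/target/elasticsearch-analysis-ik-1.4.0.jar" "$es_home/lib/"
}
```

It could be invoked as: install_ik_files /tmp/elasticsearch-analysis-ik-master "$ES_HOME"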

Add the ik configuration to the end of elasticsearch.yml, the configuration file under ${ES_HOME}/config/

index:
  analysis:
    analyzer:
      ik:
        alias: [ik_analyzer]
        type: org.elasticsearch.index.analysis.IkAnalyzerProvider
      ik_max_word:
        type: ik
        use_smart: false
      ik_smart:
        type: ik
        use_smart: true
index.analysis.analyzer.default.type: ik

In addition, httpclient-4.3.5.jar and httpcore-4.3.2.jar must be placed in ${ES_HOME}/lib
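
Before restarting Elasticsearch, it may help to confirm that everything listed so far is in place. The sketch below only checks that the paths named in this article exist; check_ik_install is a hypothetical helper written for this example.

```shell
# Report any file or directory from the installation steps that is missing
# under the given ES_HOME. Returns 0 when everything is present, 1 otherwise.
# check_ik_install is an illustrative helper, not part of Elasticsearch.
check_ik_install() {
    es_home=$1
    missing=0
    for path in \
        config/ik \
        lib/elasticsearch-analysis-ik-1.4.0.jar \
        lib/httpclient-4.3.5.jar \
        lib/httpcore-4.3.2.jar
    do
        if [ ! -e "$es_home/$path" ]; then
            echo "missing: $es_home/$path"
            missing=1
        fi
    done
    return "$missing"
}
```

It could be invoked as: check_ik_install "$ES_HOME"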

Testing the IK analyzer

Create an index named index

curl -XPUT http://localhost:9200/index

Create a mapping for the index named index

curl -XPOST http://localhost:9200/index/fulltext/_mapping -d '
{
    "fulltext": {
        "_all": {
            "analyzer": "ik"
        },
        "properties": {
            "content": {
                "type": "string",
                "boost": 8.0,
                "term_vector": "with_positions_offsets",
                "analyzer": "ik",
                "include_in_all": true
            }
        }
    }
}'

Test

curl 'http://localhost:9200/index/_analyze?analyzer=ik&pretty=true' -d '
{
"text":"世界如此之大"
}'

The request returns the following. Note that the first token, text, is the JSON key itself: the whole request body is analyzed as raw text here, so the offsets count from the start of the body rather than from the start of the sentence.

{
  "tokens" : [ {
    "token" : "text",
    "start_offset" : 4,
    "end_offset" : 8,
    "type" : "ENGLISH",
    "position" : 1
  }, {
    "token" : "世界",
    "start_offset" : 11,
    "end_offset" : 13,
    "type" : "CN_WORD",
    "position" : 2
  }, {
    "token" : "如此之",
    "start_offset" : 13,
    "end_offset" : 16,
    "type" : "CN_WORD",
    "position" : 3
  }, {
    "token" : "如此",
    "start_offset" : 13,
    "end_offset" : 15,
    "type" : "CN_WORD",
    "position" : 4
  }, {
    "token" : "之大",
    "start_offset" : 15,
    "end_offset" : 17,
    "type" : "CN_WORD",
    "position" : 5
  } ]
}
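
To compare segmentation results at a glance, the token strings can be pulled out of the response with grep and sed. This is a rough sketch that depends on the "token" : "..." layout of the pretty-printed response and is not a general JSON parser; extract_tokens is a name invented for this example.

```shell
# Print one token per line from an _analyze response read on stdin.
# Relies on the '"token" : "..."' layout produced with pretty=true.
# extract_tokens is an illustrative helper, not part of Elasticsearch.
extract_tokens() {
    grep -o '"token" : "[^"]*"' | sed 's/^"token" : "\(.*\)"$/\1/'
}
```

The output of the curl test command can be piped into it by adding -s to curl and appending: | extract_tokens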