analysis-ik
Chinese word segmentation problem: 中国大陆中国香港
Neither ik_smart nor ik_max_word produces 中国香港 as a single token, yet both produce 中国大陆 as one. As a result, searching for 中国香港 fails to match any records.
GET /
{
"name" : "node3",
"cluster_name" : "es",
"cluster_uuid" : "llwb1PwMT-yOdsAZRogoZw",
"version" : {
"number" : "6.8.5",
"build_flavor" : "default",
"build_type" : "tar",
"build_hash" : "78990e9",
"build_date" : "2019-11-13T20:04:24.100411Z",
"build_snapshot" : false,
"lucene_version" : "7.7.2",
"minimum_wire_compatibility_version" : "5.6.0",
"minimum_index_compatibility_version" : "5.0.0"
},
"tagline" : "You Know, for Search"
}
GET /_cat/plugins
node2 analysis-ik 6.8.5
node1 analysis-ik 6.8.5
node3 analysis-ik 6.8.5
POST _analyze
{"text": "中国大陆中国香港", "analyzer":"ik_smart" }
{
"tokens" : [
{
"token" : "中国大陆",
"start_offset" : 0,
"end_offset" : 4,
"type" : "CN_WORD",
"position" : 0
},
{
"token" : "中国",
"start_offset" : 4,
"end_offset" : 6,
"type" : "CN_WORD",
"position" : 1
},
{
"token" : "香港",
"start_offset" : 6,
"end_offset" : 8,
"type" : "CN_WORD",
"position" : 2
}
]
}
POST _analyze
{"text": "中国大陆中国香港", "analyzer":"ik_max_word" }
{
"tokens" : [
{
"token" : "中国大陆",
"start_offset" : 0,
"end_offset" : 4,
"type" : "CN_WORD",
"position" : 0
},
{
"token" : "中国",
"start_offset" : 0,
"end_offset" : 2,
"type" : "CN_WORD",
"position" : 1
},
{
"token" : "国大",
"start_offset" : 1,
"end_offset" : 3,
"type" : "CN_WORD",
"position" : 2
},
{
"token" : "大陆",
"start_offset" : 2,
"end_offset" : 4,
"type" : "CN_WORD",
"position" : 3
},
{
"token" : "陆",
"start_offset" : 3,
"end_offset" : 4,
"type" : "TYPE_CNUM",
"position" : 4
},
{
"token" : "中国",
"start_offset" : 4,
"end_offset" : 6,
"type" : "CN_WORD",
"position" : 5
},
{
"token" : "中",
"start_offset" : 4,
"end_offset" : 5,
"type" : "COUNT",
"position" : 6
},
{
"token" : "国",
"start_offset" : 5,
"end_offset" : 6,
"type" : "CN_CHAR",
"position" : 7
},
{
"token" : "香港",
"start_offset" : 6,
"end_offset" : 8,
"type" : "CN_WORD",
"position" : 8
}
]
}
You can add the entry to the dictionary yourself.
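To show why adding an entry changes the result, here is a minimal Python sketch of dictionary-driven segmentation using greedy forward maximum matching. This is a simplification for illustration and not IK's actual algorithm; the `segment` function and the toy word list are assumptions and do not reflect IK's real main dictionary. In the plugin itself, custom words normally go into an extension dictionary file (one word per line, UTF-8) referenced by the `ext_dict` entry in `IKAnalyzer.cfg.xml`.

```python
def segment(text, dictionary):
    """Greedy forward maximum matching: at each position, take the longest dictionary word."""
    max_len = max((len(w) for w in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a dictionary word matches.
        # A single character is always accepted so the loop is guaranteed to advance.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


# Toy word list (hypothetical, for illustration only).
words = {"中国", "中国大陆", "大陆", "香港"}
text = "中国大陆中国香港"

before = segment(text, words)                 # 中国香港 is not a known word
after = segment(text, words | {"中国香港"})   # after adding the custom entry

print(before)
print(after)
```

With the toy word list, the first call splits the tail into 中国 and 香港, which mirrors the ik_smart output reported above; once 中国香港 is added, it comes out as a single token. After changing the real dictionary, rerunning the `_analyze` requests above is a quick way to check the effect. Documents indexed earlier would presumably need reindexing before the new token becomes searchable.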