analysis-ik
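English tokenization: a period is merged into the word before it
When ik_smart tokenizes English text, a period (.) ends up in the same token as the word that precedes it. Example: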
GET /_analyze
{
  "analyzer": "ik_smart",
  "text": "In 1997, a group of twenty British women made history. Working "
}
The tokenization result is:
{
  "tokens" : [
    {
      "token" : "1997",
      "start_offset" : 3,
      "end_offset" : 7,
      "type" : "LETTER",
      "position" : 0
    },
    {
      "token" : "group",
      "start_offset" : 11,
      "end_offset" : 16,
      "type" : "ENGLISH",
      "position" : 1
    },
    {
      "token" : "twenty",
      "start_offset" : 20,
      "end_offset" : 26,
      "type" : "ENGLISH",
      "position" : 2
    },
    {
      "token" : "british",
      "start_offset" : 27,
      "end_offset" : 34,
      "type" : "ENGLISH",
      "position" : 3
    },
    {
      "token" : "women",
      "start_offset" : 35,
      "end_offset" : 40,
      "type" : "ENGLISH",
      "position" : 4
    },
    {
      "token" : "made",
      "start_offset" : 41,
      "end_offset" : 45,
      "type" : "ENGLISH",
      "position" : 5
    },
    {
      "token" : "history.",
      "start_offset" : 46,
      "end_offset" : 54,
      "type" : "LETTER",
      "position" : 6
    },
    {
      "token" : "working",
      "start_offset" : 55,
      "end_offset" : 62,
      "type" : "ENGLISH",
      "position" : 7
    }
  ]
}
The period and the word before it, history, were merged into a single token: "history." instead of "history".
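One way to confirm what a token covers is to slice the original text with the reported offsets. The sketch below copies three (token, start_offset, end_offset) triples from the response above; the helper name source_span is made up for illustration.

```python
# The exact text sent to _analyze in the request above.
text = "In 1997, a group of twenty British women made history. Working "

# (token, start_offset, end_offset) triples copied from the response above.
tokens = [
    ("1997", 3, 7),
    ("history.", 46, 54),
    ("working", 55, 62),
]


def source_span(text, start, end):
    """Return the slice of the input that a token's offsets point at."""
    return text[start:end]


for token, start, end in tokens:
    # The emitted token is lowercased; the source span keeps its original case.
    print(f"{token!r} <- {source_span(text, start, end)!r}")
```

The span for offsets 46 to 54 includes the trailing period, while the span for 3 to 7 stops before the comma, so only the period is being kept as part of the token.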
Yes, ik_smart is of little use for English. It essentially splits on whitespace, so history.Working, for example, stays as the single token history.Working.
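Workaround: define a custom analyzer whose char_filter replaces every period with a space before the ik_smart tokenizer runs. Both blocks go under the index's settings.analysis:
"analyzer": {
  "my_ik_smart": {
    "type": "custom",
    "tokenizer": "ik_smart",
    "filter": [
      "stemmer"
    ],
    "char_filter": [
      "dot_char_filter"
    ]
  }
},
"char_filter": {
  "dot_char_filter": {
    "type": "pattern_replace",
    "pattern": "\\.",
    "replacement": " "
  }
}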
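Note that the pattern \. replaces every period, including ones inside version numbers or decimals. The sketch below imitates the character filter with Python's re module to show that side effect, next to a narrower pattern that only touches a period followed by whitespace or the end of the text. The narrower pattern is a suggestion, not something from this thread, and has not been verified against Elasticsearch; rough_tokens is a crude stand-in (lowercase plus whitespace split), not ik_smart.

```python
import re

# Pattern used by dot_char_filter in the workaround: every period.
BROAD = re.compile(r"\.")
# Hypothetical narrower alternative: only a period followed by whitespace
# or the end of the text (a plain lookahead).
NARROW = re.compile(r"\.(?=\s|$)")


def apply_char_filter(pattern, text):
    """Imitate a pattern_replace char filter whose replacement is a space."""
    return pattern.sub(" ", text)


def rough_tokens(text):
    """Crude stand-in for the tokenizer: lowercase, then split on whitespace."""
    return text.lower().split()


sample = "made history. Working on v1.2"
print(rough_tokens(apply_char_filter(BROAD, sample)))
print(rough_tokens(apply_char_filter(NARROW, sample)))
```

With the broad pattern, v1.2 is broken into v1 and 2; with the narrower one it survives while history. still loses its period. If the narrower pattern is used in the index settings, its backslashes must be doubled in the JSON string, as in "\\.(?=\\s|$)".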