analysis-ik

English tokenization: the period is grouped into the same token as the word before it

Open fengsmith opened this issue 3 years ago • 2 comments

When tokenizing English text with ik_smart, the English period "." is grouped into the same token as the word before it. Example:

GET /_analyze
{
  "analyzer": "ik_smart",
  "text": "In 1997, a group of twenty British women made history. Working "
}

The tokenization result is:

{
  "tokens" : [
    {
      "token" : "1997",
      "start_offset" : 3,
      "end_offset" : 7,
      "type" : "LETTER",
      "position" : 0
    },
    {
      "token" : "group",
      "start_offset" : 11,
      "end_offset" : 16,
      "type" : "ENGLISH",
      "position" : 1
    },
    {
      "token" : "twenty",
      "start_offset" : 20,
      "end_offset" : 26,
      "type" : "ENGLISH",
      "position" : 2
    },
    {
      "token" : "british",
      "start_offset" : 27,
      "end_offset" : 34,
      "type" : "ENGLISH",
      "position" : 3
    },
    {
      "token" : "women",
      "start_offset" : 35,
      "end_offset" : 40,
      "type" : "ENGLISH",
      "position" : 4
    },
    {
      "token" : "made",
      "start_offset" : 41,
      "end_offset" : 45,
      "type" : "ENGLISH",
      "position" : 5
    },
    {
      "token" : "history.",
      "start_offset" : 46,
      "end_offset" : 54,
      "type" : "LETTER",
      "position" : 6
    },
    {
      "token" : "working",
      "start_offset" : 55,
      "end_offset" : 62,
      "type" : "ENGLISH",
      "position" : 7
    }
  ]
}

The period was merged with the preceding word "history", producing the single token "history." (period included).

fengsmith commented on Apr 08 '21 08:04

Yes, ik_smart is really of little use for English. It only splits on whitespace, so for example history.Working comes out as the single token history.Working.

xwlcn commented on May 01 '21 06:05

Workaround: define a custom analyzer that replaces periods with spaces before tokenizing:

"analyzer": {
  "my_ik_smart": {
    "type": "custom",
    "tokenizer": "ik_smart",
    "filter": [
      "stemmer"
    ],
    "char_filter": [
      "dot_char_filter"
    ]
  }
},
"char_filter": {
  "dot_char_filter": {
    "type": "pattern_replace",
    "pattern": "\\.",
    "replacement": " "
  }
}

xwlcn commented on May 01 '21 09:05
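
For anyone trying the workaround above, here is a minimal sketch of where the analyzer and char filter definitions might sit in a full index-creation request body (sent as `PUT /my_index`). The index name `my_index` is a hypothetical placeholder, not something from the thread. The definitions themselves are taken from the comment above, and the IK plugin is assumed to be installed.

```json
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_ik_smart": {
          "type": "custom",
          "tokenizer": "ik_smart",
          "char_filter": ["dot_char_filter"],
          "filter": ["stemmer"]
        }
      },
      "char_filter": {
        "dot_char_filter": {
          "type": "pattern_replace",
          "pattern": "\\.",
          "replacement": " "
        }
      }
    }
  }
}
```

The result can then be checked with `GET /my_index/_analyze`, passing `"analyzer": "my_ik_smart"` and the sample text from the original report. Two things to keep in mind: the pattern `\\.` replaces every period, including those inside decimals such as 3.14, and the `stemmer` filter changes the emitted tokens independently of the period handling, so it can be dropped if stemming is not wanted.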