
Chinese word segmentation problem 【中国大陆中国香港】

Open • knight2008 opened this issue 3 years ago • 1 comment

Neither ik_smart nor ik_max_word can segment 【中国香港】 as a single token, yet both can produce 【中国大陆】 as one token. As a result, a search for 【中国香港】 fails to match any records.

GET / 

{
  "name" : "node3",
  "cluster_name" : "es",
  "cluster_uuid" : "llwb1PwMT-yOdsAZRogoZw",
  "version" : {
    "number" : "6.8.5",
    "build_flavor" : "default",
    "build_type" : "tar",
    "build_hash" : "78990e9",
    "build_date" : "2019-11-13T20:04:24.100411Z",
    "build_snapshot" : false,
    "lucene_version" : "7.7.2",
    "minimum_wire_compatibility_version" : "5.6.0",
    "minimum_index_compatibility_version" : "5.0.0"
  },
  "tagline" : "You Know, for Search"
}
GET /_cat/plugins

node2 analysis-ik 6.8.5
node1 analysis-ik 6.8.5
node3 analysis-ik 6.8.5
POST _analyze

{"text": "中国大陆中国香港", "analyzer":"ik_smart" }
{
  "tokens" : [
    {
      "token" : "中国大陆",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "中国",
      "start_offset" : 4,
      "end_offset" : 6,
      "type" : "CN_WORD",
      "position" : 1
    },
    {
      "token" : "香港",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "CN_WORD",
      "position" : 2
    }
  ]
}
POST _analyze

{"text": "中国大陆中国香港", "analyzer":"ik_max_word" }
{
  "tokens" : [
    {
      "token" : "中国大陆",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "中国",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 1
    },
    {
      "token" : "国大",
      "start_offset" : 1,
      "end_offset" : 3,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "大陆",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 3
    },
    {
      "token" : "陆",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "TYPE_CNUM",
      "position" : 4
    },
    {
      "token" : "中国",
      "start_offset" : 4,
      "end_offset" : 6,
      "type" : "CN_WORD",
      "position" : 5
    },
    {
      "token" : "中",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "COUNT",
      "position" : 6
    },
    {
      "token" : "国",
      "start_offset" : 5,
      "end_offset" : 6,
      "type" : "CN_CHAR",
      "position" : 7
    },
    {
      "token" : "香港",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "CN_WORD",
      "position" : 8
    }
  ]
}

knight2008, Jun 11 '21 07:06

You can add a dictionary entry for it yourself.

zhulongfu, Jun 11 '21 09:06
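
To illustrate why adding a dictionary entry would change the result, here is a minimal toy sketch of dictionary-driven segmentation. It is not IK's actual algorithm: it uses plain greedy forward longest-match, and the `segment` function and both word sets are hypothetical, chosen only to mirror the tokens shown in the ik_smart output above. In the plugin itself, extra words are typically supplied through a custom dictionary file referenced by the `ext_dict` entry in `IKAnalyzer.cfg.xml`.

```python
def segment(text, dictionary):
    """Greedy forward longest-match segmentation (toy model, not IK's real algorithm)."""
    max_len = max(len(word) for word in dictionary)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Assumption: the stock dictionary contains 中国大陆 but not 中国香港.
base_dictionary = {"中国", "大陆", "香港", "中国大陆"}
# Assumption: the user adds 中国香港 as a custom entry.
extended_dictionary = base_dictionary | {"中国香港"}

before = segment("中国大陆中国香港", base_dictionary)
after = segment("中国大陆中国香港", extended_dictionary)
print(before)
print(after)
```

With the base set, the second half of the text falls back to 中国 and 香港 as separate tokens. Once 中国香港 is present in the dictionary, the same longest-match pass emits it as a single token. Documents indexed before the dictionary change would likely need to be reindexed for searches to benefit.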