analysis-ik: When tokenizing quantifiers, the two IK analyzers produce different results and searches return nothing. How should this be handled?

Open hbpeng opened this issue 5 years ago • 4 comments

Take the term "10万" as an example. ik_smart tokenizes it as:

{
  "tokens": [
    {
      "token": "10万",
      "start_offset": 0,
      "end_offset": 3,
      "type": "TYPE_CNUM",
      "position": 0
    }
  ]
}

whereas ik_max_word tokenizes it as:

{
  "tokens": [
    {
      "token": "10",
      "start_offset": 0,
      "end_offset": 2,
      "type": "ARABIC",
      "position": 0
    },
    {
      "token": "万",
      "start_offset": 2,
      "end_offset": 3,
      "type": "TYPE_CNUM",
      "position": 1
    }
  ]
}

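The recommended practice for IK is to index with ik_max_word and search with ik_smart. That leads to the following situation: when a document is indexed, Lucene builds inverted-index entries for "10" and "万"; a search for "10万" then returns nothing, because the inverted index contains no term "10万". Adding the single character "万" to the custom dictionary has no effect either. How can this be solved?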
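To make the mismatch concrete, here is a minimal toy model in Python. It is not Lucene or Elasticsearch code: `build_inverted_index` and `search_all_terms` are hypothetical helpers, and the token lists are copied by hand from the two outputs above.

```python
from collections import defaultdict

# Token lists copied from the two analyzer outputs above (assumption: one
# document whose whole text is "10万").
IK_MAX_WORD_TOKENS = ["10", "万"]   # index-time analysis of "10万"
IK_SMART_TOKENS = ["10万"]          # search-time analysis of "10万"


def build_inverted_index(docs):
    """Map each indexed term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, tokens in docs.items():
        for token in tokens:
            index[token].add(doc_id)
    return index


def search_all_terms(index, query_tokens):
    """Return the ids of documents that contain every query term."""
    postings = [index.get(term, set()) for term in query_tokens]
    return set.intersection(*postings) if postings else set()


index = build_inverted_index({1: IK_MAX_WORD_TOKENS})

# The search-time term "10万" was never written to the index, so nothing matches.
print(search_all_terms(index, IK_SMART_TOKENS))     # set()

# Searching with the same tokens that were indexed does match.
print(search_all_terms(index, IK_MAX_WORD_TOKENS))  # {1}
```

The lookup only succeeds when the search-time terms are a subset of the terms produced at index time, which is exactly what breaks here.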

Jul 12 '19 09:07 hbpeng

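If ik_smart is used as the analyzer, terms will be missing, which from a full-text search perspective can be expected to affect search results significantly. I tried this with IK v7.3.0: ik_smart does emit 10万 as a single term, while ik_max_word does not emit 10万 as a term at all. My workaround is to use query_string: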

{
  "query": {
    "query_string": {
      "default_field": "content",
      "query": "(10) AND (万)"
    }
  }
}

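In my tests, documents in which other terms appear between 10 and 万 score lower than documents in which 10万 appears as an adjacent pair. This approach is for reference only; I have not found an optimal solution yet.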
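If the terms come from somewhere else (for example, from the ik_max_word token output), a request body of the same shape could be assembled like this. It is only a sketch: `build_and_query` is a hypothetical helper, the field name `content` is taken from the query above, and the code does not escape characters that query_string treats as reserved.

```python
import json


def build_and_query(field, terms):
    """Build a query_string body that requires every term, e.g. "(10) AND (万)"."""
    # Assumption: terms contain no query_string reserved characters.
    query = " AND ".join(f"({term})" for term in terms)
    return {
        "query": {
            "query_string": {
                "default_field": field,
                "query": query,
            }
        }
    }


body = build_and_query("content", ["10", "万"])
# ensure_ascii=False keeps "万" readable instead of a \u escape.
print(json.dumps(body, ensure_ascii=False, indent=2))
```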

Nov 07 '19 07:11 freesinger

@hbpeng How did you solve this?

Sep 02 '20 03:09 edcmartin

If the ik_max_word tokens and the ik_smart tokens could be merged, that should solve this problem.
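As a rough illustration of what "merged" could mean, the sketch below takes the union of the two token lists, deduplicated by offsets and text. It is not an existing IK feature: `merge_tokens` is a hypothetical helper operating on token dictionaries shaped like the outputs in the original post, and how such a merged stream would be wired into Elasticsearch is left open.

```python
def merge_tokens(*token_lists):
    """Union several token lists, dropping exact duplicates (same span and text)."""
    seen = set()
    merged = []
    for tokens in token_lists:
        for tok in tokens:
            key = (tok["start_offset"], tok["end_offset"], tok["token"])
            if key not in seen:
                seen.add(key)
                merged.append(tok)
    # Order by start offset; at the same start, put the longer token first.
    merged.sort(key=lambda t: (t["start_offset"], -t["end_offset"]))
    return merged


# Token data copied from the ik_smart and ik_max_word outputs in the original post.
smart = [{"token": "10万", "start_offset": 0, "end_offset": 3}]
max_word = [
    {"token": "10", "start_offset": 0, "end_offset": 2},
    {"token": "万", "start_offset": 2, "end_offset": 3},
]

print([t["token"] for t in merge_tokens(smart, max_word)])  # ['10万', '10', '万']
```

With all three terms indexed, a search-time term of either "10万" or "10" plus "万" would find an entry in the inverted index.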

Jan 05 '21 07:01 JimChen

Clear the quantifier dictionary quantifier.dic; after that, ik_smart tokenizes text containing quantifiers the same way ik_max_word does.

Sep 13 '21 05:09 zhengq1
