elasticsearch-analysis-ansj

Search returns empty results on version 6.5.4

Open lht1221 opened this issue 6 years ago • 4 comments

Indexing uses index_ansj and searching uses query_ansj. The simplified mapping is:

"analysis": {
    "analyzer": {
        "my_analyzer": {
            "type": "custom",
            "char_filter": [
                "my_char_filter"
            ],
            "tokenizer": "index_ansj"
        }
    },
    "char_filter": {
        "my_char_filter": {
            "type": "html_strip"
        }
    }
},
"title": {
    "search_analyzer": "query_ansj",
    "fielddata": true,
    "analyzer": "my_analyzer",
    "type": "text"
},

The following query returns no results (the search term is wrapped in quotes):

{
    "query": {
        "bool": {
            "must": [
                {
                    "query_string": {
                        "default_field": "baseInfo.title",
                        "query": "\"全新奥迪\""
                    }
                }
            ],
            "must_not": [ ],
            "should": [ ]
        }
    },
    "from": 0,
    "size": 10,
    "sort": [ ],
    "aggs": { }
}
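
The quotes around the search term are part of the query_string syntax, so they have to be escaped inside the JSON string value. A minimal Python sketch that builds the same request body and lets json.dumps do the escaping; the helper name phrase_query_body is made up for illustration and is not part of any client library:

```python
import json


def phrase_query_body(field, phrase, size=10):
    """Build a query_string body that searches `phrase` as a quoted phrase."""
    # The inner double quotes belong to the query_string syntax.
    # json.dumps escapes them, so the serialized body stays valid JSON.
    return {
        "query": {
            "bool": {
                "must": [
                    {
                        "query_string": {
                            "default_field": field,
                            "query": f'"{phrase}"',
                        }
                    }
                ]
            }
        },
        "from": 0,
        "size": size,
    }


body = phrase_query_body("baseInfo.title", "全新奥迪")
encoded = json.dumps(body, ensure_ascii=False, indent=2)
print(encoded)
```

In the printed body the query value appears as "\"全新奥迪\"", which is the form the request needs.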

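Without the quotes the search works, but it returns a very large number of results. Not every quoted term fails, though: terms such as "谍照曝光" return results normally.

From looking at the default dictionary, it seems that a word whose only part of speech is n cannot be searched with other characters attached before or after it. For example, "谍照表" has part of speech n, and the original text in the article is "本田Urban EV谍照表示其车型由概念车的三门版". Searching for "谍照表" returns the article, but "谍照表示" returns nothing. Searching with two terms at once, either "谍照表" + "示" or "谍照表" + "表示", does find the article.

The same approach returns results on version 5.5.0, but that setup did not define a separate search analyzer; everything used the dic_ansj tokenizer. Its mapping is:

"analysis": {
    "analyzer": {
        "my_analyzer": {
            "type": "custom",
            "char_filter": [
                "my_char_filter"
            ],
            "tokenizer": "dic_ansj"
        }
    },
    "char_filter": {
        "my_char_filter": {
            "type": "html_strip"
        }
    }
},
"title": {
    "fielddata": true,
    "analyzer": "my_analyzer",
    "type": "text"
}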

My ability to explain this is limited, so please bear with me and help if you can. Thanks.

lht1221, Aug 23 '19 09:08

Tokens for "本田Urban EV谍照表示其车型由概念车的三门版":

{
    "tokens": [
       ...,
        {
            "token": "谍",
            "start_offset": 10,
            "end_offset": 11,
            "type": "null",
            "position": 4
        },
        {
            "token": "照",
            "start_offset": 11,
            "end_offset": 12,
            "type": "v",
            "position": 5
        },
        {
            "token": "表示",
            "start_offset": 12,
            "end_offset": 14,
            "type": "v",
            "position": 6
        },
        ...
    ]
}

Tokens for "谍照表示":

{
    "tokens": [
        {
            "token": "谍",
            "start_offset": 0,
            "end_offset": 1,
            "type": "null",
            "position": 0
        },
        {
            "token": "照",
            "start_offset": 1,
            "end_offset": 2,
            "type": "v",
            "position": 1
        },
        {
            "token": "表示",
            "start_offset": 2,
            "end_offset": 4,
            "type": "v",
            "position": 2
        }
    ]
}

shi-yuan, Aug 25 '19 02:08

Searching for "谍照表示" works:

{
  "query": {
    "query_string": {
      "query": "\"谍照表示\"",
      "default_field": "title"
    }
  }
}

shi-yuan, Aug 25 '19 02:08

For this, I suggest you compare the tokens with the _termvectors stored in the index.
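
One way to follow this suggestion is to ask for the tokens on both the index side and the query side, and then for the term vectors of a stored document. The sketch below only builds the requests and does not send them. The index name my_index, the type _doc, and the document id 1 are placeholders, not values from this issue, and the _termvectors path assumes the Elasticsearch 6.x layout where the type is part of the URL:

```python
import json


def analyze_request(index, analyzer, text):
    """Request for the _analyze API with a named analyzer of `index`."""
    return "GET", f"/{index}/_analyze", {"analyzer": analyzer, "text": text}


def termvectors_request(index, doc_type, doc_id, field):
    """Request for the _termvectors API of one stored document (6.x path)."""
    path = f"/{index}/{doc_type}/{doc_id}/_termvectors"
    body = {"fields": [field], "positions": True, "offsets": True}
    return "GET", path, body


# Placeholder names: adjust index, type, and id to the real cluster.
requests_to_run = [
    analyze_request("my_index", "my_analyzer", "本田Urban EV谍照表示其车型由概念车的三门版"),
    analyze_request("my_index", "query_ansj", "谍照表示"),
    termvectors_request("my_index", "_doc", "1", "title"),
]

for method, path, body in requests_to_run:
    print(method, path)
    print(json.dumps(body, ensure_ascii=False))
```

Comparing the positions in the first and third responses with the tokens in the second shows whether the query tokens can line up with the indexed ones.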

shi-yuan, Aug 25 '19 02:08

@shi-yuan Sorry to bother you again after so long. I think I have worked out why the search finds no data. The sentence above is actually tokenized as:

    { "token": "谍", "start_offset": 10, "end_offset": 11, "type": "ng", "position": 6 },
    { "token": "照表", "start_offset": 11, "end_offset": 13, "type": "n", "position": 7 },
    { "token": "照", "start_offset": 11, "end_offset": 12, "type": "v", "position": 8 },
    { "token": "表示", "start_offset": 12, "end_offset": 14, "type": "v", "position": 9 },
    { "token": "表", "start_offset": 12, "end_offset": 13, "type": "n", "position": 10 },
    { "token": "示", "start_offset": 13, "end_offset": 14, "type": "vg", "position": 11 }

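"谍照表" matches position 6 + position 7, which are consecutive, so it returns results. "谍照表示" matches position 6 + position 7 + position 11 (or some other combination), but positions 8 to 10 sit in between, so the search finds no match. I think the cause is this: when the positions of the tokens that make up the phrase are consecutive, the search finds data; when they are not, the result is empty.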
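
The reasoning about positions can be checked with a small model of an exact phrase match, where the k-th query term must sit k positions after the first one. This is a simplified sketch, not Lucene's actual implementation, and the way the quoted term is split into query terms is an assumption made for illustration:

```python
def phrase_matches(index_tokens, query_terms):
    """Return True if query_terms occur at consecutive index positions.

    Simplified model of a phrase match without slop: the k-th query term
    must be found at position start + k for some start.
    """
    if not query_terms:
        return False
    positions = {}
    for term, pos in index_tokens:
        positions.setdefault(term, set()).add(pos)
    return any(
        all(start + k in positions.get(term, set())
            for k, term in enumerate(query_terms))
        for start in positions.get(query_terms[0], set())
    )


# (token, position) pairs copied from the tokenization listed in this comment.
indexed = [("谍", 6), ("照表", 7), ("照", 8), ("表示", 9), ("表", 10), ("示", 11)]

print(phrase_matches(indexed, ["谍", "照表"]))        # True
print(phrase_matches(indexed, ["谍", "照表", "示"]))  # False
print(phrase_matches(indexed, ["谍", "照", "表示"]))  # False
```

Under this model, any split of "谍照表示" has to skip at least one indexed position, which would be consistent with the empty result.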

Another example is the short sentence 安达保险金融险部相关人士介绍. I searched for "安达保险" in quotes so that the term is not split. The tokens do contain 安达保险 (position 4774 + position 4778), but because the positions are not consecutive, "安达保险" cannot be found. Searching the same way for "安达保" (position 4775 + position 4776) does return results.

    { "token": "安达", "start_offset": 3410, "end_offset": 3412, "type": "nz", "position": 4774 },
    { "token": "安", "start_offset": 3410, "end_offset": 3411, "type": "ag", "position": 4775 },
    { "token": "达保", "start_offset": 3411, "end_offset": 3413, "type": "nr", "position": 4776 },
    { "token": "达", "start_offset": 3411, "end_offset": 3412, "type": "v", "position": 4777 },
    { "token": "保险金", "start_offset": 3412, "end_offset": 3415, "type": "n", "position": 4778 },
    { "token": "保险", "start_offset": 3412, "end_offset": 3414, "type": "n", "position": 4779 },
    { "token": "保", "start_offset": 3412, "end_offset": 3413, "type": "v", "position": 4780 },
    { "token": "险", "start_offset": 3413, "end_offset": 3414, "type": "ng", "position": 4781 },
    { "token": "金融", "start_offset": 3414, "end_offset": 3416, "type": "n", "position": 4782 },
    { "token": "金", "start_offset": 3414, "end_offset": 3415, "type": "b", "position": 4783 },
    { "token": "融", "start_offset": 3415, "end_offset": 3416, "type": "vi", "position": 4784 },
    ...
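
To see every way a term can be assembled from these overlapping tokens, the offsets can be used to enumerate all token chains that tile the term's character range, along with the positions each chain would need. A rough sketch over the first eight tokens listed in this comment; the function names are made up for illustration:

```python
def segmentations(tokens, start, end):
    """Yield every chain of tokens that tiles the offset range [start, end)."""
    if start == end:
        yield []
        return
    for tok in tokens:
        if tok["start_offset"] == start and tok["end_offset"] <= end:
            for rest in segmentations(tokens, tok["end_offset"], end):
                yield [tok] + rest


def is_consecutive(chain):
    """True if each token sits exactly one position after the previous one."""
    return all(b["position"] - a["position"] == 1
               for a, b in zip(chain, chain[1:]))


# (token, start_offset, end_offset, position) copied from the listing.
RAW = [
    ("安达", 3410, 3412, 4774), ("安", 3410, 3411, 4775),
    ("达保", 3411, 3413, 4776), ("达", 3411, 3412, 4777),
    ("保险金", 3412, 3415, 4778), ("保险", 3412, 3414, 4779),
    ("保", 3412, 3413, 4780), ("险", 3413, 3414, 4781),
]
TOKENS = [dict(token=t, start_offset=s, end_offset=e, position=p)
          for t, s, e, p in RAW]

for phrase, start, end in [("安达保险", 3410, 3414), ("安达保", 3410, 3413)]:
    for chain in segmentations(TOKENS, start, end):
        terms = "+".join(t["token"] for t in chain)
        positions = [t["position"] for t in chain]
        label = "consecutive" if is_consecutive(chain) else "gap"
        print(phrase, terms, positions, label)
```

With this data, none of the chains for 安达保险 has consecutive positions, while 安+达保 (4775, 4776) is the one consecutive chain for 安达保, which agrees with the search behavior described here.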

lht1221, Apr 08 '20 04:04