elasticsearch-analysis-ansj
Search returns empty results on version 6.5.4
Documents are indexed with index_ansj and searched with query_ansj. The abbreviated mapping configuration is as follows:
"analysis": {
  "analyzer": {
    "my_analyzer": {
      "type": "custom",
      "char_filter": [
        "my_char_filter"
      ],
      "tokenizer": "index_ansj"
    }
  },
  "char_filter": {
    "my_char_filter": {
      "type": "html_strip"
    }
  }
},
"title": {
  "search_analyzer": "query_ansj",
  "fielddata": true,
  "analyzer": "my_analyzer",
  "type": "text"
},
The following query returns no results (the search term is wrapped in quotes):
{
  "query": {
    "bool": {
      "must": [
        {
          "query_string": {
            "default_field": "baseInfo.title",
            "query": "\"全新奥迪\""
          }
        }
      ],
      "must_not": [ ],
      "should": [ ]
    }
  },
  "from": 0,
  "size": 10,
  "sort": [ ],
  "aggs": { }
}
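The quoted term has to reach Elasticsearch with its double quotes intact, so they must be escaped inside the JSON string. Below is a minimal Python sketch of building the same request body, letting `json.dumps` do the escaping instead of hand-writing `\"`. The helper name `quoted_query_body` is made up for illustration and is not part of any library.

```python
import json


def quoted_query_body(field, phrase, size=10):
    """Build a query_string body whose term is wrapped in double quotes."""
    return {
        "query": {
            "bool": {
                "must": [
                    {
                        "query_string": {
                            "default_field": field,
                            # Inner double quotes ask query_string for a phrase search.
                            "query": '"{}"'.format(phrase),
                        }
                    }
                ]
            }
        },
        "from": 0,
        "size": size,
    }


body = quoted_query_body("baseInfo.title", "全新奥迪")
# ensure_ascii=False keeps the Chinese characters readable in the payload.
payload = json.dumps(body, ensure_ascii=False)
```

The resulting `payload` string can be sent as the body of a `_search` request.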
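Without quotes around the search term the search works, but it returns a very large number of results. Not every quoted term fails, though: "谍照曝光", for example, returns results normally. Looking at the default dictionary, it seems that a word whose only part of speech is n cannot be searched with extra characters added before or after it. For example, "谍照表" is tagged n, and the original text in the article is "本田Urban EV谍照表示其车型由概念车的三门版". Searching for "谍照表" returns the article, but "谍照表示" does not. Searching for the two quoted terms "谍照表" + "示", or "谍照表" + "表示", together does return the article.
The same approach returns results on version 5.5.0, but that setup does not define a separate search analyzer and uses the dic_ansj tokenizer for everything. Its mapping is as follows:
"analysis": {
  "analyzer": {
    "my_analyzer": {
      "type": "custom",
      "char_filter": [
        "my_char_filter"
      ],
      "tokenizer": "dic_ansj"
    }
  },
  "char_filter": {
    "my_char_filter": {
      "type": "html_strip"
    }
  }
},
"title": {
  "fielddata": true,
  "analyzer": "my_analyzer",
  "type": "text"
}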
I am not great at explaining things, so please bear with me. Any help is appreciated. Thanks.
Tokens for "本田Urban EV谍照表示其车型由概念车的三门版":
{
  "tokens": [
    ...,
    {
      "token": "谍",
      "start_offset": 10,
      "end_offset": 11,
      "type": "null",
      "position": 4
    },
    {
      "token": "照",
      "start_offset": 11,
      "end_offset": 12,
      "type": "v",
      "position": 5
    },
    {
      "token": "表示",
      "start_offset": 12,
      "end_offset": 14,
      "type": "v",
      "position": 6
    },
    ...
  ]
}
Tokens for "谍照表示":
{
  "tokens": [
    {
      "token": "谍",
      "start_offset": 0,
      "end_offset": 1,
      "type": "null",
      "position": 0
    },
    {
      "token": "照",
      "start_offset": 1,
      "end_offset": 2,
      "type": "v",
      "position": 1
    },
    {
      "token": "表示",
      "start_offset": 2,
      "end_offset": 4,
      "type": "v",
      "position": 2
    }
  ]
}
Searching for "谍照表示" works:
{
  "query": {
    "query_string": {
      "query": "\"谍照表示\"",
      "default_field": "title"
    }
  }
}
For this, I suggest you look at the tokens and at the _termvectors stored in the index.
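One way to make that comparison is to request the analyzer output through the `_analyze` API and the stored positions through the `_termvectors` API, then put them side by side. The Python sketch below only builds the request paths and bodies. The host, index name, mapping type, and document id are placeholders, since the thread does not give them, and the field may need to be `baseInfo.title` depending on the real mapping.

```python
import json
import urllib.request

# Hypothetical values: the thread does not give the host, index name,
# mapping type, or document id.
ES_URL = "http://localhost:9200"
INDEX = "my_index"
DOC_TYPE = "_doc"
DOC_ID = "1"


def analyze_request(analyzer, text):
    """Return (path, body) for the index-scoped _analyze API."""
    return "/{}/_analyze".format(INDEX), {"analyzer": analyzer, "text": text}


def termvectors_request(field):
    """Return (path, body) for the _termvectors API of one stored document."""
    path = "/{}/{}/{}/_termvectors".format(INDEX, DOC_TYPE, DOC_ID)
    body = {"fields": [field], "positions": True, "offsets": True}
    return path, body


def send(path, body):
    """POST a JSON body to the cluster and decode the JSON response."""
    request = urllib.request.Request(
        ES_URL + path,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

Against a running cluster, `send(*analyze_request("query_ansj", "谍照表示"))` would show the query-side tokens, `send(*analyze_request("my_analyzer", "谍照表示"))` the index-side tokens, and `send(*termvectors_request("title"))` the positions actually stored for the document.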
@shi-yuan
Sorry to bother you again after such a long time. I think I have worked out why the search returns no data.
For example, the sentence above is actually split into:
{ "token": "谍", "start_offset": 10, "end_offset": 11, "type": "ng", "position": 6 },
{ "token": "照表", "start_offset": 11, "end_offset": 13, "type": "n", "position": 7 },
{ "token": "照", "start_offset": 11, "end_offset": 12, "type": "v", "position": 8 },
{ "token": "表示", "start_offset": 12, "end_offset": 14, "type": "v", "position": 9 },
{ "token": "表", "start_offset": 12, "end_offset": 13, "type": "n", "position": 10 },
{ "token": "示", "start_offset": 13, "end_offset": 14, "type": "vg", "position": 11 }
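谍照表 matches "position": 6 + "position": 7, which are consecutive, so it returns results. 谍照表示 would have to match "position": 6 + "position": 7 + "position": 11, or some other combination, but that leaves positions 8 to 10 unfilled, so the search finds no match. I believe the cause is this: when the positions of the tokens that make up the phrase are consecutive, the data is found, and when they are not consecutive, the search returns nothing.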
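This hypothesis can be checked with a small simulation of exact phrase matching, where each query term must sit exactly one position after the previous one. The Python sketch below is an illustration only, not Lucene's implementation. The query-side token lists are assumptions: 谍 / 照表 for "谍照表", and 谍 / 照 / 表示 for "谍照表示" (the latter taken from the token output posted earlier in the thread).

```python
# Index-side tokens copied from the comment above: (term, position).
INDEX_TOKENS = [
    ("谍", 6), ("照表", 7), ("照", 8), ("表示", 9), ("表", 10), ("示", 11),
]


def phrase_matches(index_tokens, query_terms):
    """Return True if the query terms occur at consecutive positions, in order."""
    positions = {}
    for term, position in index_tokens:
        positions.setdefault(term, set()).add(position)
    if not query_terms:
        return False
    # Try every occurrence of the first term as the start of the phrase.
    for start in positions.get(query_terms[0], ()):
        if all(start + offset in positions.get(term, ())
               for offset, term in enumerate(query_terms)):
            return True
    return False


# Assumed query-side tokenizations; the thread does not show query_ansj output.
print(phrase_matches(INDEX_TOKENS, ["谍", "照表"]))        # True
print(phrase_matches(INDEX_TOKENS, ["谍", "照", "表示"]))  # False
```

In the second call 照 would have to be at position 7, but the index holds it at position 8 because every overlapping sub-token was given its own position.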
Another example is the phrase 安达保险金融险部相关人士介绍.
I search for "安达保险" in quotes so that the term is not split.
The tokens do contain 安达 and 保险 ("position": 4774 + "position": 4779), but because the positions are not consecutive, "安达保险" cannot be found. Searching the same way for "安达保" ("position": 4775 + "position": 4776), also quoted so that it is not split, does return results.
{ "token": "安达", "start_offset": 3410, "end_offset": 3412, "type": "nz", "position": 4774 },
{ "token": "安", "start_offset": 3410, "end_offset": 3411, "type": "ag", "position": 4775 },
{ "token": "达保", "start_offset": 3411, "end_offset": 3413, "type": "nr", "position": 4776 },
{ "token": "达", "start_offset": 3411, "end_offset": 3412, "type": "v", "position": 4777 },
{ "token": "保险金", "start_offset": 3412, "end_offset": 3415, "type": "n", "position": 4778 },
{ "token": "保险", "start_offset": 3412, "end_offset": 3414, "type": "n", "position": 4779 },
{ "token": "保", "start_offset": 3412, "end_offset": 3413, "type": "v", "position": 4780 },
{ "token": "险", "start_offset": 3413, "end_offset": 3414, "type": "ng", "position": 4781 },
{ "token": "金融", "start_offset": 3414, "end_offset": 3416, "type": "n", "position": 4782 },
{ "token": "金", "start_offset": 3414, "end_offset": 3415, "type": "b", "position": 4783 },
{ "token": "融", "start_offset": 3415, "end_offset": 3416, "type": "vi", "position": 4784 },
{