analysis-pinyin
Pinyin highlighting not working on 6.6
@medcl I've tried several times without success, so I'm here to ask for help.
ES version 6.6, plugin version 6.6. Symptoms as follows: https://github.com/medcl/elasticsearch-analysis-pinyin/issues/82 — I followed the steps in that answer, but the result is still not highlighted as expected.
PUT index333
{
  "mappings": {
    "type": {
      "properties": {
        "name": {
          "type": "text",
          "analyzer": "pinyin"
        }
      }
    }
  }
}
POST index333/type/1
{
"name":"教程123测试数据教程长度可能比较长1231546"
}
GET index333/_search
{
  "query": {
    "match": {
      "name": "jiaocheng"
    }
  },
  "highlight": {
    "fields": {
      "name": {
        "fragment_size": 20
      }
    }
  }
}
The result is as follows:
{
"took" : 0,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 1,
"max_score" : 0.8593601,
"hits" : [
{
"_index" : "index333",
"_type" : "type",
"_id" : "1",
"_score" : 0.8593601,
"_source" : {
"name" : "教程123测试数据教程长度可能比较长1231546"
},
"highlight" : {
"name" : [
"教程123测试数据教程长度可能比较长1231546"
]
}
}
]
}
}
May I ask why there is no highlighting on my side?
Offsets are now ignored by default; you need to enable them for highlighting to work.
DELETE index333
PUT index333
{
  "settings": {
    "analysis": {
      "analyzer": {
        "pinyin": {
          "type": "pinyin",
          "ignore_pinyin_offset": false
        }
      }
    }
  },
  "mappings": {
    "type": {
      "properties": {
        "name": {
          "type": "text",
          "analyzer": "pinyin"
        }
      }
    }
  }
}
GET index333/_search
{
"query": {
"match": {
"name": "jiaocheng"
}
},
"highlight": {
"fields": {
"name": {}
}
}
}
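After recreating the index with `ignore_pinyin_offset` set to `false`, the analyzer should emit real offsets again, which is what the highlighter needs. As a quick sanity check (a sketch, not output from this thread), you can inspect the tokens directly:

```
GET index333/_analyze
{
  "analyzer": "pinyin",
  "text": ["教程123"]
}
```

If the returned tokens carry distinct start_offset/end_offset values instead of all-zero offsets, highlighting should work.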
@medcl Thanks, the previous problem is solved, but now a new one has come up:
With spaces and punctuation:
GET /test_index/_analyze
{
"text": ["I have a dream."],
"analyzer": "pinyin"
}
Result:
{
"tokens" : [
{
"token" : "i",
"start_offset" : 4,
"end_offset" : 5,
"type" : "word",
"position" : 0
},
{
"token" : "ha",
"start_offset" : 5,
"end_offset" : 7,
"type" : "word",
"position" : 1
},
{
"token" : "v",
"start_offset" : 7,
"end_offset" : 8,
"type" : "word",
"position" : 2
},
{
"token" : "e",
"start_offset" : 8,
"end_offset" : 9,
"type" : "word",
"position" : 3
},
{
"token" : "a",
"start_offset" : 9,
"end_offset" : 10,
"type" : "word",
"position" : 4
},
{
"token" : "d",
"start_offset" : 10,
"end_offset" : 11,
"type" : "word",
"position" : 5
},
{
"token" : "re",
"start_offset" : 11,
"end_offset" : 13,
"type" : "word",
"position" : 6
},
{
"token" : "a",
"start_offset" : 13,
"end_offset" : 14,
"type" : "word",
"position" : 7
},
{
"token" : "m",
"start_offset" : 14,
"end_offset" : 15,
"type" : "word",
"position" : 8
}
]
}
Without spaces or punctuation:
GET /test_index/_analyze
{
"text": ["Ihaveadream"],
"analyzer": "pinyin"
}
Result:
{
"tokens" : [
{
"token" : "i",
"start_offset" : 0,
"end_offset" : 1,
"type" : "word",
"position" : 0
},
{
"token" : "ha",
"start_offset" : 1,
"end_offset" : 3,
"type" : "word",
"position" : 1
},
{
"token" : "v",
"start_offset" : 3,
"end_offset" : 4,
"type" : "word",
"position" : 2
},
{
"token" : "e",
"start_offset" : 4,
"end_offset" : 5,
"type" : "word",
"position" : 3
},
{
"token" : "a",
"start_offset" : 5,
"end_offset" : 6,
"type" : "word",
"position" : 4
},
{
"token" : "d",
"start_offset" : 6,
"end_offset" : 7,
"type" : "word",
"position" : 5
},
{
"token" : "re",
"start_offset" : 7,
"end_offset" : 9,
"type" : "word",
"position" : 6
},
{
"token" : "a",
"start_offset" : 9,
"end_offset" : 10,
"type" : "word",
"position" : 7
},
{
"token" : "m",
"start_offset" : 10,
"end_offset" : 11,
"type" : "word",
"position" : 8
}
]
}
How should this offset problem be solved?
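The i/ha/v/e fragments come from the tokenizer trying to split runs of non-Chinese letters into pinyin syllables. If you don't need to search English text by pinyin syllables, one option documented in the plugin README is `none_chinese_pinyin_tokenize`, which defaults to true; turning it off should keep non-Chinese tokens whole. A sketch (index and analyzer names are placeholders, and behavior may vary by plugin version):

```
PUT test_index2
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_pinyin": {
          "type": "pinyin",
          "none_chinese_pinyin_tokenize": false,
          "ignore_pinyin_offset": false
        }
      }
    }
  }
}
```

With this analyzer, "I have a dream." should come out as whole English word tokens rather than pinyin-syllable fragments, which also keeps the offsets aligned with the original text.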
In the real use case, a piece of text may contain multiple languages, so I created two sub-fields under one field: en and py.
Field mapping:
"title":{
"type": "text",
"term_vector": "with_positions_offsets",
"analyzer": "es_cn",
"fields": {
"py": {
"type": "text",
"term_vector": "with_positions_offsets",
"analyzer": "pinyin"
},
"en": {
"type": "text",
"term_vector": "with_positions_offsets",
"analyzer": "es_en"
}
}
}
Index a document:
PUT /test_index/test_type/2
{
"title": "I have a dream."
}
Query:
GET test_index/_search
{
"query": {
"multi_match": {
"query": "哈哈",
"fields": [
"title.*"
],
"type": "most_fields",
"minimum_should_match": "1"
}
},
"highlight": {
"require_field_match": "false",
"fields": {
"title": {
"matched_fields": ["title","title.py","title.en"],
"type": "fvh"
}
},
"pre_tags": ["<b>"],
"post_tags": ["</b>"]
}
}
Result:
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 1,
"max_score" : 0.5753642,
"hits" : [
{
"_index" : "test_index",
"_type" : "test_type",
"_id" : "2",
"_score" : 0.5753642,
"_source" : {
"title" : "I have a dream."
},
"highlight" : {
"title" : [
"I hav<b>e </b>a dream."
]
}
}
]
}
}
With English or pinyin text, the start_offset values produced by the pinyin analyzer become abnormal. Another example:
GET /test_index/_analyze
{
"text": ["老猫,i have a dream"],
"analyzer": "pinyin"
}
{
"tokens" : [
{
"token" : "lao",
"start_offset" : 0,
"end_offset" : 1,
"type" : "word",
"position" : 0
},
{
"token" : "mao",
"start_offset" : 1,
"end_offset" : 2,
"type" : "word",
"position" : 1
},
{
"token" : "i",
"start_offset" : 6,
"end_offset" : 7,
"type" : "word",
"position" : 2
},
{
"token" : "ha",
"start_offset" : 7,
"end_offset" : 9,
"type" : "word",
"position" : 3
},
{
"token" : "v",
"start_offset" : 9,
"end_offset" : 10,
"type" : "word",
"position" : 4
},
{
"token" : "e",
"start_offset" : 10,
"end_offset" : 11,
"type" : "word",
"position" : 5
},
{
"token" : "a",
"start_offset" : 11,
"end_offset" : 12,
"type" : "word",
"position" : 6
},
{
"token" : "d",
"start_offset" : 12,
"end_offset" : 13,
"type" : "word",
"position" : 7
},
{
"token" : "re",
"start_offset" : 13,
"end_offset" : 15,
"type" : "word",
"position" : 8
},
{
"token" : "a",
"start_offset" : 15,
"end_offset" : 16,
"type" : "word",
"position" : 9
},
{
"token" : "m",
"start_offset" : 16,
"end_offset" : 17,
"type" : "word",
"position" : 10
}
]
}
The Chinese 老猫 is converted to pinyin correctly, but in the English part that follows, the start_offset values grow as spaces and punctuation accumulate, so the highlight positions come out wrong.
Do you have any good suggestions for this situation?
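For the mixed-language case, one workaround worth trying (a sketch, not a verified fix) is to drop the fvh `matched_fields` combination, since it merges offsets produced by analyzers that disagree about token boundaries. Instead, highlight each sub-field separately and pick whichever fragment comes back:

```
GET test_index/_search
{
  "query": {
    "multi_match": {
      "query": "哈哈",
      "fields": ["title.*"],
      "type": "most_fields"
    }
  },
  "highlight": {
    "require_field_match": false,
    "fields": {
      "title": {},
      "title.py": {},
      "title.en": {}
    },
    "pre_tags": ["<b>"],
    "post_tags": ["</b>"]
  }
}
```

Each sub-field is then highlighted against its own analyzer's offsets, so a bad offset from the pinyin analyzer on title.py cannot shift the highlight inside title or title.en.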
My highlight offsets are abnormal too.