
6.6: pinyin not highlighted

Open azwc opened this issue 5 years ago • 3 comments

@medcl I've tried several times without success, so I'm here asking for help.

ES version 6.6, plugin version 6.6. The symptoms are as follows: https://github.com/medcl/elasticsearch-analysis-pinyin/issues/82 — I followed the answer there step by step, but the result is still not highlighted as expected.

PUT index333
{
  "mappings": {
    "type": {
      "properties": {
        "name": {
          "type": "text",
          "analyzer": "pinyin"
        }
      }
    }
  }
}

POST index333/type/1
{
  "name":"教程123测试数据教程长度可能比较长1231546"
}

GET index333/_search
{
  "query": {
    "match": {
      "name": "jiaocheng"
    }
  },
  "highlight": {
    "fields": {
      "name": {
        "fragment_size": 20
      }
    }
  }
}

The result is as follows:

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.8593601,
    "hits" : [
      {
        "_index" : "index333",
        "_type" : "type",
        "_id" : "1",
        "_score" : 0.8593601,
        "_source" : {
          "name" : "教程123测试数据教程长度可能比较长1231546"
        },
        "highlight" : {
          "name" : [
            "教程123测试数据教程长度可能比较长1231546"
          ]
        }
      }
    ]
  }
}

Why is there no highlighting in my case?

azwc avatar Mar 13 '19 06:03 azwc

Offsets are now ignored by default; you need to enable them for highlighting to work.

DELETE index333

PUT index333/
{
  "settings": {
    "analysis": {
      "analyzer": {
        "pinyin": {
          "type": "pinyin",
          "ignore_pinyin_offset": false
        }
      }
    }
  },
  "mappings": {
    "type": {
      "properties": {
        "name": {
          "type": "text",
          "analyzer": "pinyin"
        }
      }
    }
  }
}
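
(Note: recreating the index also deletes the document indexed earlier, so it has to be indexed again before the search below will return a highlighted hit:)

POST index333/type/1
{
  "name": "教程123测试数据教程长度可能比较长1231546"
}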



GET index333/_search
{
  "query": {
    "match": {
      "name": "jiaocheng"
    }
  },
  "highlight": {
    "fields": {
      "name": {}
    }
  }
}

medcl avatar Mar 14 '19 05:03 medcl

@medcl Thanks, the previous issue is solved. Now I've run into a new one:

With spaces and punctuation:
GET /test_index/_analyze
{
  "text": ["I have a dream."],
  "analyzer": "pinyin"
}
Result:
{
  "tokens" : [
    {
      "token" : "i",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "ha",
      "start_offset" : 5,
      "end_offset" : 7,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "v",
      "start_offset" : 7,
      "end_offset" : 8,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "e",
      "start_offset" : 8,
      "end_offset" : 9,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "a",
      "start_offset" : 9,
      "end_offset" : 10,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "d",
      "start_offset" : 10,
      "end_offset" : 11,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "re",
      "start_offset" : 11,
      "end_offset" : 13,
      "type" : "word",
      "position" : 6
    },
    {
      "token" : "a",
      "start_offset" : 13,
      "end_offset" : 14,
      "type" : "word",
      "position" : 7
    },
    {
      "token" : "m",
      "start_offset" : 14,
      "end_offset" : 15,
      "type" : "word",
      "position" : 8
    }
  ]
}

Without spaces or punctuation:
GET /test_index/_analyze
{
  "text": ["Ihaveadream"],
  "analyzer": "pinyin"
}
Result:
{
  "tokens" : [
    {
      "token" : "i",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "ha",
      "start_offset" : 1,
      "end_offset" : 3,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "v",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "e",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "a",
      "start_offset" : 5,
      "end_offset" : 6,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "d",
      "start_offset" : 6,
      "end_offset" : 7,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "re",
      "start_offset" : 7,
      "end_offset" : 9,
      "type" : "word",
      "position" : 6
    },
    {
      "token" : "a",
      "start_offset" : 9,
      "end_offset" : 10,
      "type" : "word",
      "position" : 7
    },
    {
      "token" : "m",
      "start_offset" : 10,
      "end_offset" : 11,
      "type" : "word",
      "position" : 8
    }
  ]
}

How can this offset problem be solved?

In practice a piece of text may contain multiple languages, so I created two sub-fields under one field: en and py.

Field mapping:
"title":{
          "type": "text",
          "term_vector": "with_positions_offsets",
          "analyzer": "es_cn",
          "fields": {
            "py": {
              "type": "text",
              "term_vector": "with_positions_offsets",
              "analyzer": "pinyin"
            },
            "en": {
              "type": "text",
              "term_vector": "with_positions_offsets",
              "analyzer": "es_en"
            }
          }
        }
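
(For context: es_cn and es_en are custom analyzers whose definitions aren't shown in this thread, and term_vector: with_positions_offsets is set because the fvh highlighter used below reads positions and offsets from the stored term vectors. A minimal sketch of what the matching index settings might look like — the smartcn and english analyzer choices are assumptions, not the original configuration:)

PUT /test_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        # pinyin with offsets enabled, as in the reply above
        "pinyin": {
          "type": "pinyin",
          "ignore_pinyin_offset": false
        },
        # assumption: es_cn could be any Chinese analyzer (smartcn plugin here)
        "es_cn": { "type": "smartcn" },
        # assumption: es_en as the built-in english analyzer
        "es_en": { "type": "english" }
      }
    }
  }
}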

Index a document:
PUT /test_index/test_type/2
{
  "title": "I have a dream."
}

Query:
GET test_index/_search
{
  "query": {
    "multi_match": {
      "query": "哈哈",
      "fields": [
        "title.*"
      ],
      "type": "most_fields",
      "minimum_should_match": "1"
    }
  },
  "highlight": {
    "require_field_match": "false",
    "fields": {
      "title": {
        "matched_fields": ["title","title.py","title.en"],
        "type": "fvh"
      }
    },
    "pre_tags": ["<b>"],
    "post_tags": ["</b>"]
  }
}

Result:
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.5753642,
    "hits" : [
      {
        "_index" : "test_index",
        "_type" : "test_type",
        "_id" : "2",
        "_score" : 0.5753642,
        "_source" : {
          "title" : "I have a dream."
        },
        "highlight" : {
          "title" : [
            "I hav<b>e </b>a dream."
          ]
        }
      }
    ]
  }
}

With English or pinyin text, the start_offset values produced by the pinyin analyzer become abnormal, and since the fvh highlighter takes its offsets straight from the stored term vectors, the tags land in the wrong place. Another example:

GET /test_index/_analyze
{
  "text": ["老猫,i have a dream"],
  "analyzer": "pinyin"
}

{
  "tokens" : [
    {
      "token" : "lao",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "mao",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "i",
      "start_offset" : 6,
      "end_offset" : 7,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "ha",
      "start_offset" : 7,
      "end_offset" : 9,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "v",
      "start_offset" : 9,
      "end_offset" : 10,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "e",
      "start_offset" : 10,
      "end_offset" : 11,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "a",
      "start_offset" : 11,
      "end_offset" : 12,
      "type" : "word",
      "position" : 6
    },
    {
      "token" : "d",
      "start_offset" : 12,
      "end_offset" : 13,
      "type" : "word",
      "position" : 7
    },
    {
      "token" : "re",
      "start_offset" : 13,
      "end_offset" : 15,
      "type" : "word",
      "position" : 8
    },
    {
      "token" : "a",
      "start_offset" : 15,
      "end_offset" : 16,
      "type" : "word",
      "position" : 9
    },
    {
      "token" : "m",
      "start_offset" : 16,
      "end_offset" : 17,
      "type" : "word",
      "position" : 10
    }
  ]
}

The pinyin for the Chinese 老猫 is segmented correctly, but in the English part that follows, the start_offset values keep growing as spaces and punctuation accumulate, so the highlight positions come out wrong.
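
(For comparison, the standard analyzer — whose character offsets are reliable — reports "i" at start_offset 3 on the same input, right after 老猫,, versus 6 in the pinyin output above, which makes the drift easy to measure:)

GET /test_index/_analyze
{
  "text": ["老猫,i have a dream"],
  "analyzer": "standard"
}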

Is there any good advice for this situation?
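
(One possible workaround — an editor's sketch, not an answer from this thread: keep ignore_pinyin_offset at its default of true so the indexed offsets stay sane, and drop title.py from matched_fields; pinyin still contributes to matching through the multi_match, but highlighting is driven only by the fields with trustworthy offsets, at the cost of no highlight when only the pinyin field matched:)

GET test_index/_search
{
  "query": {
    "multi_match": {
      "query": "哈哈",
      "fields": ["title.*"],
      "type": "most_fields",
      "minimum_should_match": "1"
    }
  },
  "highlight": {
    "require_field_match": "false",
    "fields": {
      "title": {
        "matched_fields": ["title", "title.en"],
        "type": "fvh"
      }
    },
    "pre_tags": ["<b>"],
    "post_tags": ["</b>"]
  }
}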

azwc avatar Mar 14 '19 08:03 azwc

My highlight offsets are wrong too.

codingcn avatar Jan 29 '21 12:01 codingcn