
ngram

S-Dragon0302 opened this issue 1 year ago · 6 comments

Elasticsearch Version

7.15.1

Installed Plugins

No response

Java Version

bundled

OS Version

mac

Problem Description

PUT /my_index
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "letter_digit_tokenizer": {
          "type": "pattern",
          "pattern": "[^\\p{L}\\p{N}]+"
        }
      },
      "filter": {
        "my_ngram_filter": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 2
        }
      },
      "analyzer": {
        "my_letter_digit_ngram_analyzer": {
          "type": "custom",
          "tokenizer": "letter_digit_tokenizer",
          "filter": [ "lowercase", "my_ngram_filter" ]
        }
      }
    }
  }
}

GET /my_index/_analyze
{
  "analyzer": "my_letter_digit_ngram_analyzer",
  "text": "是ྂ不ྂ是ྂ发ྂ现ྂ我ྂ的ྂ字ྂ冒ྂ烟ྂ了ྂ"
}

{ "tokens" : [ ] }

or

{
  "tokens": [
    { "token": "是不", "start_offset": 0, "end_offset": 3, "type": "word", "position": 0 },
    { "token": "不是", "start_offset": 2, "end_offset": 5, "type": "word", "position": 1 },
    { "token": "是发", "start_offset": 4, "end_offset": 7, "type": "word", "position": 2 },
    { "token": "发现", "start_offset": 6, "end_offset": 9, "type": "word", "position": 3 },
    { "token": "现我", "start_offset": 8, "end_offset": 11, "type": "word", "position": 4 },
    { "token": "我的", "start_offset": 10, "end_offset": 13, "type": "word", "position": 5 },
    { "token": "的字", "start_offset": 12, "end_offset": 15, "type": "word", "position": 6 },
    { "token": "字冒", "start_offset": 14, "end_offset": 17, "type": "word", "position": 7 },
    { "token": "冒烟", "start_offset": 16, "end_offset": 19, "type": "word", "position": 8 },
    { "token": "烟了", "start_offset": 18, "end_offset": 21, "type": "word", "position": 9 }
  ]
}

Steps to Reproduce

1. Create the index with the settings shown above (the pattern tokenizer plus the 2-gram my_ngram_filter).
2. Run the _analyze request above with my_letter_digit_ngram_analyzer on the text 是ྂ不ྂ是ྂ发ྂ现ྂ我ྂ的ྂ字ྂ冒ྂ烟ྂ了ྂ.
3. The response is { "tokens": [] } instead of the expected bigram tokens shown above.

Logs (if relevant)

No response

S-Dragon0302 (Jun 24 '24)

Pinging @elastic/es-search (Team:Search)

elasticsearchmachine (Jun 25 '24)

@S-Dragon0302 Would you please let us know what problem you are encountering? I'm going to remove the "bug" label for now, as I don't see what's missing. Also keep in mind that if this is a language-specific problem, the language-specific discuss forums (https://discuss.elastic.co/c/in-your-native-tongue/11) might be a good place to ask.

cbuescher (Jun 25 '24)

The segmentation result is incorrect: the analyzer produces no tokens at all, but there should actually be a result.

S-Dragon0302 (Jun 28 '24)

The segmentation result should be this:

{
  "tokens": [
    { "token": "是不", "start_offset": 0, "end_offset": 3, "type": "word", "position": 0 },
    { "token": "不是", "start_offset": 2, "end_offset": 5, "type": "word", "position": 1 },
    { "token": "是发", "start_offset": 4, "end_offset": 7, "type": "word", "position": 2 },
    { "token": "发现", "start_offset": 6, "end_offset": 9, "type": "word", "position": 3 },
    { "token": "现我", "start_offset": 8, "end_offset": 11, "type": "word", "position": 4 },
    { "token": "我的", "start_offset": 10, "end_offset": 13, "type": "word", "position": 5 },
    { "token": "的字", "start_offset": 12, "end_offset": 15, "type": "word", "position": 6 },
    { "token": "字冒", "start_offset": 14, "end_offset": 17, "type": "word", "position": 7 },
    { "token": "冒烟", "start_offset": 16, "end_offset": 19, "type": "word", "position": 8 },
    { "token": "烟了", "start_offset": 18, "end_offset": 21, "type": "word", "position": 9 }
  ]
}

S-Dragon0302 (Jun 28 '24)

The actual result is this: { "tokens" : [ ] }

S-Dragon0302 (Jun 28 '24)

@S-Dragon0302

For the given text 是ྂ不ྂ是ྂ发ྂ现ྂ我ྂ的ྂ字ྂ冒ྂ烟ྂ了ྂ, the pattern tokenizer without the ngram token filter:

GET /my_index/_analyze
{
  "filter": [
    "lowercase"
  ],
  "tokenizer": {
    "type": "pattern",
    "pattern": "[^\\p{L}\\p{N}]+"
  },
  "text": "是ྂ不ྂ是ྂ发ྂ现ྂ我ྂ的ྂ字ྂ冒ྂ烟ྂ了ྂ"
}

Results in:

{
  "tokens": [
    {
      "token": "是",
      "start_offset": 0,
      "end_offset": 1,
      "type": "word",
      "position": 0
    },
    {
      "token": "不",
      "start_offset": 2,
      "end_offset": 3,
      "type": "word",
      "position": 1
    },
    {
      "token": "是",
      "start_offset": 4,
      "end_offset": 5,
      "type": "word",
      "position": 2
    },
    {
      "token": "发",
      "start_offset": 6,
      "end_offset": 7,
      "type": "word",
      "position": 3
    },
    {
      "token": "现",
      "start_offset": 8,
      "end_offset": 9,
      "type": "word",
      "position": 4
    },
    {
      "token": "我",
      "start_offset": 10,
      "end_offset": 11,
      "type": "word",
      "position": 5
    },
    {
      "token": "的",
      "start_offset": 12,
      "end_offset": 13,
      "type": "word",
      "position": 6
    },
    {
      "token": "字",
      "start_offset": 14,
      "end_offset": 15,
      "type": "word",
      "position": 7
    },
    {
      "token": "冒",
      "start_offset": 16,
      "end_offset": 17,
      "type": "word",
      "position": 8
    },
    {
      "token": "烟",
      "start_offset": 18,
      "end_offset": 19,
      "type": "word",
      "position": 9
    },
    {
      "token": "了",
      "start_offset": 20,
      "end_offset": 21,
      "type": "word",
      "position": 10
    }
  ]
}

None of those tokens is longer than one character: the pattern [^\p{L}\p{N}]+ treats the combining mark ྂ as a separator because it is a nonspacing mark, not a letter or a digit, so the tokenizer emits each Chinese character as its own token. The ngram token filter works within each token, so requiring 2-grams over 1-character tokens results in no output.
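To make that concrete, here is a minimal sketch (not part of the original comment) that applies the same 2-gram filter directly to a single token via _analyze:

GET /_analyze
{
  "tokenizer": "keyword",
  "filter": [
    { "type": "ngram", "min_gram": 2, "max_gram": 2 }
  ],
  "text": "是"
}

This returns { "tokens": [] }, because a one-character token cannot yield a 2-gram; changing "text" to "是不" returns the single token 是不. Since the filter never looks across token boundaries, the single-character tokens produced above can never combine into bigrams.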

benwtrent (Jun 28 '24)

Closing as expected behavior. An ngram filter that requires 2-character grams is expected to produce no output when every input token is only 1 character long.

benwtrent (Jul 12 '24)
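For reference, a hedged workaround sketch (not suggested in the thread; the index, char filter, tokenizer, and analyzer names here are made up for illustration): if the goal is bigrams that span the whole string, one option is to strip the combining marks with a pattern_replace char filter and use an ngram tokenizer instead of an ngram token filter.

PUT /my_bigram_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "strip_non_letter_digit": {
          "type": "pattern_replace",
          "pattern": "[^\\p{L}\\p{N}]+",
          "replacement": ""
        }
      },
      "tokenizer": {
        "bigram_tokenizer": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 2,
          "token_chars": [ "letter", "digit" ]
        }
      },
      "analyzer": {
        "bigram_analyzer": {
          "type": "custom",
          "char_filter": [ "strip_non_letter_digit" ],
          "tokenizer": "bigram_tokenizer",
          "filter": [ "lowercase" ]
        }
      }
    }
  }
}

Analyzing 是ྂ不ྂ是ྂ发ྂ现ྂ我ྂ的ྂ字ྂ冒ྂ烟ྂ了ྂ with bigram_analyzer should then emit 是不, 不是, 是发, and so on, although the reported offsets may differ from the expected output above because the char filter removes the combining marks before tokenization.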