ngram
Elasticsearch Version
7.15.1
Installed Plugins
No response
Java Version
bundled
OS Version
mac
Problem Description
PUT /my_index
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "letter_digit_tokenizer": {
          "type": "pattern",
          "pattern": "[^\\p{L}\\p{N}]+"
        }
      },
      "filter": {
        "my_ngram_filter": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 2
        }
      },
      "analyzer": {
        "my_letter_digit_ngram_analyzer": {
          "type": "custom",
          "tokenizer": "letter_digit_tokenizer",
          "filter": [
            "lowercase",
            "my_ngram_filter"
          ]
        }
      }
    }
  }
}
GET /my_index/_analyze
{
  "analyzer": "my_letter_digit_ngram_analyzer",
  "text": "是ྂ不ྂ是ྂ发ྂ现ྂ我ྂ的ྂ字ྂ冒ྂ烟ྂ了ྂ"
}
{ "tokens" : [ ] } or { "tokens": [ { "token": "是不", "start_offset": 0, "end_offset": 3, "type": "word", "position": 0 }, { "token": "不是", "start_offset": 2, "end_offset": 5, "type": "word", "position": 1 }, { "token": "是发", "start_offset": 4, "end_offset": 7, "type": "word", "position": 2 }, { "token": "发现", "start_offset": 6, "end_offset": 9, "type": "word", "position": 3 }, { "token": "现我", "start_offset": 8, "end_offset": 11, "type": "word", "position": 4 }, { "token": "我的", "start_offset": 10, "end_offset": 13, "type": "word", "position": 5 }, { "token": "的字", "start_offset": 12, "end_offset": 15, "type": "word", "position": 6 }, { "token": "字冒", "start_offset": 14, "end_offset": 17, "type": "word", "position": 7 }, { "token": "冒烟", "start_offset": 16, "end_offset": 19, "type": "word", "position": 8 }, { "token": "烟了", "start_offset": 18, "end_offset": 21, "type": "word", "position": 9 } ] }
Steps to Reproduce
Same requests and results as in the Problem Description above.
Logs (if relevant)
No response
Pinging @elastic/es-search (Team:Search)
@S-Dragon0302 Would you please let us know what problem you are encountering? I'm going to remove the "bug" label for now, as I don't see what's missing. Also keep in mind that if this is a language-specific problem, the language-specific discuss forums (https://discuss.elastic.co/c/in-your-native-tongue/11) might be a good place to ask.
The segmentation result is incorrect: the analysis returns no tokens at all, but there should be a result.
The segmentation result should be this:
{
  "tokens": [
    { "token": "是不", "start_offset": 0, "end_offset": 3, "type": "word", "position": 0 },
    { "token": "不是", "start_offset": 2, "end_offset": 5, "type": "word", "position": 1 },
    { "token": "是发", "start_offset": 4, "end_offset": 7, "type": "word", "position": 2 },
    { "token": "发现", "start_offset": 6, "end_offset": 9, "type": "word", "position": 3 },
    { "token": "现我", "start_offset": 8, "end_offset": 11, "type": "word", "position": 4 },
    { "token": "我的", "start_offset": 10, "end_offset": 13, "type": "word", "position": 5 },
    { "token": "的字", "start_offset": 12, "end_offset": 15, "type": "word", "position": 6 },
    { "token": "字冒", "start_offset": 14, "end_offset": 17, "type": "word", "position": 7 },
    { "token": "冒烟", "start_offset": 16, "end_offset": 19, "type": "word", "position": 8 },
    { "token": "烟了", "start_offset": 18, "end_offset": 21, "type": "word", "position": 9 }
  ]
}
The actual result is this:
{ "tokens": [] }
@S-Dragon0302
For the given input 是ྂ不ྂ是ྂ发ྂ现ྂ我ྂ的ྂ字ྂ冒ྂ烟ྂ了ྂ, running the pattern tokenizer without the ngram token filter:
GET /my_index/_analyze
{
"filter": [
"lowercase"
],
"tokenizer": {
"type": "pattern",
"pattern": "[^\\p{L}\\p{N}]+"
},
"text": "是ྂ不ྂ是ྂ发ྂ现ྂ我ྂ的ྂ字ྂ冒ྂ烟ྂ了ྂ"
}
Results in:
{
"tokens": [
{
"token": "是",
"start_offset": 0,
"end_offset": 1,
"type": "word",
"position": 0
},
{
"token": "不",
"start_offset": 2,
"end_offset": 3,
"type": "word",
"position": 1
},
{
"token": "是",
"start_offset": 4,
"end_offset": 5,
"type": "word",
"position": 2
},
{
"token": "发",
"start_offset": 6,
"end_offset": 7,
"type": "word",
"position": 3
},
{
"token": "现",
"start_offset": 8,
"end_offset": 9,
"type": "word",
"position": 4
},
{
"token": "我",
"start_offset": 10,
"end_offset": 11,
"type": "word",
"position": 5
},
{
"token": "的",
"start_offset": 12,
"end_offset": 13,
"type": "word",
"position": 6
},
{
"token": "字",
"start_offset": 14,
"end_offset": 15,
"type": "word",
"position": 7
},
{
"token": "冒",
"start_offset": 16,
"end_offset": 17,
"type": "word",
"position": 8
},
{
"token": "烟",
"start_offset": 18,
"end_offset": 19,
"type": "word",
"position": 9
},
{
"token": "了",
"start_offset": 20,
"end_offset": 21,
"type": "word",
"position": 10
}
]
}
None of those tokens is longer than a single character, so the ngram filter, which requires at least two characters (min_gram: 2), produces no output.
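To make that concrete, here is a minimal Java sketch (not part of the original issue; the class name is illustrative) reproducing the split that the pattern tokenizer, which uses Java regular expressions, performs on this text. The combining mark after each Chinese character is a nonspacing mark (Unicode category Mn), which matches neither \p{L} nor \p{N}, so the pattern treats it as a delimiter and every resulting token is a single character.

import java.util.Arrays;

public class PatternSplitDemo {
    public static void main(String[] args) {
        // Same input and split pattern as the pattern tokenizer configured above.
        String text = "是ྂ不ྂ是ྂ发ྂ现ྂ我ྂ的ྂ字ྂ冒ྂ烟ྂ了ྂ";
        String[] tokens = text.split("[^\\p{L}\\p{N}]+");
        System.out.println(Arrays.toString(tokens));
        // Prints eleven single-character tokens:
        // [是, 不, 是, 发, 现, 我, 的, 字, 冒, 烟, 了]

        // The character after each ideograph is a combining (nonspacing) mark,
        // so it is neither a letter nor a digit and acts as a token delimiter.
        System.out.println(Character.getType(text.charAt(1)) == Character.NON_SPACING_MARK);
        // Expected output: true
    }
}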
Closing as expected behavior: an ngram filter that requires 2-character grams, applied to tokens that are only 1 character long, is expected to emit nothing.