elasticsearch-sudachi
Unable to Reproduce Example as Described in Documentation
Hi,
Thank you for the great plugin. I cannot reproduce the example written in the official documentation.
Input:
{
  "analyzer": "sudachi_analyzer",
  "text": "寿司がおいしいね"
}
Expected (as described in the document):
{
  "tokens": [
    {
      "token": "寿司",
      "start_offset": 0,
      "end_offset": 2,
      "type": "word",
      "position": 0
    },
    {
      "token": "美味しい",
      "start_offset": 3,
      "end_offset": 7,
      "type": "word",
      "position": 2
    }
  ]
}
Actual (v3.1.0 release with OpenSearch 2.6.0):
{
  "tokens": [
    {
      "end_offset": 2,
      "position": 0,
      "start_offset": 0,
      "token": "寿司",
      "type": "word"
    },
    {
      "end_offset": 3,
      "position": 1,
      "start_offset": 2,
      "token": "が",
      "type": "word"
    },
    {
      "end_offset": 7,
      "position": 2,
      "start_offset": 3,
      "token": "美味しい",
      "type": "word"
    }
  ]
}
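For completeness, I am calling the standard _analyze API against an index where sudachi_analyzer is defined, roughly like this (the index name here is just a placeholder):
GET /my_sudachi_index/_analyze
{
  "analyzer": "sudachi_analyzer",
  "text": "寿司がおいしいね"
}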
If you think of any possible causes, please leave a comment. I appreciate your assistance.
I assume you applied the sudachi_part_of_speech setting in the README, but I could not reproduce your results here.
Please let us know your configuration file and the dictionaries you are using.
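For reference, the kind of configuration I have in mind looks roughly like the following. This is only a sketch: the exact filter chain in the documented example may differ, and the stoptags values here are just illustrative ones that would drop particles such as が. (I include sudachi_normalizedform because the documented output shows 美味しい for おいしい.)
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "sudachi_analyzer": {
            "type": "custom",
            "tokenizer": "sudachi_tokenizer",
            "filter": [
              "pos_filter",
              "sudachi_normalizedform"
            ]
          }
        },
        "filter": {
          "pos_filter": {
            "type": "sudachi_part_of_speech",
            "stoptags": [
              "助詞",
              "助動詞"
            ]
          }
        },
        "tokenizer": {
          "sudachi_tokenizer": {
            "type": "sudachi_tokenizer"
          }
        }
      }
    }
  }
}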
We are using the full dictionary and the base configuration.
My apologies; the example I provided is not quite accurate. I think the problem is actually related to a change in how baseform, readingform, and normalizedform are applied.
I tried to put together a minimal, reproducible example:
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "sudachi_analyzer": {
            "filter": [
              "sudachi_ja_stop",
              "sudachi_baseform"
            ],
            "type": "custom",
            "tokenizer": "sudachi_tokenizer"
          }
        },
        "tokenizer": {
          "sudachi_tokenizer": {
            "type": "sudachi_tokenizer"
          }
        }
      }
    }
  }
}
If we analyze this text,
{
  "analyzer": "sudachi_analyzer",
  "text": "および"
}
I expect to get no tokens, because および is defined in stopwords.txt and should be removed by the sudachi_ja_stop filter. However, I still get one token:
{
  "tokens": [
    {
      "token": "および",
      "start_offset": 0,
      "end_offset": 3,
      "type": "word",
      "position": 0
    }
  ]
}
This is with v3.1.0 running on OpenSearch 2.6.0. I have confirmed that with our old version, v2.1.0 running on Elasticsearch 7.10.2, the expected behavior occurs (no tokens).
I see from the changelog that in v3.0.0 there was a change related to how analysis chains are processed. Is this a side effect of that?
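One way to narrow this down might be to run _analyze with explain enabled, which shows the token stream after the tokenizer and after each filter of a custom analyzer, so it should be visible whether sudachi_ja_stop is being applied at all. The request would be the same as above, just with the extra flag:
{
  "analyzer": "sudachi_analyzer",
  "text": "および",
  "explain": true
}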
On further investigation, this is the exact same issue as #111. Please close this if you feel it is necessary, but it would be nice to resolve #111.
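In the meantime, it might also be worth checking whether an explicitly configured stopword list behaves any differently from the bundled stopwords.txt. Assuming sudachi_ja_stop accepts a stopwords parameter like the standard stop filter (I have not verified this), the minimal example above would become:
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "sudachi_analyzer": {
            "filter": [
              "my_ja_stop",
              "sudachi_baseform"
            ],
            "type": "custom",
            "tokenizer": "sudachi_tokenizer"
          }
        },
        "filter": {
          "my_ja_stop": {
            "type": "sudachi_ja_stop",
            "stopwords": ["および"]
          }
        },
        "tokenizer": {
          "sudachi_tokenizer": {
            "type": "sudachi_tokenizer"
          }
        }
      }
    }
  }
}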