Unable to Reproduce Example as Described in Documentation

arcoyk opened this issue 1 year ago • 3 comments

Hi,

Thank you for the great plugin. I cannot reproduce the example written in the official documentation.

Input:

{
  "analyzer": "sudachi_analyzer",
  "text": "寿司がおいしいね"
}

Expected (as described in the documentation):

{
  "tokens": [
    {
      "token": "寿司",
      "start_offset": 0,
      "end_offset": 2,
      "type": "word",
      "position": 0
    },
    {
      "token": "美味しい",
      "start_offset": 3,
      "end_offset": 7,
      "type": "word",
      "position": 2
    }
  ]
}

Actual (v3.1.0 release with OpenSearch 2.6.0):

{
  "tokens": [
    {
      "token": "寿司",
      "start_offset": 0,
      "end_offset": 2,
      "type": "word",
      "position": 0
    },
    {
      "token": "が",
      "start_offset": 2,
      "end_offset": 3,
      "type": "word",
      "position": 1
    },
    {
      "token": "美味しい",
      "start_offset": 3,
      "end_offset": 7,
      "type": "word",
      "position": 2
    }
  ]
}

If you think of any possible causes, please leave a comment. I appreciate your assistance.

arcoyk avatar Mar 05 '24 12:03 arcoyk

I assume you applied the sudachi_part_of_speech setting from the README, but I could not reproduce your results here. Please let us know which configuration file and dictionaries you are using.
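
For reference, the README's sudachi_part_of_speech and sudachi_normalizedform examples combine roughly as follows (a sketch, with stoptags abbreviated to the particle and auxiliary-verb entries; the exact analyzer in the documentation may differ):

{
  "settings": {
    "index": {
      "analysis": {
        "filter": {
          "my_posfilter": {
            "type": "sudachi_part_of_speech",
            "stoptags": ["助詞", "助動詞"]
          }
        },
        "analyzer": {
          "sudachi_analyzer": {
            "type": "custom",
            "tokenizer": "sudachi_tokenizer",
            "filter": ["my_posfilter", "sudachi_normalizedform"]
          }
        },
        "tokenizer": {
          "sudachi_tokenizer": {
            "type": "sudachi_tokenizer"
          }
        }
      }
    }
  }
}

With 助詞 in stoptags, the particles が and ね are removed, and sudachi_normalizedform maps おいしい to 美味しい, which would match the expected output above.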

kazuma-t avatar Mar 06 '24 05:03 kazuma-t

We are using the full dictionary and the base configuration.

My apologies, the example I provided was not quite accurate. I think the problem is actually related to a change in how the baseform, readingform, and normalizedform filters are applied.

I tried to put together a minimal, reproducible example:

{
  "settings": {
    "index": {
      "analysis" : {
        "analyzer" : {
          "sudachi_analyzer" : {
            "filter" : [
              "sudachi_ja_stop",
              "sudachi_baseform"
            ],
            "type" : "custom",
            "tokenizer" : "sudachi_tokenizer"
          }
        },
        "tokenizer" : {
          "sudachi_tokenizer" : {
            "type" : "sudachi_tokenizer"
          }
        }
      }
    }
  }
}
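
For completeness: the analyzer above relies on sudachi_ja_stop's bundled default stopword list. Per the README, the filter can also be defined with an explicit list along these lines (my assumption being that _japanese_ expands to the bundled defaults); I did not override it in my test.

"filter": {
  "my_stopfilter": {
    "type": "sudachi_ja_stop",
    "stopwords": ["_japanese_", "および"]
  }
}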

If we analyze this text:

{
  "analyzer": "sudachi_analyzer",
  "text": "および"
}

I expect to get no tokens, because および is defined in stopwords.txt and should therefore be removed by the sudachi_ja_stop filter. However, I still get one token:

{
  "tokens" : [
    {
      "token" : "および",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 0
    }
  ]
}
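
To isolate the filter, I believe the _analyze API also accepts an ad-hoc chain instead of a named analyzer, referencing the plugin's tokenizer and filter names directly (an untested sketch):

{
  "tokenizer": "sudachi_tokenizer",
  "filter": ["sudachi_ja_stop"],
  "text": "および"
}

If that call still returns the および token, the stop filter itself is not picking up the default list.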

This is with v3.1.0 running on OpenSearch 2.6.0. I have confirmed that in our old setup, v2.1.0 running on Elasticsearch 7.10.2, the expected behavior occurs (no tokens).

I see from the changelog that v3.0.0 included a change to how analysis chains are processed. Could this be a side effect of that change?

arcoyk avatar Mar 07 '24 09:03 arcoyk

On further investigation, this is the exact same issue as #111. Please close this one if you feel it is necessary, but it would be nice to resolve #111.

arcoyk avatar Mar 07 '24 09:03 arcoyk