[BUG] dedup parameter in hunspell token filter is not working
Describe the bug
The dedup parameter of the hunspell token filter should remove duplicate tokens, but it only appears to be referenced in a unit test.
Related component
Indexing
To Reproduce
- Add the hunspell en_GB dictionary files to the config directory.
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_hunspell_filter": {
          "type": "hunspell",
          "lang": "en_GB",
          "longest_only": true
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "my_hunspell_filter"
          ]
        }
      }
    }
  }
}
POST /my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "runners runner"
}
Result:
{
  "tokens": [
    {
      "token": "runner",
      "start_offset": 0,
      "end_offset": 7,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "runner",
      "start_offset": 8,
      "end_offset": 14,
      "type": "<ALPHANUM>",
      "position": 1
    }
  ]
}
Expected behavior
Expected result:
{
  "tokens": [
    {
      "token": "runner",
      "start_offset": 0,
      "end_offset": 7,
      "type": "<ALPHANUM>",
      "position": 0
    }
  ]
}
Thanks @AntonEliatra. Maybe try to write a YAML REST test for this one, short of a fix? We would also love a test and any missing support for analysis filters using hunspell in https://github.com/opensearch-project/opensearch-api-specification.
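A rough sketch of what such a YAML REST test might look like, asserting the reporter's expected output above (the test name, the request shapes, and especially the availability of the en_GB hunspell dictionaries in the test cluster's config directory are all assumptions):

---
"hunspell dedup removes duplicate stems":
  # Assumes the en_GB hunspell dictionaries are present under
  # config/hunspell/en_GB on the test cluster; index creation
  # fails without them.
  - do:
      indices.create:
        index: test_index
        body:
          settings:
            analysis:
              filter:
                my_hunspell_filter:
                  type: hunspell
                  lang: en_GB
                  dedup: true

  - do:
      indices.analyze:
        index: test_index
        body:
          tokenizer: standard
          filter: [ lowercase, my_hunspell_filter ]
          text: runners runner

  # Expected per this report: a single deduplicated token
  - length: { tokens: 1 }
  - match: { tokens.0.token: runner }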
[Indexing Triage - 09/02] @AntonEliatra, thanks for raising this issue! As @dblock mentioned, it would be great if you could add a YAML REST test for the bug and perhaps raise a PR for it. Also, could you share more details on what you expect from the hunspell filter and what the impact of duplicate tokens is?
@vikasvb90 Unfortunately I do not have the resources to submit a PR for this. However, deduplication is needed when the user does not want to receive multiple tokens with the same value in the output. It would also mean a smaller output overall.
I'm not convinced that this is a bug.
The way tokens get turned into terms leaves very little room for duplication. The rough 4-dimensional hierarchy is:

field
 +-> terms
      +-> docs
           +-> positions + offsets
The only duplication is that you'll have a position + offsets entry for the extra occurrence. (By default, we don't actually store offsets -- just positions.) If you don't care about positions + offsets, you can change the index_options for your text field to freqs (if you want to know that the term occurred twice) or docs (if you don't care how many times the term occurred).
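For example, a mapping along these lines (a sketch; my_index and my_field are placeholder names) keeps term frequencies but drops positions and offsets:

PUT /my_index
{
  "mappings": {
    "properties": {
      "my_field": {
        "type": "text",
        "index_options": "freqs"
      }
    }
  }
}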
Regarding the hunspell token filter's dedup parameter specifically, it's not meant to deduplicate across tokens, but rather to avoid emitting duplicate stemmed terms for the same token. (It's implemented by calling this method.) Since stemming rules are applied recursively, I imagine two different rule paths could reach the same term.
Note that setting "longest_only": true makes the value of dedup irrelevant: keeping only the longest stem of each token implicitly deduplicates anyway, since at most one stemmed term is emitted per token.
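So for dedup to come into play at all, the filter would need to be defined without longest_only. A sketch, reusing the names from the repro above (dedup defaults to true, so setting it to false is what would actually surface duplicate stems):

PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_hunspell_filter": {
          "type": "hunspell",
          "lang": "en_GB",
          "dedup": false
        }
      }
    }
  }
}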
In that case this issue can be closed.