[BUG] dedup parameter in hunspell token filter is not working
Describe the bug
The dedup parameter of the hunspell token filter should remove duplicate tokens, but it only appears to be referenced in a unit test.
Related component
Indexing
To Reproduce
- Add the hunspell en_GB dictionary files to the config directory.
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_hunspell_filter": {
          "type": "hunspell",
          "lang": "en_GB",
          "longest_only": true
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "my_hunspell_filter"
          ]
        }
      }
    }
  }
}
POST /my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "runners runner"
}
Result:
{
  "tokens": [
    {
      "token": "runner",
      "start_offset": 0,
      "end_offset": 7,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "runner",
      "start_offset": 8,
      "end_offset": 14,
      "type": "<ALPHANUM>",
      "position": 1
    }
  ]
}
Expected behavior
Expected result:
{
  "tokens": [
    {
      "token": "runner",
      "start_offset": 0,
      "end_offset": 7,
      "type": "<ALPHANUM>",
      "position": 0
    }
  ]
}
Thanks @AntonEliatra. Maybe try to write a YAML REST test for this one, short of a fix? We would also love a test and any missing support for analysis filters using hunspell in https://github.com/opensearch-project/opensearch-api-specification.
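A rough sketch of what such a YAML REST test might look like, asserting the reporter's expected output above (the test name, the request shapes, and especially the availability of the en_GB hunspell dictionaries in the test cluster's config directory are all assumptions):

---
"hunspell dedup removes duplicate stems":
  # Assumes the en_GB hunspell dictionaries are present under
  # config/hunspell/en_GB on the test cluster; index creation
  # fails without them.
  - do:
      indices.create:
        index: test_index
        body:
          settings:
            analysis:
              filter:
                my_hunspell_filter:
                  type: hunspell
                  lang: en_GB
                  dedup: true

  - do:
      indices.analyze:
        index: test_index
        body:
          tokenizer: standard
          filter: [ lowercase, my_hunspell_filter ]
          text: runners runner

  # Expected per this report: a single deduplicated token
  - length: { tokens: 1 }
  - match: { tokens.0.token: runner }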
[Indexing Triage - 09/02] @AntonEliatra, thanks for raising this issue! As @dblock mentioned, it would be great if you could add a YAML REST test for the bug and perhaps raise a PR for it. Also, could you share more details on what you expect from the hunspell filter and what the impact of duplicate tokens is?
@vikasvb90 Unfortunately I do not have the resources to submit a PR for this. However, deduplication is needed when the user does not want to receive multiple tokens with the same value in the output. It would also mean a smaller output overall.
I'm not convinced that this is a bug.
The way tokens get turned into terms leaves very little room for duplication. The rough 4-dimensional hierarchy is:

field
 +-> terms
      +-> docs
           +-> positions + offsets
The only duplication is that you'll have a position + offsets entry for the extra occurrence. (By default, we don't actually store offsets -- just positions.) If you don't care about positions + offsets, you can change the index_options for your text field to freqs (if you want to know that the term occurred twice) or docs (if you don't care how many times the term occurred).
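For example, a mapping along these lines (a sketch; my_index and my_field are placeholder names) keeps term frequencies but drops positions and offsets:

PUT /my_index
{
  "mappings": {
    "properties": {
      "my_field": {
        "type": "text",
        "index_options": "freqs"
      }
    }
  }
}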
Regarding the hunspell token filter's dedup parameter specifically, it's not meant to deduplicate across tokens, but rather to avoid emitting duplicate stemmed terms for the same token. (It's implemented by calling this method.) Since stemming rules are applied recursively, I imagine two different rule paths could reach the same term.
Note that setting "longest_only": true makes the value of dedup irrelevant: keeping only the longest stem of each token implicitly deduplicates anyway, since at most one stemmed term is emitted per token.
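So for dedup to come into play at all, the filter would need to be defined without longest_only. A sketch, reusing the names from the repro above (dedup defaults to true, so setting it to false is what would actually surface duplicate stems):

PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_hunspell_filter": {
          "type": "hunspell",
          "lang": "en_GB",
          "dedup": false
        }
      }
    }
  }
}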
In that case this issue can be closed.