Exclude `sku` from spellcheck / fuzzy search

Open bosch-manuel opened this issue 1 year ago • 11 comments

I'm trying to exclude the sku field from spellchecked searches. Since we are only using consecutive product numbers, returning "similar" products based on sku is of no use.

I hoped that changing `isUsedInSpellcheck` to 0 would be enough. However, this doesn't stop the sku field from being copied to the spelling fields at index time: `spelling.whitespace`, `spelling.phonetic`, and `spelling.shingle`.

What do I have to do to keep sku out of those fields? Where is this `copy_to` behavior controlled? Maybe I just missed it in the docs?

bosch-manuel avatar Oct 04 '23 10:10 bosch-manuel

Hello @bosch-manuel,

Yes, the 'sku' attribute is referenced directly in a few (very old) places in the code, where it is treated as a special case of an attribute using the 'reference' analyzer, for instance in exact match queries. This led a few recent releases to introduce experimental settings that generalize that behavior to other attributes using that analyzer.

But what you're describing is a bit strange. Let us check locally whether it could simply be a caching issue.

Regards,

rbayet avatar Oct 04 '23 16:10 rbayet

Hello @rbayet,

Did you find the time to check it locally? I cleared all caches and even switched the Elasticsearch instance. Same result.

bosch-manuel avatar Oct 10 '23 06:10 bosch-manuel

Hello @bosch-manuel,

No, sorry, I did not; I was off for a couple of days. I'll let you know.

Regards,

rbayet avatar Oct 10 '23 07:10 rbayet

Oh, I'm sorry! You can ignore this issue. It seems I had another attribute, url_key, with the same value as sku. So it was not sku that was copied to spelling, it was my url_key. My fault.
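For anyone hitting the same confusion: Elasticsearch copies field values through the `copy_to` mapping parameter, so the index mapping shows which fields feed `spelling`. Below is a minimal sketch, assuming the mapping's `properties` section is already available as a Python dict. The helper name `fields_copied_to` and the sample mapping are hypothetical and do not reproduce the mapping ElasticSuite actually generates.

```python
def fields_copied_to(properties, target, prefix=""):
    """Return the names of fields whose `copy_to` includes `target`."""
    found = []
    for name, spec in properties.items():
        path = prefix + name
        copy_to = spec.get("copy_to", [])
        if isinstance(copy_to, str):  # copy_to may be a single string or a list
            copy_to = [copy_to]
        if target in copy_to:
            found.append(path)
        # Recurse into sub-properties of object or nested fields
        if "properties" in spec:
            found.extend(fields_copied_to(spec["properties"], target, path + "."))
    return sorted(found)


# Hypothetical mapping excerpt, for illustration only
sample_properties = {
    "sku": {"type": "text", "copy_to": ["search"]},
    "url_key": {"type": "text", "copy_to": ["search", "spelling"]},
    "name": {"type": "text", "copy_to": ["search", "spelling"]},
    "spelling": {"type": "text"},
}

print(fields_copied_to(sample_properties, "spelling"))
```

With this sample, sku is absent from the result while url_key is present, which is the situation described above.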

bosch-manuel avatar Oct 10 '23 07:10 bosch-manuel

Maybe I closed the ticket too early. sku is no longer copied to spelling, but with sku set to isUsedInSpellcheck = 0, the generated search query also changes: sku is now not used in the query at all.

Before:

```json
"must": {
    "bool": {
        "must": [],
        "must_not": [],
        "should": [
            {
                "multi_match": {
                    "query": "1000364",
                    "fields": [
                        "spelling.whitespace^10",
                        "name.whitespace^50",
                        "sku.whitespace^60",
                        "spelling^1",
                        "sku.sku_ngram_analyser^6"
                    ],
                    "minimum_should_match": "100%",
                    "tie_breaker": 1.0,
                    "boost": 1,
                    "type": "best_fields",
                    "cutoff_frequency": 0.15,
                    "fuzziness": "AUTO",
                    "prefix_length": 1,
                    "max_expansions": 10
                }
            },
            {
                "multi_match": {
                    "query": "1000364",
                    "fields": [
                        "spelling.phonetic^1"
                    ],
                    "minimum_should_match": "100%",
                    "tie_breaker": 1.0,
                    "boost": 1,
                    "type": "best_fields",
                    "cutoff_frequency": 0.15
                }
            }
        ],
        "minimum_should_match": 1,
        "boost": 1
    }
},
```

After disabling spellcheck for sku:

```json
  "must": {
    "bool": {
      "must": [],
      "must_not": [],
      "should": [
        {
          "multi_match": {
            "query": "1000364",
            "fields": [
              "spelling.whitespace^10",
              "name.whitespace^50",
              "spelling^1"
            ],
            "minimum_should_match": "100%",
            "tie_breaker": 1.0,
            "boost": 1,
            "type": "best_fields",
            "cutoff_frequency": 0.15,
            "fuzziness": "AUTO",
            "prefix_length": 1,
            "max_expansions": 10
          }
        },
        {
          "multi_match": {
            "query": "1000364",
            "fields": [
              "spelling.phonetic^1"
            ],
            "minimum_should_match": "100%",
            "tie_breaker": 1.0,
            "boost": 1,
            "type": "best_fields",
            "cutoff_frequency": 0.15
          }
        }
      ],
      "minimum_should_match": 1,
      "boost": 1
    }
  },
  "boost": 1
}
```
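The difference between the two queries is easier to see by comparing their field lists. A small sketch, with the field lists copied from the two queries above and all other parameters omitted; the helper name `multi_match_fields` is made up for this example.

```python
def multi_match_fields(bool_query):
    """Collect every field listed in the multi_match clauses under `should`."""
    fields = set()
    for clause in bool_query.get("should", []):
        fields.update(clause.get("multi_match", {}).get("fields", []))
    return fields


# Field lists copied from the "before" query (other parameters omitted)
before = {"should": [
    {"multi_match": {"fields": [
        "spelling.whitespace^10",
        "name.whitespace^50",
        "sku.whitespace^60",
        "spelling^1",
        "sku.sku_ngram_analyser^6",
    ]}},
    {"multi_match": {"fields": ["spelling.phonetic^1"]}},
]}

# Field lists copied from the "after" query (other parameters omitted)
after = {"should": [
    {"multi_match": {"fields": [
        "spelling.whitespace^10",
        "name.whitespace^50",
        "spelling^1",
    ]}},
    {"multi_match": {"fields": ["spelling.phonetic^1"]}},
]}

dropped = sorted(multi_match_fields(before) - multi_match_fields(after))
added = sorted(multi_match_fields(after) - multi_match_fields(before))
print("dropped:", dropped)
print("added:", added)
```

Only the two sku fields disappear and nothing is added, so the remaining clauses are otherwise unchanged.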

bosch-manuel avatar Oct 10 '23 09:10 bosch-manuel

Maybe I'm confusing something. I actually want to do a fuzzy search on everything except sku and additionally match on sku using its default search analyzer (non fuzzy).

Is this something that can easily be achieved by tuning some settings?
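In plain Elasticsearch query DSL, that combination could be written as two `should` clauses: a fuzzy `multi_match` on the non-sku fields and a separate `multi_match` on the sku fields without `fuzziness`. This is only a sketch of the target query shape. The function name `build_query` is hypothetical, the field names and boosts are taken from the queries above, and nothing here shows that ElasticSuite can be configured to generate exactly this.

```python
import json


def build_query(text):
    """Fuzzy match on the general fields OR non-fuzzy match on the sku fields."""
    fuzzy_clause = {
        "multi_match": {
            "query": text,
            "fields": ["spelling.whitespace^10", "name.whitespace^50", "spelling^1"],
            "minimum_should_match": "100%",
            "type": "best_fields",
            "fuzziness": "AUTO",
            "prefix_length": 1,
            "max_expansions": 10,
        }
    }
    sku_clause = {
        "multi_match": {
            "query": text,
            "fields": ["sku.whitespace^60", "sku.sku_ngram_analyser^6"],
            "minimum_should_match": "100%",
            "type": "best_fields",
            # No "fuzziness" key: sku is matched only through its own analyzers
        }
    }
    return {"bool": {"should": [fuzzy_clause, sku_clause], "minimum_should_match": 1}}


print(json.dumps(build_query("1000364"), indent=2))
```

Keeping sku in its own clause means the fuzzy settings never apply to it, while a hit on either clause is still enough for a document to match.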

bosch-manuel avatar Oct 10 '23 09:10 bosch-manuel

I'll check out the recently added experimental features related to the ngram analyzer. Sounds like a solution for my issue.

bosch-manuel avatar Oct 12 '23 06:10 bosch-manuel

Hello @bosch-manuel,

Where does this `sku.sku_ngram_analyser` come from? Is this some custom analyzer you defined for the sku?

Regards,

rbayet avatar Oct 12 '23 08:10 rbayet

Yes, it's a custom analyser. Our use case requires different ngram sizes and additional char filters.
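For readers unfamiliar with this kind of setup, here is a rough sketch of what an edge ngram analyzer with a char filter can look like in raw Elasticsearch index settings. Everything in it is an assumption made for illustration (the filter names, gram sizes, and pattern); it is not the actual configuration used here, and ElasticSuite declares its analyzers through its own configuration files, not through a dict like this. The `edge_ngrams` helper is a plain-Python emulation of the prefixes such a filter emits.

```python
import json

# Hypothetical Elasticsearch analysis settings; all names and values are assumptions
analysis_settings = {
    "analysis": {
        "char_filter": {
            # Strip common separators before tokenizing
            "strip_separators": {
                "type": "pattern_replace",
                "pattern": "[-_./ ]",
                "replacement": "",
            }
        },
        "filter": {
            "sku_edge_ngram": {"type": "edge_ngram", "min_gram": 3, "max_gram": 10}
        },
        "analyzer": {
            "sku_ngram_analyser": {
                "type": "custom",
                "char_filter": ["strip_separators"],
                "tokenizer": "keyword",
                "filter": ["lowercase", "sku_edge_ngram"],
            }
        },
    }
}


def edge_ngrams(token, min_gram, max_gram):
    """Emulate the prefixes an edge_ngram token filter emits for one token."""
    upper = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, upper + 1)]


print(json.dumps(analysis_settings, indent=2))
print(edge_ngrams("1000364", 3, 5))
```

Because only prefixes are indexed, a partial product number can match without any fuzziness being involved.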

bosch-manuel avatar Oct 12 '23 09:10 bosch-manuel

I tried these experimental features:

- [Experimental] Use default analyzer in exact matching filter query
- [Experimental] Use all tokens from term vectors
- [Experimental] Use edge ngram analyzer in term vectors

This should actually cover my case: sku can be excluded from spellchecks, and the spellchecker should return SPELLING_TYPE_EXACT, since the edge ngram analyzer is considered in term vectors.

Unfortunately, my custom edge ngram analyzer is not supported by this feature. Everything is strictly tied to the predefined standard_edge_ngram analyzer.

I tried to override standard_edge_ngram via XML, but this leads to an error during indexing.

bosch-manuel avatar Oct 17 '23 07:10 bosch-manuel

Hi @bosch-manuel

What's the status of this issue? Can you add more details if possible?

Regards

romainruaud avatar Jul 15 '24 13:07 romainruaud

This issue was waiting update from the author for too long. Without any update, we are unfortunately not sure how to resolve this issue. We are therefore reluctantly going to close this bug for now. Please don't hesitate to comment on the bug if you have any more information for us; we will reopen it right away! Thanks for your contribution.

github-actions[bot] avatar Jul 29 '24 14:07 github-actions[bot]