Can't find 'œ' when typed as 'oe'
Searching for "Le Boeuf sur le Toit" doesn't return the Paris restaurant, which is written as "Le Bœuf sur le Toit".
Oddly enough, if you try the analyzers for indexing and search, they both expand the name correctly to "le boeuf sur le toit". An n-gram search for "boe" still yields the result; with "boeu" the search fails.
A test case for this specific term also passes without issues, so this seems to be something that only appears with a larger database.
The search works on small datasets because there are few tokens, and the n-grams for "Bœuf" (expanded to "Boeuf") easily match short prefixes like "boe". On a large dataset, however, the combination of edge n-grams and ligature expansion can leave a slightly longer prefix like "boeu" without any matching token, even though "boeuf" exists: the index splits or normalizes the token in a way that "boeu" simply never gets stored. Replacing asciifolding with icu_folding should resolve this, because it handles all Unicode characters consistently at both indexing and search time.
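If you want to see which edge n-grams actually get stored, the _analyze API also accepts inline filter definitions. The min_gram/max_gram values below are only illustrative, not necessarily Photon's real settings:

me@machine:~$ curl -XGET "http://localhost:9201/photon/_analyze" -H 'Content-Type: application/json' -d'
{
"tokenizer": "whitespace",
"filter": ["lowercase", {"type": "edge_ngram", "min_gram": 3, "max_gram": 5}],
"text": "Boeuf"
}
' | jq '.tokens[].token'
"boe"
"boeu"
"boeuf"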
The ICU folding filter is more powerful than asciifolding: it applies case and accent folding across all of Unicode, not just the Latin-range mappings that asciifolding covers.
In opensearch/IndexSettingBuilder.java:
final var NORMALIZATION_FILTERS = List.of(
        "lowercase",
        "german_normalization",
        "asciifolding"  // ← MAYBE THIS IS THE PROBLEM!
);
ICU understands complex characters like "œ", "æ", "ß".
However, the consistency of this “Better Unicode Handling Everywhere” needs to be tested for non-European scripts (Greek, Cyrillic, Arabic, etc.). ICU might transliterate them, but it could produce unexpected ASCII sequences.
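Before switching, it would be worth verifying with the same _analyze trick that icu_folding really expands "œ" the way asciifolding does (this needs the analysis-icu plugin installed; the request is only a sketch, I haven't run it):

me@machine:~$ curl -XGET "http://localhost:9201/photon/_analyze" -H 'Content-Type: application/json' -d'
{
"tokenizer": "whitespace",
"filter": ["icu_folding"],
"text": "Bœuf Straße Résumé"
}
' | jq '.tokens[].token'

Diacritics ("Résumé" → "resume") and "ß" → "ss" should be safe bets, since ICU folding includes full case folding; what it does with ligatures like "œ" is exactly the part to check.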
I hope this helps, as I don't have a system to test with huge data.
The asciifolding filter isn't the issue; it clearly knows about the œ:
me@machine:~$ curl -XGET "http://localhost:9201/photon/_analyze" -H 'Content-Type: application/json' -d'
{
"tokenizer": "whitespace",
"filter": ["asciifolding"],
"text": "Bœuf Boeuf"
}
' | jq
{
"tokens": [
{
"token": "Boeuf",
"start_offset": 0,
"end_offset": 4,
"type": "word",
"position": 0
},
{
"token": "Boeuf",
"start_offset": 5,
"end_offset": 10,
"type": "word",
"position": 1
}
]
}
Trying this again shows that the problem is actually in using german_normalization and asciifolding together:
me@machine:~$ curl -XGET "http://localhost:9201/photon/_analyze" -H 'Content-Type: application/json' -d'
{
"tokenizer": "whitespace",
"filter": ["german_normalization", "asciifolding"],
"text": "Bœuf Boeuf",
"explain": true
}
' | jq
{
"detail": {
"custom_analyzer": true,
"charfilters": [],
"tokenizer": {
"name": "whitespace",
"tokens": [
{
"token": "Bœuf",
"start_offset": 0,
"end_offset": 4,
"type": "word",
"position": 0,
"bytes": "[42 c5 93 75 66]",
"positionLength": 1,
"termFrequency": 1
},
{
"token": "Boeuf",
"start_offset": 5,
"end_offset": 10,
"type": "word",
"position": 1,
"bytes": "[42 6f 65 75 66]",
"positionLength": 1,
"termFrequency": 1
}
]
},
"tokenfilters": [
{
"name": "german_normalization",
"tokens": [
{
"token": "Bœuf",
"start_offset": 0,
"end_offset": 4,
"type": "word",
"position": 0,
"bytes": "[42 c5 93 75 66]",
"positionLength": 1,
"termFrequency": 1
},
{
"token": "Bouf",
"start_offset": 5,
"end_offset": 10,
"type": "word",
"position": 1,
"bytes": "[42 6f 75 66]",
"positionLength": 1,
"termFrequency": 1
}
]
},
{
"name": "asciifolding",
"tokens": [
{
"token": "Boeuf",
"start_offset": 0,
"end_offset": 4,
"type": "word",
"position": 0,
"bytes": "[42 6f 65 75 66]",
"positionLength": 1,
"termFrequency": 1
},
{
"token": "Bouf",
"start_offset": 5,
"end_offset": 10,
"type": "word",
"position": 1,
"bytes": "[42 6f 75 66]",
"positionLength": 1,
"termFrequency": 1
}
]
}
]
}
}
So german_normalization leaves "œ" untouched but rewrites "oe" to "o", while asciifolding afterwards expands "œ" to "oe": "Bœuf" ends up indexed as "boeuf", but a typed "Boeuf" becomes "bouf", and the two never match. Switching the two filters should fix the problem. I just need to ponder a bit more whether it adds new ones.
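For reference, this is the same request with the filters swapped; based on the outputs above, both spellings should now collapse to the same token (expected output, not re-run):

me@machine:~$ curl -XGET "http://localhost:9201/photon/_analyze" -H 'Content-Type: application/json' -d'
{
"tokenizer": "whitespace",
"filter": ["asciifolding", "german_normalization"],
"text": "Bœuf Boeuf"
}
' | jq '.tokens[].token'
"Bouf"
"Bouf"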
I'll have a look into the ICU folding filter. Looks promising.
I'm not sure if this is related, but I have a similar issue with the French city Moëns. The query with the diacritic gives me the right city; the query without returns a lot of cities named Mons, but not Moëns.
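That looks like it could be the same filter-order problem: german_normalization rewrites "oe" to "o", so a query typed as "Moens" gets normalized to "Mons", while the indexed "Moëns" keeps its "oe" because the "ë" is only folded to "e" afterwards by asciifolding. The same _analyze test should make it visible (a sketch, untested):

me@machine:~$ curl -XGET "http://localhost:9201/photon/_analyze" -H 'Content-Type: application/json' -d'
{
"tokenizer": "whitespace",
"filter": ["german_normalization", "asciifolding"],
"text": "Moëns Moens"
}
' | jq '.tokens[].token'
"Moens"
"Mons"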