Indexing and normalisation of Cyrillic characters

Open taygun opened this issue 3 years ago • 2 comments

Describe the bug

When searching for the address ("Олега Оникієнка вулиця 77а") of this OSM place, no results are returned. The issue seems to be caused by the fact that the address is indexed with the Cyrillic "a". If the search query contains the Cyrillic character "a", the above address is returned.

Steps to Reproduce

Steps to reproduce the behavior:

No results returned when searched with Latin Small Letter A: pelias.github.io

Result returned when searched with Cyrillic Small Letter A: pelias.github.io

Expected behavior

Expected the address to be returned when using the Latin character.
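To make the difference between the two queries concrete, here is a minimal Python sketch showing that the two letters look identical but are distinct code points. The variable names and the shortened "77a" query strings are illustrative only; the escapes are written out explicitly so the two forms cannot be confused.

```python
import unicodedata

# Illustrative house-number fragments; only the last character differs.
latin_query = "77a"          # ends with U+0061
cyrillic_query = "77\u0430"  # ends with U+0430


def describe(text):
    """Return (code point, Unicode name) pairs for each character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


print(describe(latin_query[-1]))     # [('U+0061', 'LATIN SMALL LETTER A')]
print(describe(cyrillic_query[-1]))  # [('U+0430', 'CYRILLIC SMALL LETTER A')]

# The strings are not equal, and standard Unicode normalisation (NFKC)
# does not turn the Cyrillic letter into the Latin one.
print(latin_query == cyrillic_query)                                 # False
print(unicodedata.normalize("NFKC", cyrillic_query) == latin_query)  # False
```

Because Unicode normalisation alone does not unify the two letters, any matching between them has to come from an explicit folding or transliteration step at index and query time.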

taygun, Sep 05 '22 09:09

Hmm, yes, I can confirm the issue you are seeing. It seems to affect queries to the /v1/autocomplete endpoint but not the /v1/search endpoint, which helps narrow down the scope.

We use the icu-folding filter in Elasticsearch to 'fold' the Cyrillic form to the Latin form.

It seems as though we are using this filter correctly in all of the analyzers, with the exception of peliasHousenumber which has a numeric character filter, and so it doesn't apply.

I'm not really sure what's going on here; the expected behaviour is that we fold Cyrillic to ASCII for precisely this purpose.
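One way to check what an analyzer actually emits for each form is Elasticsearch's `_analyze` API. The following is only a rough sketch: the base URL and the index name `pelias` are assumptions about a local setup, and the helper names are made up for illustration. Only the `peliasHousenumber` analyzer name comes from this thread.

```python
import json
import urllib.request


def build_analyze_request(text, analyzer,
                          base_url="http://localhost:9200",  # assumed local cluster
                          index="pelias"):                   # assumed index name
    """Build (but do not send) a POST request to the _analyze API."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{index}/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def analyze_tokens(request):
    """Send the request and return the token strings from the response."""
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    return [token["token"] for token in payload["tokens"]]


# Usage against a running cluster (not executed here):
#   for text in ("77a", "77\u0430"):
#       request = build_analyze_request(text, "peliasHousenumber")
#       print(repr(text), analyze_tokens(request))
```

If the two inputs produce different tokens for a given analyzer, that analyzer is not unifying the Latin and Cyrillic forms, which would be consistent with the behaviour reported above.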

missinglink, Sep 06 '22 12:09

Ah, very nice discovery @missinglink. I think we originally discovered this issue back in https://github.com/pelias/pelias/issues/833 but never narrowed down the cause.

It feels like adding the icu-folding filter is relatively safe, maybe we should try that out?

orangejulius, Sep 06 '22 13:09