
Use UTF8 Transliterate together with Marisa index

Open Karry opened this issue 3 years ago • 2 comments

Regular address search may use the UTF-8 transliterate method to match results when the search term doesn't match exactly. For example, the term "trebon" matches the region "Třeboň". But prefix search based on the Marisa library requires an exact match, so the term "jested" doesn't match the hill "Ještěd". This could be solved by importing transliterated strings into the Marisa indexes. This approach was rejected when iconv-based transliteration was used, because it is platform and locale dependent. But now we have the platform-independent utf8helper, which may help with this task.
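To make the idea concrete, here is a minimal sketch of indexing folded keys and folding the query the same way. It rests on several assumptions: `FoldKey` is a hypothetical stand-in that handles only a few Czech letters (it is not the utf8helper API), and a `std::multimap` stands in for the Marisa trie so the example stays self-contained. The class name `FoldedPrefixIndex` is also made up for illustration.

```cpp
#include <cctype>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a real transliteration routine: folds a handful
// of UTF-8 encoded Czech letters to lowercase ASCII and lowercases the rest.
std::string FoldKey(const std::string& utf8)
{
  static const std::vector<std::pair<std::string, std::string>> table = {
    {"ř", "r"}, {"Ř", "r"}, {"ň", "n"}, {"Ň", "n"},
    {"š", "s"}, {"Š", "s"}, {"ě", "e"}, {"Ě", "e"}
  };

  std::string out;
  std::size_t i = 0;
  while (i < utf8.size()) {
    bool replaced = false;
    for (const auto& entry : table) {
      if (utf8.compare(i, entry.first.size(), entry.first) == 0) {
        out += entry.second;
        i += entry.first.size();
        replaced = true;
        break;
      }
    }
    if (!replaced) {
      // Bytes outside the table are copied; ASCII letters are lowercased.
      out += static_cast<char>(std::tolower(static_cast<unsigned char>(utf8[i])));
      ++i;
    }
  }
  return out;
}

// Sorted map from folded key to original name (stand-in for a Marisa trie).
class FoldedPrefixIndex
{
public:
  // Import time: store the name under its folded key.
  void Add(const std::string& name)
  {
    entries.emplace(FoldKey(name), name);
  }

  // Query time: fold the term, then collect every key starting with it.
  std::vector<std::string> PrefixSearch(const std::string& term) const
  {
    const std::string prefix = FoldKey(term);
    std::vector<std::string> result;
    for (auto it = entries.lower_bound(prefix);
         it != entries.end() &&
         it->first.compare(0, prefix.size(), prefix) == 0;
         ++it) {
      result.push_back(it->second);
    }
    return result;
  }

private:
  std::multimap<std::string, std::string> entries;
};
```

In this sketch only the folded key is indexed and the original name is kept as the value, which corresponds to the "transliterated strings only" variant. Indexing the original string as a second key would be the alternative whose size cost is still to be measured.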

It would be great to measure the size difference of the Marisa index when both the original string and the transliterated string are imported, or to find out whether importing only the transliterated strings is enough for common use cases.

What do you think @janbar ?

Karry, Jan 31 '22 23:01

I vote in favor of doing this transliteration for Marisa search, at least as an option at import time. I asked for this a long time ago because, as it is, Marisa search is almost unusable without some normalization and transliteration. Even capitalization matters in the search, and in French there is often a "-" between compound names, but not always...

vyskocil, Feb 07 '22 10:02

It would be a great thing.

It could be: TransformTransliterate + TransformNormalize. It transliterates extended characters, then normalizes, i.e. removes extra spaces, separators, and punctuation.
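As a rough illustration of the normalization step only, the sketch below treats every run of whitespace, separators, and punctuation as a single space, trims both ends, and lowercases ASCII letters. `NormalizeKey` is a hypothetical helper written for this example. It is not the utf8helper implementation, and it assumes the input has already been transliterated to ASCII.

```cpp
#include <cctype>
#include <string>

// Hypothetical normalization step (not the libosmscout API): collapses runs
// of whitespace and ASCII punctuation into one space and lowercases letters.
std::string NormalizeKey(const std::string& text)
{
  std::string out;
  bool pendingSpace = false;

  for (unsigned char c : text) {
    if (std::isspace(c) || std::ispunct(c)) {
      // Remember a separator, but never emit one at the very start.
      pendingSpace = !out.empty();
      continue;
    }
    if (pendingSpace) {
      out += ' ';
      pendingSpace = false;
    }
    out += static_cast<char>(std::tolower(c));
  }

  // A trailing separator is left pending and therefore dropped.
  return out;
}
```

With this, "Saint-Jean de Luz" and "Saint Jean-de-Luz" both normalize to the same key, so a hyphenated and an unhyphenated spelling would hit the same index entry, provided the query is normalized the same way as the imported names.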

janbar, Feb 09 '22 19:02