wikdict-web

Accent-insensitive search for Greek

Open karlb opened this issue 4 years ago • 5 comments

Accent-insensitive search works for Latin characters, but not for Greek characters. Searching for "κοσμος" should yield results for "κόσμος".

ICU support could help with this, but it is unfortunately not easy to enable; see https://github.com/karlb/wikdict-web/issues/14.

karlb avatar Jan 05 '22 14:01 karlb

I could write a custom tokenizer using https://github.com/hideaki-t/sqlite-fts-python/. It could remove the diacritics with one of the approaches from https://stackoverflow.com/questions/517923/what-is-the-best-way-to-remove-accents-normalize-in-a-python-unicode-string.
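As a rough illustration of the diacritic-removal step only (not the tokenizer itself), here is a minimal sketch using Python's standard `unicodedata` module. The function name `strip_accents` is made up for this example. It decomposes each character with NFD and drops the combining marks, which is one of the approaches commonly suggested for this problem.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Return text with combining diacritical marks removed (hypothetical helper)."""
    # NFD splits precomposed characters such as "ό" into a base letter
    # followed by one or more combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() returns a non-zero class for combining marks,
    # so keeping only class-0 characters drops the accents.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


if __name__ == "__main__":
    print(strip_accents("κόσμος"))
    print(strip_accents("passa på"))
```

Letters without a canonical decomposition, such as "ø" or "ł", pass through unchanged, so this would not make every language fully accent-insensitive.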

karlb avatar Jan 05 '22 16:01 karlb

The same problem exists for Swedish, where https://www.wikdict.com/de-sv/passa%20p%C3%A5 works but https://www.wikdict.com/de-sv/passa%20pa doesn't.

karlb avatar Jul 30 '22 12:07 karlb

If I ever want to move off SQLite, https://duckdb.org/ seems to have a better choice of tokenizers while keeping many of SQLite's benefits.

karlb avatar Jul 30 '22 12:07 karlb

Using stemmers from https://github.com/abiliojr/fts5-snowball should also solve the problem. I'm not sure how much stemming should be done on a dictionary, though.

karlb avatar Sep 18 '22 12:09 karlb

The unaccent function from sqlean's unicode SQLite extension can be used to remove the accents:

sqlite> .load ./unicode
sqlite> SELECT unaccent('κόσμος');
unaccent('κόσμος')
------------------
κοσμος            

This does not yet integrate accent removal with the FTS index, but that should be doable.
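One possible way to wire accent removal into an FTS index is to store an accent-stripped copy of each headword in the indexed column and to strip the query the same way before matching. The sketch below is only an assumption about how that could look, not how wikdict-web does it: the table name `entry_fts`, its columns, and the helpers are all hypothetical. It uses Python's `unicodedata` in place of sqlean's `unaccent`, and it assumes the Python `sqlite3` build includes FTS5.

```python
import sqlite3
import unicodedata


def strip_accents(text: str) -> str:
    """Drop combining marks after NFD decomposition (hypothetical helper)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def build_index(words):
    """Create an in-memory FTS5 table (hypothetical schema) and fill it."""
    conn = sqlite3.connect(":memory:")
    # "written" keeps the original spelling for display and is not indexed;
    # "folded" holds the accent-stripped form that FTS5 actually searches.
    conn.execute(
        "CREATE VIRTUAL TABLE entry_fts USING fts5(written UNINDEXED, folded)"
    )
    conn.executemany(
        "INSERT INTO entry_fts (written, folded) VALUES (?, ?)",
        [(word, strip_accents(word)) for word in words],
    )
    return conn


def lookup(conn, query):
    """Return original spellings whose folded form matches the folded query."""
    # Quote the query as an FTS5 string so user input is treated as a phrase
    # rather than as FTS5 query syntax.
    phrase = '"' + strip_accents(query).replace('"', '""') + '"'
    rows = conn.execute(
        "SELECT written FROM entry_fts WHERE entry_fts MATCH ?", (phrase,)
    )
    return [row[0] for row in rows]


if __name__ == "__main__":
    conn = build_index(["κόσμος", "passa på"])
    print(lookup(conn, "κοσμος"))
    print(lookup(conn, "passa pa"))
```

The cost of this design is a second copy of each headword in the index. A custom tokenizer would avoid that duplication but is more work to set up.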

karlb avatar Feb 16 '23 14:02 karlb