wikdict-web Accent-insensitive search for Greek

Accent-insensitive search works for Latin characters, but not for Greek ones. Searching for "κοσμος" should yield results for "κόσμος".

ICU support could help with this, but it is unfortunately not easy to enable; see https://github.com/karlb/wikdict-web/issues/14.

Jan 05 '22 14:01 karlb

I could write a custom tokenizer using https://github.com/hideaki-t/sqlite-fts-python/. It could perhaps remove the diacritics with one of the approaches from https://stackoverflow.com/questions/517923/what-is-the-best-way-to-remove-accents-normalize-in-a-python-unicode-string.
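
For reference, a minimal sketch of the diacritic-removal step using only Python's standard library. It decomposes the text (NFD) and drops combining marks. The helper name `strip_accents` is made up for illustration, and this does not show how to wire it into a tokenizer.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Return text with combining marks (accents, tonos, rings) removed."""
    # NFD splits precomposed characters such as "ό" into a base letter
    # followed by one or more combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Keep only characters that are not combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


if __name__ == "__main__":
    print(strip_accents("κόσμος"))
    print(strip_accents("passa på"))
```

Note that this also folds letters that some languages treat as distinct (Swedish "å" becomes "a"), which is what the search needs but may be too aggressive for display purposes.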

Jan 05 '22 16:01 karlb

The same problem exists for Swedish, where https://www.wikdict.com/de-sv/passa%20p%C3%A5 works but https://www.wikdict.com/de-sv/passa%20pa doesn't.

Jul 30 '22 12:07 karlb

If I ever want to move off SQLite, https://duckdb.org/ seems to offer a better choice of tokenizers while keeping many of SQLite's benefits.

Jul 30 '22 12:07 karlb

Using stemmers from https://github.com/abiliojr/fts5-snowball should also solve the problem. I'm not sure how much stemming should be done on a dictionary, though.

Sep 18 '22 12:09 karlb

The unaccent function from sqlean's unicode SQLite extension can be used to remove the accents:

sqlite> .load ./unicode
sqlite> SELECT unaccent('κόσμος');
unaccent('κόσμος')
------------------
κοσμος

This does not yet integrate with the FTS index, but that should be doable.
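
One possible way to get there, sketched in Python rather than with the sqlean extension: store an unaccented copy of each headword in the FTS5 table and strip accents from the query before matching. The table and column names (`search`, `headword`, `normalized`) and the helper functions are hypothetical and not wikdict's actual schema, and the sketch assumes the SQLite build used by Python has FTS5 enabled.

```python
import sqlite3
import unicodedata


def strip_accents(text: str) -> str:
    """Remove combining marks after NFD decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def build_index(headwords):
    """Create an in-memory FTS5 table with an unaccented shadow column."""
    conn = sqlite3.connect(":memory:")
    # "headword" keeps the original spelling for display and is not indexed;
    # "normalized" holds the unaccented form that is actually searched.
    conn.execute(
        "CREATE VIRTUAL TABLE search USING fts5(headword UNINDEXED, normalized)"
    )
    conn.executemany(
        "INSERT INTO search (headword, normalized) VALUES (?, ?)",
        [(word, strip_accents(word)) for word in headwords],
    )
    return conn


def search(conn, query):
    """Return original headwords matching the query, ignoring accents."""
    # Wrap the query in double quotes so FTS5 treats it as a plain phrase
    # instead of interpreting query syntax.
    phrase = '"' + strip_accents(query).replace('"', '""') + '"'
    rows = conn.execute(
        "SELECT headword FROM search WHERE normalized MATCH ? ORDER BY rank",
        (phrase,),
    )
    return [row[0] for row in rows]


if __name__ == "__main__":
    conn = build_index(["κόσμος", "passa på", "λόγος"])
    print(search(conn, "κοσμος"))
    print(search(conn, "passa pa"))
```

The trade-off is a duplicated column and the need to normalize at both index and query time; a custom tokenizer would avoid the extra column but is more work to set up.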

Feb 16 '23 14:02 karlb