We should standardize on Unicode Normalization Form C

Open martindholmes opened this issue 3 years ago • 4 comments

All of our tokenizing and stemming should incorporate Unicode normalization. While working on a dictionary project, I realized that users will frequently type search queries using keyboard setups that don't generate form C, and if our stemmers etc. are not expecting this, we'll miss the occasional hit. I think the XSLT and JS stemmers should all normalize to form C prior to stemming, the tokenizer should normalize all search contexts, and an extra step in the search page should pre-normalize the entire query. I think this qualifies as a potential bug.
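To illustrate the kind of miss described here, the following is a minimal sketch rather than the project's actual code. It uses the standard `String.prototype.normalize` method; the helper name `toNFC` is hypothetical and only stands in for whatever step would run before stemming or index lookup.

```javascript
// Two encodings of "café": precomposed é (U+00E9) vs. e + combining acute accent (U+0301).
const composed = "caf\u00E9";
const decomposed = "cafe\u0301";

// Hypothetical helper (not a staticSearch function): bring a token to NFC
// before it is stemmed or compared against indexed terms.
function toNFC(token) {
  return token.normalize("NFC");
}

// Without normalization the two strings render identically but compare as different.
console.log(composed === decomposed);               // false
// After normalizing both sides to form C they match.
console.log(toNFC(composed) === toNFC(decomposed)); // true
// The decomposed form has one extra code unit until it is normalized.
console.log(decomposed.length, toNFC(decomposed).length); // 5 4
```

If the index was built from form C text and the query arrives in decomposed form, an exact-match lookup fails in the same way as the first comparison above.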

martindholmes, Nov 01 '21 19:11

I don't have much to add other than this all sounds good to me :-)

joeytakeda, Nov 02 '21 21:11

Working on this in branch issue-179-unicode-nfc.

martindholmes, Dec 16 '21 17:12

I think this is done now.

martindholmes, Jan 18 '22 16:01

I'm re-opening this because I don't believe I actually finished it: all the changes to the stemmers were made, but the crucial change to normalize the search box input was not. I think this should be retrofitted to 1.4 and released as a bugfix. The JS stemmers all do the normalization, but I think it would also help to normalize the initial input prior to any other processing. The typeahead controls should also normalize their input before processing.
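The remaining work could look roughly like the sketch below. This is an assumption about the shape of the fix, not the code that shipped: `normalizeQuery`, `tokenizeQuery`, and `typeaheadMatches` are hypothetical names, and the whitespace tokenizer is only a stand-in for the real query parsing.

```javascript
// Hypothetical helper: normalize the raw search-box string to NFC
// before any other processing sees it.
function normalizeQuery(rawInput) {
  return rawInput.normalize("NFC");
}

// Hypothetical stand-in for later query processing; it splits on whitespace
// only to show that normalization happens first.
function tokenizeQuery(rawInput) {
  return normalizeQuery(rawInput)
    .trim()
    .split(/\s+/)
    .filter((token) => token.length > 0);
}

// Hypothetical typeahead filter: normalize both the typed prefix and each
// candidate so that decomposed input still matches form C suggestions.
function typeaheadMatches(typed, candidates) {
  const prefix = normalizeQuery(typed).toLowerCase();
  return candidates.filter((candidate) =>
    candidate.normalize("NFC").toLowerCase().startsWith(prefix)
  );
}

// Decomposed input (e + U+0301) comes out as precomposed tokens.
const tokens = tokenizeQuery("  cafe\u0301  re\u0301sume\u0301 ");
console.log(tokens);

// The decomposed prefix matches the precomposed suggestion, but not "cafeteria".
const suggestions = typeaheadMatches("cafe\u0301", ["caf\u00E9 au lait", "cafeteria"]);
console.log(suggestions);
```

Normalizing once at the entry point means the stemmers' own normalization becomes a safety net rather than the only line of defence, and the typeahead controls get the same treatment as the main search box.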

martindholmes, Nov 29 '22 17:11