staticSearch
We should standardize on Unicode Normalization Form C
All of our tokenizing and stemming should incorporate Unicode normalization. While working on a dictionary project, I realized that users will frequently type search queries using keyboard setups that don't generate Form C, and if our stemmers etc. are not expecting this, we'll miss the occasional hit. I think the XSLT and JS stemmers should all normalize to Form C prior to stemming, the tokenizer should normalize all search contexts, and an extra step in the search page should pre-normalize the entire query. I think this qualifies as a potential bug.
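To illustrate the mismatch, here is a minimal sketch using the standard `String.prototype.normalize()` method. The helper name `normalizeNfc` is hypothetical and not taken from the staticSearch codebase; the point is only that a composed and a decomposed "é" compare as different strings until both are brought to Form C.

```javascript
// Hypothetical helper for illustration; not part of the staticSearch codebase.
function normalizeNfc(text) {
  // String.prototype.normalize is standard ECMAScript (ES2015 and later).
  return text.normalize('NFC');
}

const composed = '\u00E9';    // "é" as a single precomposed code point
const decomposed = 'e\u0301'; // "e" followed by a combining acute accent

// Without normalization, the two spellings do not match.
console.log(composed === decomposed); // false

// After normalizing both sides to Form C, they match.
console.log(normalizeNfc(composed) === normalizeNfc(decomposed)); // true
```

If the index is built from Form C text, a query typed in decomposed form would miss until it is normalized the same way.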
I don't have much to add other than this all sounds good to me :-)
Working on this in branch issue-179-unicode-nfc
I think this is done now.
I'm re-opening this because I don't believe I actually finished it: all the changes to the stemmers were made, but the crucial change to normalize the search box input was not. I think this should be retrofitted to 1.4 and a bugfix release done. The JS stemmers all do the normalization, but I think it would also help to normalize the initial input prior to any other processing. The typeahead controls should also normalize their input before processing.
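As a rough sketch of what I mean, assuming nothing about how the search page is actually structured: the function names `prepareQuery` and `typeaheadMatches` below are hypothetical, and the whitespace tokenizing is a stand-in for whatever the real code does. The idea is just that the raw input is normalized once at the entry point, and that the typeahead normalizes both the typed prefix and the candidates before comparing.

```javascript
// Hypothetical names throughout; a sketch, not staticSearch's actual code.

// Normalize the whole query once, before any tokenizing or stemming.
function prepareQuery(rawInput) {
  const normalized = rawInput.normalize('NFC');
  // Placeholder tokenizing: split on whitespace and drop empty tokens.
  return normalized.split(/\s+/).filter((token) => token.length > 0);
}

// Normalize both the typed prefix and each candidate before comparing,
// so the comparison does not depend on how either string was encoded.
function typeaheadMatches(rawPrefix, candidates) {
  const prefix = rawPrefix.normalize('NFC');
  return candidates.filter((candidate) =>
    candidate.normalize('NFC').startsWith(prefix)
  );
}

// Decomposed input (base letter + combining accent) from the user.
const tokens = prepareQuery('cafe\u0301  cre\u0300me');
const matches = typeaheadMatches('cafe\u0301', ['caf\u00E9s', 'cafeteria']);
```

Here `tokens` holds the two words in precomposed form, and `matches` contains only the first candidate, since "cafeteria" begins with an unaccented "e".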