Unicode normalization
See https://github.com/albertlauncher/albert/issues/707#issuecomment-451973699
@Sharsie maybe you can help me with some knowledge. I know nothing about normalization. Sure, I am going to look things up, but maybe you can give me some easily digestible information, especially about these forms like C, D, KC, KD, etc. QString has a function for it, but the docs are pretty quiet about what it actually does.
https://doc.qt.io/qt-6/qstring.html#normalized
@ManuelSchneid3r It pretty much just follows the Unicode standard for normalization. This article describes the process in detail.
In short: D stands for decomposition, C stands for decomposition followed by composition, and K stands for compatibility.
So basically (very basically... and inaccurately), all normalization forms will first take the input character and decompose (split) it into its canonical representation in Unicode.
A good example (in the article as well) is the Ohm symbol you know from electricity. It's written as a Greek letter Omega. So if you run the Ohm sign (`\u2126`) through any form of normalization, it will be converted to Omega (`\u03A9`), even though the two look the same. In this case, both D and C (and their K variants) do the same thing, because Ohm decomposes into the single Unicode character Omega (`\u03A9`) and nothing else, so even if composition takes place afterwards, there is only a single character to compose.
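As a quick sketch of the Ohm/Omega case, using Python's standard `unicodedata` module for illustration (albert itself would go through `QString::normalized`, which follows the same Unicode standard):

```python
import unicodedata

ohm = "\u2126"    # OHM SIGN
omega = "\u03A9"  # GREEK CAPITAL LETTER OMEGA

# The two look identical but are different code points.
assert ohm != omega

# Ohm has a singleton canonical decomposition to Omega, so every
# normalization form maps it to the same single character.
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    assert unicodedata.normalize(form, ohm) == omega
```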
Then you have stuff like the German ö, or `\u00F6` in Unicode. The decomposition takes place in all forms, converting the ö into o (`\u006F`; note that `00F6` becomes `006F`) plus the combining double-dot mark above. In the case of the D form, you end up with that decomposed pair, the plain o followed by the combining mark, because D does not compose the split characters back together. The C form, however, will combine them back, and you end up with ö again, represented by `\u00F6`, because `\u00F6` is the canonical composed version of that sequence. This description is not accurate for many other characters, but it should explain the basics.
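The same thing sketched in Python's `unicodedata` (illustration only):

```python
import unicodedata

o_umlaut = "\u00F6"  # precomposed ö

nfd = unicodedata.normalize("NFD", o_umlaut)
nfc = unicodedata.normalize("NFC", o_umlaut)

# NFD leaves the decomposed pair: 'o' + COMBINING DIAERESIS (U+0308).
assert nfd == "o\u0308"
assert len(nfd) == 2

# NFC decomposes and then composes back to the single precomposed ö.
assert nfc == "\u00F6"
assert len(nfc) == 1
```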
Then the K forms come into play. They are useful for stuff like Roman numerals, say Ⅲ or `\u2162`. What the K versions of both the C and D forms will do is check whether there is a compatible representation in canonical letters, so in the case of the Roman three, they will convert it to three separate `I`s.
Now there are some weird cases like ẛ̣, or `\u1E9B\u0323` (this one is mentioned in the article), where the K forms will convert the long s into a plain s (keeping the combining dots).
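Both K-form examples, sketched in Python (the long-s case is the one from the Unicode normalization annex):

```python
import unicodedata

roman_three = "\u2162"        # Ⅲ ROMAN NUMERAL THREE
long_s_dots = "\u1E9B\u0323"  # ẛ̣ long s with dot above, plus dot below

# Canonical forms leave compatibility characters alone...
assert unicodedata.normalize("NFC", roman_three) == "\u2162"

# ...while the K forms replace them with their compatibility equivalents.
assert unicodedata.normalize("NFKC", roman_three) == "III"
assert unicodedata.normalize("NFKD", roman_three) == "III"

# The long s becomes a plain s under the K forms; the combining dots
# survive (reordered by canonical ordering, then composed under NFKC).
assert unicodedata.normalize("NFKD", long_s_dots) == "s\u0323\u0307"
assert unicodedata.normalize("NFKC", long_s_dots) == "\u1E69"  # ṩ
```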
In summary: the right normalization form really depends on the use case. Because you do not know what a plugin developer will want to compare the input to, or do with it, I don't think it is wise to just pick one form and force it. If normalization is needed, then I think NFC is the proper one to use, because it causes the fewest side effects.
Although I would still vote against it in general, if I were to give an example, I think Ohm vs. Omega is a nice one.
Say you have a plugin where you type in `330Ω` and the plugin shows you the color coding of a resistor. Now if somebody were to type in the Ohm symbol `\u2126` (or copy the value from somewhere), using NFC would convert the query to `330Ω` with the Omega symbol (`\u03A9`). Now the plugin developer has to be aware of such normalization and has two options:
- match against both Ohm and Omega
- or, more safely, normalize the Ohm symbol using NFC and then compare it

Option 1 is fine, as long as you are aware that something like this is happening under the hood, or even exists. Option 2 is a bit wonky... the developer needs to know how the original query string was normalized, and they have to use the same normalization form to get consistent results.
I would answer with a different implementation though: the developer should normalize both the input and the string they are comparing against. Using the same normalization form on both provides consistent results. To be safe, they should normalize the input even if it was normalized before, to ensure the same form is used for the comparison. This could still produce inconsistencies if different forms are mixed, but I reckon it is safer than not doing so. In the best case they just use the same form, and because normalization is idempotent, it will not make a difference to the input.
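A minimal sketch of that "normalize both sides" idea, using Python's `unicodedata` in place of `QString::normalized` (the helper name is made up):

```python
import unicodedata

def matches(query: str, candidate: str) -> bool:
    """Hypothetical helper: normalize both sides to the same form before
    comparing, so Ohm (\\u2126) and Omega (\\u03A9) compare equal."""
    return unicodedata.normalize("NFC", query) == unicodedata.normalize("NFC", candidate)

# Raw comparison fails; normalized comparison succeeds.
assert "330\u2126" != "330\u03A9"
assert matches("330\u2126", "330\u03A9")

# Normalization is idempotent: re-normalizing already-normalized
# input makes no difference.
s = unicodedata.normalize("NFC", "330\u2126")
assert unicodedata.normalize("NFC", s) == s
```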
Optionally, multiple normalizations could take place. Example: you have a directory with movies:
- Louis Ⅳ (`\u2163`) von Österreich
- Henry III (single `I`s)

You are searching through it, and whether you type Henry Ⅲ or Henry III, both should find the movie. Same goes for Louis: Louis IV, Louis Ⅳ, Ⅳ von ost or IV von ost should all find it.
To achieve that, you will have to run the comparison using multiple normalizations separately:
- input NFD + comparison NFD
- input NFKD + comparison NFKD
- input NFD + NFKD + comparison NFD + NFKD (won't change this example, but to be on the safe side I suppose?)
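A rough sketch of that multi-form comparison (the function is made up; substring matching stands in for whatever the real search would do):

```python
import unicodedata

FORMS = ("NFD", "NFKD")

def multi_form_match(query: str, candidate: str) -> bool:
    """Hypothetical: compare under several normalization forms and
    report a match if any of them agree."""
    for form in FORMS:
        q = unicodedata.normalize(form, query)
        c = unicodedata.normalize(form, candidate)
        if q in c:
            return True
    return False

movie = "Louis \u2163 von \u00D6sterreich"  # Louis Ⅳ von Österreich

assert multi_form_match("Louis IV", movie)      # found via NFKD (Ⅳ → IV)
assert multi_form_match("Louis \u2163", movie)  # found via NFD (Ⅳ kept as-is)
```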
So I hope I didn't make a mistake and I explained the rationale why I am against this :)
To comment on #707 - I didn't fully read through it, but Unicode normalization will not help there anyway.
If I got it right, the goal is to match using fuzzing, so stuff like "I don't" matches "I dont" or even "Idont".
Normalization will probably not help there in any way... even if you do something like stripping `'`, the plugin developer would still have to strip it from their results of found mp3 files.
If there is an answer, then it's probably offering some kind of tooling from the albert core. That could be some basic normalization techniques, which the plugin developer could utilize to convert both the input query and their results... you could probably go as far as returning multiple normalized strings.
So from an input such as `I don't know Louis Ⅳ from Österreich`, you would return multiple results:
- I don't know Louis Ⅳ from Osterreich
- I don't know Louis IV from Osterreich
- I dont know Louis Ⅳ from Osterreich
- I dont know Louis IV from Osterreich
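A rough sketch of how such variant generation could look (all names are made up for illustration; diacritics are stripped by decomposing with NFD and dropping the combining marks):

```python
import unicodedata

def strip_marks(s: str) -> str:
    """Drop combining marks after NFD decomposition (Österreich → Osterreich)."""
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def variants(s: str) -> set:
    """Hypothetical helper: return the original plus several normalized
    variants (NFKC fold, diacritic-stripped, apostrophe-stripped)."""
    out = {s, unicodedata.normalize("NFKC", s)}   # Ⅳ → IV, etc.
    out |= {strip_marks(v) for v in set(out)}     # drop diacritics
    out |= {v.replace("'", "") for v in set(out)} # drop apostrophes
    return out

v = variants("I don't know Louis \u2163 from \u00D6sterreich")
assert "I dont know Louis IV from Osterreich" in v
```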
But honestly, it's hard to say what the developer wants from it. Unicode is just too broad... is `'` enough? Is there something else? Do you want to strip parentheses? Do you want to strip dots? Or is it the opposite: do you actually require dots/parentheses etc.?
It's just too specific to the use case, in my opinion.
However, fuzzing... that's another story. If you were to implement tooling for comparison using fuzzing, where one could send in an input string and a dictionary (key-value) of strings, you would search through the dictionary, normalize both the input and the dictionary values, compare them using fuzzing, and then return the keys that match, optionally with some form of score value... that would be amazing. Although I'm not sure how difficult such an implementation would be... I have worked with tools like Elasticsearch in the past, and although these tools are amazing at what they do, they always open the realm of increasingly complex crap. So unless this can be implemented in a simple manner, bundling a complex search engine in albert is probably not the right way to go :)
Do we need more than "ignore diacritics"? I don't know how to handle things like apostrophes. I guess much more knowledge of a language is required to handle things like this. Also, stemming is rather complex and probably not worth the hassle.