
Unicode normalisation across apertium tools

Open flammie opened this issue 4 years ago • 15 comments

It seems to me that a good portion of apertium IRC traffic is people checking on Unicode character variants like:

10:43 +spectie> .u ô
10:43  begiak> U+006F LATIN SMALL LETTER O (o)
10:43  begiak> U+0302 COMBINING CIRCUMFLEX ACCENT (◌̂)
10:43 +spectie> .u ô
10:43  begiak> U+00F4 LATIN SMALL LETTER O WITH CIRCUMFLEX (ô)

I think this is something that the tools should take care of somehow. I'd suggest NFC normalization for all input, perhaps with a warning in compiler-type tools. NFC is the nicest for most FSA letter automata. If agreed, this might be a good starter task for GSoC candidates?

Dec 31 '20 17:12 flammie
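For reference, the two variants from the IRC log can be compared with a few lines of Python. This is only a sketch using the standard `unicodedata` module, not anything from the apertium tools; the helper name `describe` is made up for illustration.

```python
import unicodedata

decomposed = "o\u0302"   # U+006F followed by U+0302
precomposed = "\u00f4"   # U+00F4


def describe(text):
    """Return the code point and Unicode name of each character in text."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]


print(describe(decomposed))
print(describe(precomposed))

# The two strings render the same but compare unequal until normalised.
print(decomposed == precomposed)                                # False
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
```

An analyser whose dictionary only contains the precomposed form would not match the decomposed input unless one of the two is normalised first.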

We need the non-destructive subset of NFC. E.g., we don't want "U+212B Å ANGSTROM SIGN" normalized to "U+00C5 Å LATIN CAPITAL LETTER A WITH RING ABOVE" or the other destructive transformations NFC performs.

Dec 31 '20 17:12 TinoDidriksen

ICU provides a way to define custom normalizations. The documentation isn't terribly helpful, but it looks to me like we just need to edit https://github.com/unicode-org/icu/blob/master/icu4c/source/data/unidata/norm2/nfc.txt to make a more conservative NFC and then use these instructions https://unicode-org.github.io/icu/userguide/transforms/normalization/ under this license https://www.unicode.org/license.html

Dec 31 '20 18:12 mr-martian

> We need the non-destructive subset of NFC. E.g., we don't want "U+212B Å ANGSTROM SIGN" normalized to "U+00C5 Å LATIN CAPITAL LETTER A WITH RING ABOVE" or the other destructive transformations NFC performs.

Excellent point. I am personally not very worried about the Ångström sign, but there might be something useful there as well... Perhaps we should go through the list cooperatively somehow. The ICU text file is a bit hard to parse; maybe we should generate some Google doc with the actual letters and stuff for collaborative editing?

Dec 31 '20 19:12 flammie

From what I can see, we just don't want any of the > rules. E.g. rule 212A>004B says Kelvin sign should turn into capital K.

Dec 31 '20 19:12 TinoDidriksen
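To get a feel for what rules like 212A>004B affect, the sketch below lists every character whose canonical decomposition is a single other code point, using Python's `unicodedata` instead of ICU's nfc.txt. It is an approximation under that assumption: it only covers the one-to-one case, while the one-way rules in nfc.txt may also include mappings to several code points, which this does not list. The function name is made up for illustration.

```python
import sys
import unicodedata


def singleton_decompositions():
    """Yield (char, target) pairs for canonical one-to-one decompositions."""
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        decomp = unicodedata.decomposition(ch)
        # Compatibility mappings start with a <tag>; canonical ones do not.
        if not decomp or decomp.startswith("<"):
            continue
        parts = decomp.split()
        if len(parts) == 1:
            yield ch, chr(int(parts[0], 16))


singletons = dict(singleton_decompositions())

# Spot-check the two characters mentioned in this thread.
for ch in ("\u212a", "\u212b"):
    target = singletons[ch]
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} -> "
          f"U+{ord(target):04X} {unicodedata.name(target)}")
```

Full NFC rewrites each key of `singletons` to its value; a more conservative normaliser would leave these characters alone.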

A quick'n'dirty shortcut would be to use a transformation that only hits grapheme clusters with combining marks. For example: echo -n 'ôôÅÅ' | uconv -x '([:^Nonspacing Mark:] [:Nonspacing Mark:]+) > &NFC($1)' | uconv -x any-name yields \N{LATIN SMALL LETTER O WITH CIRCUMFLEX}\N{LATIN SMALL LETTER O WITH CIRCUMFLEX}\N{ANGSTROM SIGN}\N{LATIN CAPITAL LETTER A WITH RING ABOVE}

It turns o + combining circumflex (U+006F U+0302) into ô (U+00F4), but doesn't touch the Ångström sign (U+212B).

However, it would touch the Ångström sign if that had any combining marks after it. I posit that is so rare we don't have to worry.

Dec 31 '20 20:12 TinoDidriksen
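The same idea can be sketched in Python without ICU. This is not the uconv transform itself, just an approximation of it under the assumption that "nonspacing mark" means general category Mn: NFC is applied only to a base character followed by one or more such marks, and everything else is passed through. The function name is hypothetical.

```python
import unicodedata


def nfc_marked_clusters(text):
    """Apply NFC only to a base character followed by nonspacing marks."""
    out = []
    i = 0
    n = len(text)
    while i < n:
        j = i + 1
        # Extend the run over any following nonspacing marks (category Mn).
        while j < n and unicodedata.category(text[j]) == "Mn":
            j += 1
        chunk = text[i:j]
        # Only normalise when a non-mark base actually carries marks;
        # lone characters (including U+212B and U+212A) are left as they are.
        if j - i > 1 and unicodedata.category(text[i]) != "Mn":
            chunk = unicodedata.normalize("NFC", chunk)
        out.append(chunk)
        i = j
    return "".join(out)


sample = "o\u0302\u00f4\u212b\u00c5"
print([f"U+{ord(c):04X}" for c in nfc_marked_clusters(sample)])
```

As noted above, an Ångström sign that is itself followed by a combining mark would still be rewritten, because the whole cluster goes through full NFC.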

> Excellent point. I am personally not very worried about the Ångström sign, but there might be something useful there as well... Perhaps we should go through the list cooperatively somehow. The ICU text file is a bit hard to parse; maybe we should generate some Google doc with the actual letters and stuff for collaborative editing?

https://gist.github.com/mr-martian/80d99c2ca29a36ac483cca84bbc4ec3a

Not quite collaborative editing, but hopefully at least a bit more readable

Feb 11 '21 16:02 mr-martian

https://gist.github.com/mr-martian/11dd5c4dad3861b55054a209393c1e0c

And here's just the unconditional replacements, since I expect that's the part we're most interested in editing.

Feb 11 '21 16:02 mr-martian

> https://gist.github.com/mr-martian/11dd5c4dad3861b55054a209393c1e0c
>
> And here's just the unconditional replacements, since I expect that's the part we're most interested in editing.

Hmm, this all looks OK to me, though I have no good knowledge of most scripts in the list. It doesn't seem to have anything more problematic than Å for the Ångström sign and K for the Kelvin sign afaics, for Latin / generic?

Feb 12 '21 16:02 flammie

Should this be a step that apertium/apy runs before the pipeline? Or something done within morph analysis? (My first thought is that it seems easier and cleaner to do it before analysis.)

Apr 14 '21 13:04 unhammer

I would expect it to be in conjunction with format handling (either before or after, not sure which).

Apr 14 '21 13:04 mr-martian

Should deformatting take care of this? Or are you thinking of something in between deformatting and analysis?

Apr 14 '21 13:04 xavivars

Inserting a normalizer between deformatting and analysis would handle it without requiring every deformatter to be updated. It would also deal with the issue (that I guess was discussed on IRC rather than here) that sooner or later someone might care about normalized vs. not and want to turn it off.

Apr 14 '21 13:04 mr-martian

Having it after deformatting would mean it could run on only the translated parts of the text, and not touch formatting (so that when Word2022 exports an html page with combining chars in its class names it will still look as ugly as intended)

Apr 14 '21 13:04 unhammer
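A rough sketch of what such a post-deformatter step could look like, assuming the usual stream conventions where formatting is wrapped in square-bracket superblanks and reserved characters are backslash-escaped. Nothing here is an existing apertium tool: the function name is hypothetical, and full NFC from Python's `unicodedata` is only a stand-in for whatever conservative normalisation gets agreed on.

```python
import re
import unicodedata

# One token is: an escaped character, a [superblank], a run of plain text,
# or (as a fallback so nothing is dropped) any single leftover character.
TOKEN = re.compile(r"\\.|\[(?:\\.|[^\]\\])*\]|[^\\\[]+|.", re.DOTALL)


def normalise_outside_blanks(stream, form="NFC"):
    """Normalise plain text only; copy superblanks and escapes verbatim."""
    out = []
    for match in TOKEN.finditer(stream):
        token = match.group(0)
        if token.startswith("[") or token.startswith("\\"):
            out.append(token)  # formatting or escape: leave untouched
        else:
            out.append(unicodedata.normalize(form, token))
    return "".join(out)


deformatted = '[<span class="o\u0302">]o\u0302[<\\/span>]'
print(normalise_outside_blanks(deformatted))
```

The class name inside the superblank keeps its combining character, while the translatable text between the blanks is composed. Because this would be a separate pipeline step, it could simply be left out by anyone who wants unnormalised input.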

Relevant IRC log: https://tinodidriksen.com/pisg/freenode/logs/%23apertium/2021-02-11.log

May 15 '21 13:05 TinoDidriksen

And here's a helper script I have for a similar task: https://gist.github.com/TinoDidriksen/aa6b8047e26fb6876b4b9f90c51988f3

May 15 '21 14:05 TinoDidriksen