NFC Normalization of Å
Decomposing and normalizing the letter Å (Latin Capital Letter A with Ring Above, codepoint $00C5) produces the sequence $0041 $030A. This is correct.
However, composing the sequence $0041 $030A produces the codepoint $212B (Angstrom Sign).
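$00C5 and $212B are canonically equivalent codepoints, but the normal form of both is $00C5. The Angstrom Sign has a singleton canonical decomposition to $00C5 and is therefore excluded from composition, so NFC should never produce $212B and the composition is wrong.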
For example, the distinct Unicode strings "U+212B" (the angstrom sign "Å") and "U+00C5" (the Swedish letter "Å") are both expanded by NFD (or NFKD) into the sequence "U+0041 U+030A" (Latin letter "A" and combining ring above "°") which is then reduced by NFC (or NFKC) to "U+00C5" (the Swedish letter "Å").
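The expected results can be cross-checked against an independent implementation. The following is a minimal sketch using Python's standard `unicodedata` module, not the library this report is about; the constant names and the `codepoints` helper are only illustrative.

```python
import unicodedata

# Illustrative names, not part of any library API.
LETTER_A_RING = "\u00C5"     # LATIN CAPITAL LETTER A WITH RING ABOVE
ANGSTROM_SIGN = "\u212B"     # ANGSTROM SIGN
DECOMPOSED = "\u0041\u030A"  # LATIN CAPITAL LETTER A + COMBINING RING ABOVE


def codepoints(text: str) -> str:
    """Format a string as space-separated U+XXXX codepoints."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


# Decomposition: both characters expand to the same sequence.
nfd_letter = unicodedata.normalize("NFD", LETTER_A_RING)
nfd_sign = unicodedata.normalize("NFD", ANGSTROM_SIGN)

# Composition: the sequence, and the Angstrom Sign itself, normalize to U+00C5.
nfc_sequence = unicodedata.normalize("NFC", DECOMPOSED)
nfc_sign = unicodedata.normalize("NFC", ANGSTROM_SIGN)

print(codepoints(nfd_letter))    # U+0041 U+030A
print(codepoints(nfd_sign))      # U+0041 U+030A
print(codepoints(nfc_sequence))  # U+00C5
print(codepoints(nfc_sign))      # U+00C5
```

In this reference, $212B never appears in any normalized output, which is the behavior I would expect from the composition step here as well.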
This is a fairly serious problem, since Å is a common letter in the Scandinavian languages (which is how I discovered it).