NFC Normalization of Å

Open andersmelander opened this issue 1 year ago • 4 comments

Decomposing and normalizing the letter Å (Latin Capital Letter A with Ring Above, codepoint $00C5) produces the sequence $0041 $030A. This is correct. However, composing the sequence $0041 $030A produces the codepoint $212B (Angstrom Sign).

$00C5 and $212B are canonically equivalent codepoints, but the normal form of both is $00C5, so composing to $212B is wrong.

For example, the distinct Unicode strings "U+212B" (the angstrom sign "Å") and "U+00C5" (the Swedish letter "Å") are both expanded by NFD (or NFKD) into the sequence "U+0041 U+030A" (Latin letter "A" and combining ring above "°") which is then reduced by NFC (or NFKC) to "U+00C5" (the Swedish letter "Å").
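As a reference for the expected behavior, here is a minimal sketch using Python's standard `unicodedata` module rather than this library. The helper names `codepoints` and `roundtrip` are made up for illustration. It decomposes each of the three inputs with NFD and composes the result with NFC, printing codepoints in the `$XXXX` notation used above.

```python
import unicodedata

LETTER_A_RING = "\u00C5"   # Latin Capital Letter A with Ring Above
ANGSTROM_SIGN = "\u212B"   # Angstrom Sign
DECOMPOSED = "\u0041\u030A"  # A followed by Combining Ring Above


def codepoints(s):
    """Format a string as space-separated $XXXX codepoints."""
    return " ".join(f"${ord(ch):04X}" for ch in s)


def roundtrip(s):
    """Return (NFD form, NFC form of that NFD form) for s."""
    nfd = unicodedata.normalize("NFD", s)
    nfc = unicodedata.normalize("NFC", nfd)
    return nfd, nfc


if __name__ == "__main__":
    for s in (LETTER_A_RING, ANGSTROM_SIGN, DECOMPOSED):
        nfd, nfc = roundtrip(s)
        print(f"{codepoints(s):<12} NFD: {codepoints(nfd):<12} NFC: {codepoints(nfc)}")
```

All three inputs should decompose to $0041 $030A and compose to $00C5. A conforming implementation never produces $212B as the result of composition.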

This is a pretty big problem as Å is a fairly common letter in Scandinavian languages (which is why I discovered this problem).

andersmelander • May 20 '23 00:05