unicode normalization

Open ThomasWaldmann opened this issue 5 years ago • 1 comments

we usually have utf-8 content in text items (including links and transclusions by-name) and we usually also use utf-8 encoding for item names.

unicode content and names should get normalized before they are encoded to utf-8 and stored, otherwise stuff can get inconsistent (especially if people use apple devices).

i recently ran into this problem in these issues:

https://github.com/moinwiki/moin-1.9/issues/59
https://github.com/borgbackup/borg/issues/4771

as long as moin2 is not released and there is no "production" content in it, we can still prevent stuff from getting inconsistent.

i'd suggest we always normalize unicode text (names, text item content) to NFC form before storing it into the backend.
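
a minimal sketch of "normalize, then encode, then store", using only python's stdlib `unicodedata` module. the helper names `normalize_text` and `encode_for_storage` are made up for illustration, they are not existing moin2 functions:

```python
import unicodedata


def normalize_text(text: str) -> str:
    """Return text in NFC form (hypothetical helper, not a moin2 API)."""
    return unicodedata.normalize("NFC", text)


def encode_for_storage(text: str) -> bytes:
    """Normalize first, then encode to utf-8, in that order."""
    return normalize_text(text).encode("utf-8")


composed = "\u00e4"     # a-umlaut as a single code point (NFC)
decomposed = "a\u0308"  # "a" followed by a combining diaeresis (NFD)

# both spellings end up as the same bytes once normalized
stored_composed = encode_for_storage(composed)
stored_decomposed = encode_for_storage(decomposed)
```

with this, both input forms are stored as the same two bytes, so later comparisons on the stored data are consistent.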

that way we can avoid that an NFD link to an NFC-named item looks correct, but does not work.
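
to illustrate the broken-link case, here is a small sketch with a plain dict standing in for the item store (an assumption for the example, the real backend lookup looks different). the item name and link target are made-up sample data:

```python
import unicodedata

# item name stored in NFC form ("März" with a single-code-point a-umlaut)
items = {"M\u00e4rz": "content of the item"}

# the same name as a link target, but in NFD form (e.g. pasted from a mac)
link_target = "Ma\u0308rz"

# raw lookup: the strings render the same, but compare as different
found_raw = link_target in items

# lookup after normalizing the link target to NFC
found_normalized = unicodedata.normalize("NFC", link_target) in items
```

here `found_raw` is `False` and `found_normalized` is `True`, which is exactly the "looks correct, but does not work" situation and its fix.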

any text that comes from the user must go through that normalization (e.g. when entering stuff in form fields).

this is not a problem in english, because everything is plain ascii, but it is a problem for a lot of other languages, like german, french, spanish, ...

For example, take the german a-umlaut (both print outputs look the same in a terminal, but not on github):

# NFC normalization (composed):
>>> print(b"\xc3\xa4".decode('utf8'))
ä
# NFD normalization (decomposed):
>>> print(b"a\xcc\x88".decode('utf8'))
ä

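since the two outputs can look identical, it may help to inspect the code points instead of the rendered glyphs. a small sketch using the stdlib `unicodedata` module (variable names are just for illustration):

```python
import unicodedata

composed = b"\xc3\xa4".decode("utf-8")     # NFC: one code point
decomposed = b"a\xcc\x88".decode("utf-8")  # NFD: base letter + combining mark

for label, text in (("NFC", composed), ("NFD", decomposed)):
    # print the length and the official unicode name of each code point
    names = [unicodedata.name(ch) for ch in text]
    print(label, len(text), names)

# normalizing the decomposed form to NFC yields the composed form
same_after_nfc = unicodedata.normalize("NFC", decomposed) == composed
```

the composed string has length 1 and the decomposed one has length 2, although both are displayed as a single a-umlaut.
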
Jul 14 '20 17:07 ThomasWaldmann

of course the importer from moin-1.9 also needs to normalize (page content, page names, attachment names).

about attachment content: i guess if it ends up being a text/*;charset=utf-8 item, we should also normalize its content to NFC form.

usually this should not change the content, as NFC is what most systems produce anyway, just apple does it differently when it comes to filesystem names.

Jul 14 '20 17:07 ThomasWaldmann