iconv filename conversion partly fails on mac

Open Tabiskabis opened this issue 5 years ago • 8 comments

Situation

I'm using rsync (3.2.2) on MacOS (10.14) to sync parts of thumbdrives (FAT/NTFS formatted) to a remote Synology NAS (rsync 3.0.9), where they are accessible over SAMBA. Some filenames contain accents and umlauts (e.g. défilé, Überraschung). MacOS uses some crazy non-standard normalization to encode these in decomposed UTF-8. Therefore I'm using the --iconv=UTF8-MAC,UTF-8 option. So far so good.
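To illustrate what "decomposed" means here, a minimal Python sketch follows. It uses the standard NFD and NFC forms from the `unicodedata` module as a stand-in for the conversion; this is an assumption, since Apple's UTF8-MAC is a variant of NFD and differs from it for some character ranges.

```python
import unicodedata

# Precomposed (NFC) spelling, as Linux and Windows tools usually produce it.
name_nfc = "d\u00e9fil\u00e9"  # "défilé" with é as a single code point

# Decomposed spelling: each é becomes "e" + U+0301 COMBINING ACUTE ACCENT.
# Standard NFD is used here only as an approximation of UTF8-MAC.
name_nfd = unicodedata.normalize("NFD", name_nfc)

print(len(name_nfc), len(name_nfd))   # code point counts differ
print(name_nfc.encode("utf-8"))       # byte sequences differ as well
print(name_nfd.encode("utf-8"))
print(name_nfc == name_nfd)           # the two names do not compare equal
```

Both strings render identically, but they are different byte sequences, which is why a filename in the wrong form cannot be typed or matched on the target system.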

Bug/appearance

The problem arises with filenames that contain characters outside the Basic Multilingual Plane, i.e. those that take 4 bytes in UTF-8 and a surrogate pair in UTF-16 (e.g. U+1F4A9 💩 pile of poo, U+1D6F0 𝛰 mathematical italic capital omicron). MacOS' iconv will not convert these:

    echo 𝛰 | iconv -f utf-8-mac -t utf-8
    iconv: (stdin):1:0: cannot convert

Therefore rsync fails to (not-)convert them as well:

    [sender] cannot convert filename: 🤬you too.msg (Illegal byte sequence)

even though they don't need to be converted, as they already use the same normalization and encoding that Linux uses.

Problem

I can't work around this issue by rsyncing twice, as I need to use --delete. Also, there are already files containing both umlauts and emojis ("Süße Vögel 😍.jpg"). Not using iconv creates wrongly normalized umlaut-containing filenames on the target, which practically renders them inaccessible on the Synology NAS and when shared over SAMBA.

Cause

It seems that 4-byte UTF-8 characters will just not convert from UTF8-MAC to [any encoding scheme]. Now you might argue that this is in fact an issue with Apple's (lib)iconv. That may be technically true, but Apple is surely not going to fix/extend that code anytime soon, if at all, and especially not for their security-fixes-only OS. I already read about SATOH Fumiyasu's project https://github.com/fumiyas/libiconv-utf8mac that probably would fix the issue if rsync were linked with it instead of Apple's libiconv, but I'm unable to do that. I am not a dev.

Fix ?

Again, I'm merely a junior sysadmin, not a dev. But I suppose rsync (when linked against Apple's libiconv) should detect the use of --iconv=UTF(-)8-MAC,UTF(-)8. It could then convert single Unicode characters instead of whole filenames, and not try to convert 4-byte UTF-8 characters (U+10000 and above, the ones that need surrogate pairs U+D800 to U+DFFF in UTF-16) but just pass them through. Or maybe rsync could just incorporate said libiconv-utf8mac?
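The pass-through idea can be sketched in Python as follows. This is not rsync code: `to_nfc_passthrough` is a hypothetical helper, and standard NFC from `unicodedata` stands in for the UTF8-MAC to UTF-8 conversion (an assumption, since Apple's variant is not identical to NFD/NFC). One deviation from the proposal: the sketch converts runs of characters rather than single characters, because a base letter and its combining mark have to be seen together to be composed.

```python
import unicodedata

def to_nfc_passthrough(name: str) -> str:
    """Hypothetical helper: compose runs of BMP characters and copy
    4-byte (>= U+10000) characters through without converting them."""
    out = []
    run = []  # pending characters below U+10000
    for ch in name:
        if ord(ch) >= 0x10000:
            # Convert what has been collected so far, then pass the
            # supplementary-plane character through untouched.
            out.append(unicodedata.normalize("NFC", "".join(run)))
            run = []
            out.append(ch)
        else:
            run.append(ch)
    out.append(unicodedata.normalize("NFC", "".join(run)))
    return "".join(out)

# Example from the issue: umlauts plus an emoji, in decomposed form.
precomposed = "S\u00fc\u00dfe V\u00f6gel \U0001F60D.jpg"
decomposed = unicodedata.normalize("NFD", precomposed)
print(to_nfc_passthrough(decomposed) == precomposed)
```

In a real implementation the inner conversion would be the iconv call that currently fails on the whole filename; the sketch only shows the splitting logic.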

Tabiskabis avatar Jul 29 '20 15:07 Tabiskabis

I have that problem and it's annoying. Are you sure that https://github.com/fumiyas/libiconv-utf8mac would solve the problem? Does this library handle 4-byte UTF-8 well? If yes, I can probably generate an executable linked against that library.

jief666 avatar Jan 20 '21 20:01 jief666

Not entirely sure, but it says "Support surrogate pairs" in their README.md.

Tabiskabis avatar Jan 23 '21 18:01 Tabiskabis

Once again, I've stumbled over this issue. It's still unfixed. This time, I've frankensteined fumiyas' libiconv by including its utf8mac.h in the current GNU libiconv and compiling that. The resulting iconv binary processes the respective filenames successfully. However, I'm still not enough of a dev to compile rsync against this specific libiconv.

Tabiskabis avatar May 27 '22 18:05 Tabiskabis

Doesn't fumiyas' libiconv already include utf8mac.h? They say "Work: UTF-8-MAC support: Apple libiconv-51.200.6 (utf8mac.h)"

jief666 avatar May 27 '22 18:05 jief666

Yes it does, but I could not compile it. So I simply copied that file into GNU libiconv, which I'm able to compile. DANG yesss! I just compiled rsync with the mentioned frankenstein version of GNU libiconv with fumiyas' utf8mac.h and it works 💯

Tabiskabis avatar May 27 '22 18:05 Tabiskabis

I have an xcode project to compile rsync. So I guess I could compile it against this lib...

jief666 avatar May 27 '22 18:05 jief666

I have a bit of time to work on this. I tried to reproduce it (with a file containing 💩 and another with 🤬) but it worked... Is it fixed?

jief666 avatar Jul 06 '22 12:07 jief666

It is not fixed. Try to sync a file named "Mäc🤬OS.txt" from a Mac to any non-Mac system and try to open it in a terminal, without copy-pasting the filename from the Mac's clipboard or from an ls listing. Use an on-screen keyboard instead, or some other way of Unicode input. This makes sure you're using actual precomposed characters and are not accidentally copy-pasting the decomposed form (or any mixture).
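Since the two forms look identical on screen, it can help to check programmatically which one a name on the target actually uses. A minimal Python sketch, assuming standard NFC/NFD as the reference forms; `classify` and `report` are hypothetical helper names:

```python
import os
import unicodedata

def classify(name: str) -> str:
    """Report which normalization form a filename is in."""
    is_nfc = unicodedata.normalize("NFC", name) == name
    is_nfd = unicodedata.normalize("NFD", name) == name
    if is_nfc and is_nfd:
        return "no composable characters"
    if is_nfc:
        return "precomposed (NFC)"
    if is_nfd:
        return "decomposed (NFD)"
    return "mixed"

def report(directory: str = ".") -> None:
    """Print the form of every entry in a directory."""
    for name in os.listdir(directory):
        # repr() exposes the name unambiguously, whatever the terminal
        # renders it as.
        print(f"{classify(name)}: {name!r}")
```

Running `report()` in the target directory on the non-Mac system should list the synced test file as precomposed if the conversion worked.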

My endboss testfile is named "Mäc🤬𝛰S🤦🏻‍♂️!☝🏻?㍙가-𝄞 ﷽ x̷͈̗̫̖̻̣͎̉͗̒̅̽̕͠.txt", file content = filename. It does not even display correctly in my browser and guarantees some fun with filebrowsers and consoles different rendering abilities. cat it (twice) for example. Question mark to provoke even more fun with Windows.

Why would anyone include such weird characters in their file names? Text messages and e-mail subject lines put in filenames by very clever software. This is how i stumbled upon the issue in the first place.

Niche humor aside, I have realized in the meantime that a proper fix would not include iconv at all, but would either add some Mac-specific code to normalize filenames before sending (edit: unintuitive, not how things usually work) ~~and convert received names, possibly using MacOS functions~~ (edit: not required, already works like that), or the receiving rsync would have to normalize the filenames (I'm confused why rsync does this on Mac but not on Linux). Either way I can see havoc happening in existing deployments that would suddenly wind up with two differently normalized copies of files.

Tabiskabis avatar Jul 06 '22 23:07 Tabiskabis