syncthing
Rename special/unsupported characters in filenames
As mentioned in https://github.com/syncthing/syncthing-android/issues/192 , some filenames are not accepted by windows hosts because they contain 'special characters' like colons or bars.
@schuft69 suggested to add a "Rename special characters to '_'" option to resolve this issue.
< (less than)
> (greater than)
: (colon)
" (double quote)
/ (forward slash)
\ (backslash)
| (vertical bar or pipe)
? (question mark)
* (asterisk)
This needs to happen on the source then, or we need to keep the translation in the database somewhere, and hilariousness ensues when the database is reset or lost.
Maybe there could be a warning on the source, inviting to rename the file to avoid problems.
I would think that with the substitution of characters it wouldn't really be required on each machine's database... would it not be possible to just check files incoming/outgoing against the substituted version, if they exist already on the host or don't, do what's appropriate?
No. Consider a case where a *nix machine has two files in a directory, and their contents are unique:
$ echo 'first file' > 'foo:bar'
$ echo 'second file' > 'foo>bar'
$ ls -1
foo>bar
foo:bar
$ cat 'foo:bar'
first file
$ cat 'foo>bar'
second file
Simply renaming each file by substituting offending characters would not work. You would end up with two conflicting files, each named foo_bar.
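The collision is easy to reproduce. The following is a minimal sketch, not Syncthing code: `naive_sanitize` and `find_collisions` are hypothetical helpers, and the reserved set is the nine characters listed at the top of this issue.

```python
from collections import defaultdict

# Characters rejected by Windows, per the list at the top of this issue.
RESERVED = '<>:"/\\|?*'

def naive_sanitize(name: str) -> str:
    """Replace every reserved character with '_' (hypothetical helper)."""
    return "".join("_" if ch in RESERVED else ch for ch in name)

def find_collisions(names):
    """Group original names by sanitized name; keep groups with more than one member."""
    groups = defaultdict(list)
    for name in names:
        groups[naive_sanitize(name)].append(name)
    return {new: olds for new, olds in groups.items() if len(olds) > 1}

print(find_collisions(["foo:bar", "foo>bar", "baz"]))
# {'foo_bar': ['foo:bar', 'foo>bar']}
```

Any one-way substitution has this property: two distinct source names can map to the same target, so the receiving side has to either reject one of them or invent a new name.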
Also, consider this scenario:
- A user has a *nix machine with a file on it named foo:bar.
- The user adds a Windows machine to the syncthing cluster. The file is copied to the Windows machine as foo_bar.
- The user adds a second *nix machine to the syncthing cluster. This second *nix machine receives a file named foo:bar from the first *nix machine and a file named foo_bar from the Windows machine.
Oops. One file magically became two files.
Moreover there are other forbidden names under Windows: for example, nul or com1 and several others, or names longer than 255 characters. And names cannot end with a dot or space. And Mac could be a problem too: I think it requires normalized unicode, while in Linux the normalized and non-normalized versions could be two different files. (I'm just remembering, cannot verify right now, maybe there are some inaccuracies).
We enforce unicode normalization; there's been some pain around that, for people having files with the "wrong" normalization for their OS. There we actually silently fix it, unless configured otherwise.
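To illustrate why normalization matters for file names, here is a small sketch using Python's standard `unicodedata` module. It only shows that the two forms are different strings that compare unequal; it says nothing about how Syncthing itself performs the fix.

```python
import unicodedata

name = "caf\u00e9.txt"  # 'é' as a single precomposed code point

nfc = unicodedata.normalize("NFC", name)  # composed form
nfd = unicodedata.normalize("NFD", name)  # decomposed form: 'e' + combining acute accent

# The two forms look identical on screen but are different strings,
# so a byte-oriented filesystem treats them as two different names.
print(nfc == nfd)
print(len(nfc), len(nfd))

# Normalizing both sides to the same form makes them comparable again.
print(unicodedata.normalize("NFC", nfd) == nfc)
```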
for example, nul or com1 and several other, or names longer than 255 characters. And names cannot end with a dot or space.
:scream: I'd forgotten about the first of that, and didn't know about the second. We should probably handle that too (at least with a reject). Dammit, Windows...
But @Ichimonji10 above summarizes quite nicely why we probably won't be doing character substitutions anytime soon.
Just to add all the information in one place: you have to handle case sensitivity too.
Moreover, the problem is not only with Windows but also affects NTFS under other OSs. While Windows shares mounted under Linux return an error when trying to create an invalid filename, direct NTFS mounts do not return errors by default, and the files are accessible under Linux but not under Windows. So, you cannot rely on errors to know if a filename is legal.
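Since errors cannot be relied on, case collisions would have to be detected by comparing names up front. A rough sketch, assuming Python's `str.casefold` is an acceptable stand-in for the target filesystem's case folding (real filesystems use their own tables, so this is only an approximation); `case_collisions` is a hypothetical helper.

```python
from collections import defaultdict

def case_collisions(names):
    """Return groups of names that differ only by case (approximation via casefold)."""
    groups = defaultdict(list)
    for name in names:
        groups[name.casefold()].append(name)
    return [group for group in groups.values() if len(group) > 1]

print(case_collisions(["Readme.txt", "README.TXT", "notes.txt"]))
# [['Readme.txt', 'README.TXT']]
```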
This is maybe an even crazier idea, but how about simple escaping? On Windows, invalid characters in incoming file names can be substituted, All*files? to All^afiles^q, for example, and the reverse substitution can be used for outgoing files. Substitution may be configurable per-repository and enabled on Windows by default, so this would help even with samba mounts on *nixes.
We could do escaping, but it would have to be to something the user would not enter herself (to avoid confusion), and it would be ugly. I.e. foo:bar -> st--foo%3abar. The point being that we should be able to recognize the escaping even if the database is gone and we're doing an initial scan. I have files with legitimate ^'s in them (well, legitimateness can be debated, but they are there) and probably the same goes for someone else and whatever other character we could think of.
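A sketch of what such a self-describing escape could look like. Only the `st--` prefix and the `foo:bar` -> `st--foo%3abar` example come from the comment above; escaping `%` itself, escaping names that already start with the prefix, and the function names `encode` and `decode` are assumptions made so that the mapping can be reversed without a database. Trailing dots, trailing spaces and reserved device names are not handled here.

```python
import re

PREFIX = "st--"             # marker taken from the example above
RESERVED = '<>:"/\\|?*'     # characters Windows rejects
ESCAPED = RESERVED + "%"    # '%' must be escaped too, or decoding is ambiguous

def encode(name: str) -> str:
    """Return a Windows-safe local name; clean names pass through unchanged."""
    needs_escape = name.startswith(PREFIX) or any(ch in RESERVED for ch in name)
    if not needs_escape:
        return name
    body = "".join(f"%{ord(ch):02x}" if ch in ESCAPED else ch for ch in name)
    return PREFIX + body

def decode(local: str) -> str:
    """Recover the original name from a local name, using only the name itself."""
    if not local.startswith(PREFIX):
        return local
    body = local[len(PREFIX):]
    return re.sub(r"%([0-9a-f]{2})", lambda m: chr(int(m.group(1), 16)), body)

print(encode("foo:bar"))          # st--foo%3abar
print(decode("st--foo%3abar"))    # foo:bar
```

Because the escaped form is recognizable from the name alone, an initial scan with an empty database can still map it back to the original name.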
The case sensitivityness is an open bug somewhere else.
And I don't think we should try to guess hidden rules when the OS and filesystem are fine with them...
I have files with legitimate ^'s in them (well, legitimateness can be debated, but they are there) and probably the same goes for someone else and whatever other character we could think of.
True, but that can be "solved" by escaping the escape character, so incoming file^2 would become file^^2. Of course, if the user manages to create file^a manually on a Windows machine, Linux will receive file*. I can't argue it's not ugly, but it's not ambiguous.
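The caret scheme can be sketched as follows. The codes `^a` for `*`, `^q` for `?` and `^^` for `^` come from the examples in this thread; every other entry in the table is a hypothetical choice made only to complete the illustration.

```python
# '*' -> 'a' and '?' -> 'q' are from the thread; the remaining codes are hypothetical.
TABLE = {"*": "a", "?": "q", ":": "c", "<": "l", ">": "g", '"': "d", "|": "p", "\\": "b"}
REVERSE = {code: ch for ch, code in TABLE.items()}

def escape(name: str) -> str:
    """Applied to incoming names on Windows."""
    out = []
    for ch in name:
        if ch == "^":
            out.append("^^")            # escape the escape character itself
        elif ch in TABLE:
            out.append("^" + TABLE[ch])
        else:
            out.append(ch)
    return "".join(out)

def unescape(name: str) -> str:
    """Applied to outgoing names on Windows."""
    out = []
    i = 0
    while i < len(name):
        ch = name[i]
        nxt = name[i + 1] if i + 1 < len(name) else ""
        if ch == "^" and nxt == "^":
            out.append("^")
            i += 2
        elif ch == "^" and nxt in REVERSE:
            out.append(REVERSE[nxt])
            i += 2
        else:
            out.append(ch)              # ordinary character or unrecognized sequence
            i += 1
    return "".join(out)

print(escape("All*files?"))   # All^afiles^q
print(escape("file^2"))       # file^^2
print(unescape("file^a"))     # file*
```

The last line shows the caveat mentioned above: a file created locally on Windows as file^a is indistinguishable from an escaped file*.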
There is also the thing that cygwin does: they translate invalid characters to characters from the Unicode private use area. That may, in theory, create conflicts, but chances are really slim and it looks better.
To be honest, I think the much simpler solution is for people who sync across multiple OS:es to just stick to the lowest common denominator in file names or live with the errors. It's not that onerous, and people using more than one platform should be kind of aware of the issue. For cases where someone has just a Windows box and a NAS running Linux, they'll most likely create all their files from the Windows side and automatically stay within the limitations.
(The case sensitivity thing still needs to be handled better though.)
You're right, but there is at least one case in which I'd like a solution.
I use it to back up my Android photos. If I take a photo with the Hangouts app, the file gets saved with a :nopm: suffix.
I think the much simpler solution is for people who sync across multiple OS:es to just stick to the lowest common denominator in file names or live with the errors.
There are two problems with this approach.
- Sometimes the user can't choose what characters are included.
- The user usually realizes that [:-)] is an invalid Windows filename only after he can't find it on the other side. And that may be an especially sweet realization if the source machine is already offline.
I don't see point number one related to syncthing; that'd be a problem for the poor user on Windows with someone forcing invalid filenames on him no matter the delivery mechanism? Point number two implies the user is also a Linux user, who I'm assuming are more aware about things like filesystems and name limitations?
(This is all to say that I think this should be solved in a clean way, not that we shouldn't solve it at all. But something that is really ugly or has bad side effects is probably not worth it IMHO.)
Point number two implies the user is also a Linux user, who I'm assuming are more aware about things like filesystems and name limitation?
I can tell from experience that it doesn't work like that :D But yeah, it is not the biggest issue in the known universe :)
Android is Linux, but I don't think the majority of Android users even know what a filesystem is.
Now that you mention it... I'm not saying anything bad about their skill level, but this will concern MacOS users as well...
Mac OS could be a valid concern, when sharing files with Windows. Mac has an awesome historic "feature" around path separators and filenames too:
:open_mouth:
Android I'm not so worried about - specifically, I don't think users there generate files with weird names too often?
I don't know what the larger Android population does. But I personally sync photos between my Android phone and PC, which means that I end up with file names like:
IMG_20140725_184948.jpg
IMG_20140725_184948 (cropped).jpg
Tamalika Mukherjee in Balinese (aksara Bali).png
Also, that screenshot is terrifying.
@Ichimonji10 I agree, but if you take a photo with Hangouts you also have something like:
IMG_20140725_184948:nopm:.jpg
I think it would be good if Syncthing could tackle the problem. It's the kind of problem that needs to be tackled when building a sync solution for everyone that is easy to use. Most people would not mind if ":" is converted to "_" and those that do could change the default setting, whereas not syncing a file at all has bigger impact.
Dropbox replaces trailing spaces in filenames on any platform, when first detected. If you create a file "test " with a trailing space on Mac, then as soon as Dropbox detects it (even if you aren't syncing with any other platforms), it will rename the file locally to "test" without the trailing space.
I am working on a new kind of sync app for Ronomon, and recently worked on replacing reserved characters with underscores.
These are the characters that would need to be replaced to support almost every platform:
- " Double Quote
- * Asterisk
- : Colon
- < Less Than
- > Greater Than
- ? Question Mark
- | Pipe
- NUL Character (Byte 0)
- Control Characters (Bytes 1-31, Byte 127)
- Leading Hyphens (cause problems with many command line tools on Linux and Mac)
- Trailing Dots
- Trailing Spaces
- Parent directory alias (..)
- Current directory alias (.)
- Home directory alias (~)
These characters should be replaced with as many underscores.
And then these device names also need to be modified slightly because AUX and AUX.txt are invalid on Windows but AUX_.txt is fine:
- $IDLE$
- AUX
- COM1
- COM2
- COM3
- COM4
- COM5
- COM6
- COM7
- COM8
- COM9
- CONFIG$
- CON
- CLOCK$
- KEYBD$
- LPT1
- LPT2
- LPT3
- LPT4
- LPT5
- LPT6
- LPT7
- LPT8
- LPT9
- LST
- NUL
- PRN
- SCREEN$
- $AttrDef
- $BadClus
- $Bitmap
- $Boot
- $LogFile
- $MFT
- $MFTMirr
- pagefile.sys
- $Secure
- $UpCase
- $Volume
- $Extend
These reserved device names should be automatically appended with an underscore (e.g. "AUX_" or "aux_.txt").
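The replacement rules above can be sketched roughly as follows. This is an illustration covering only a subset of the lists (the DOS device names rather than the NTFS metadata files, and no handling of the directory aliases); `sanitize` is a hypothetical helper, and matching the device name against the part before the first dot is an assumption.

```python
import re

# Reserved characters from the list above, plus NUL and the control characters.
RESERVED_CHARS = re.compile(r'["*:<>?|\x00-\x1f\x7f]')

# Subset of the device names listed above.
DEVICE_NAMES = {
    "AUX", "CON", "NUL", "PRN", "CLOCK$",
    *(f"COM{i}" for i in range(1, 10)),
    *(f"LPT{i}" for i in range(1, 10)),
}

def sanitize(name: str) -> str:
    """Replace reserved characters with as many underscores (hypothetical helper)."""
    name = RESERVED_CHARS.sub("_", name)

    # Trailing dots and spaces become the same number of underscores.
    kept = name.rstrip(". ")
    name = kept + "_" * (len(name) - len(kept))

    # Leading hyphens become the same number of underscores.
    kept = name.lstrip("-")
    name = "_" * (len(name) - len(kept)) + kept

    # Device names get an underscore appended before any extension.
    stem, dot, rest = name.partition(".")
    if stem.upper() in DEVICE_NAMES:
        name = stem + "_" + dot + rest
    return name

print(sanitize("IMG_20140725_184948:nopm:.jpg"))  # IMG_20140725_184948_nopm_.jpg
print(sanitize("aux.txt"))                        # aux_.txt
print(sanitize("test "))                          # test_
```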
Please let me know if something is left out here.
If it would be helpful here, then these are some key ideas which might make the problem manageable and not so hard:
- Rename the files on any platform as soon as they are spotted. Renaming them only when they reach a platform where they are invalid only delays the rename and may surprise the user later.
- The file should be renamed across the cluster. i.e. There should be no special mapping to preserve invalid characters on platforms which allow them.
- When renaming the file locally when the file is first detected by the scanner, first check if another file already exists with the proposed rename. If it does, then add a "(Reserved Character Conflict 1)" label to the end of the filename but before the extension, and then try again or increment the conflict count until the destination is unique.
- Take care with hidden files (e.g. ".hidden:file") to not add the conflict label before the period. In this case there is no extension and the conflict label needs to be added on the right hand side of the period not the left hand side (Dropbox gets this wrong).
- Rare case: Make sure that after replacing reserved characters, the filename has not been inadvertently converted into an Apple Double file (which starts with "._"), if it was not previously an Apple Double file before the replacement. If it is now an Apple Double file, then convert the "._" prefix to ".-", i.e. use a dash instead of an underscore.
- Very rare case: Make sure that after replacing reserved characters, the filename has not been inadvertently converted into a ".DS_Store" file. If it was, then convert the "_" to "-", i.e. use a dash instead of an underscore.
- Very rare case: On Windows, certain short 8.3 filenames with no corresponding long filename and also containing a tilde, such as "SECURE~1.TXT", can cause rare conflicts with other files such as "SecureSocketsLayer.txt" and "SecureFTPServer.txt", depending on the order in which they are synced from different machines, if they are resolved by Windows to the same short 8.3 filename. These conflicts cannot be resolved in the usual manner by appending a conflict label to the filename. Instead, the tilde in these short 8.3 filenames should be automatically replaced with an underscore (e.g. "SECURE_1.TXT") when first detected in filenames on any platform, so that these short 8.3 filenames can continue to be synced.
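The conflict-label step from the list above could look something like this. It is a sketch under stated assumptions: `unique_name` is a hypothetical helper, `existing` is an in-memory set standing in for a directory listing, and the comparison is case-sensitive. Python's `os.path.splitext` already treats a leading dot as part of the name rather than as an extension, which gives the hidden-file behaviour described above.

```python
import os

def unique_name(proposed: str, existing: set) -> str:
    """Return `proposed`, or a labelled variant that does not clash with `existing`."""
    if proposed not in existing:
        return proposed
    # splitext(".hidden_file") returns (".hidden_file", ""), so the label
    # ends up to the right of the leading period for hidden files.
    stem, ext = os.path.splitext(proposed)
    count = 1
    while True:
        candidate = f"{stem} (Reserved Character Conflict {count}){ext}"
        if candidate not in existing:
            return candidate
        count += 1

existing = {"foo_bar.txt", ".hidden_file"}
print(unique_name("foo_bar.txt", existing))
# foo_bar (Reserved Character Conflict 1).txt
print(unique_name(".hidden_file", existing))
# .hidden_file (Reserved Character Conflict 1)
```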
This should cover:
exFAT (http://en.wikipedia.org/wiki/ExFAT) VFAT (http://en.wikipedia.org/wiki/File_Allocation_Table#VFAT) NTFS (http://en.wikipedia.org/wiki/Filename#Comparison_of_filename_limitations) HFS+ (http://en.wikipedia.org/wiki/Filename#Comparison_of_filename_limitations)
For non-case-preserving filesystems (FAT12, FAT16, FAT32), we may also need to replace "+,.;=[]!@" but I have not tested this yet.
Hope this helps.
Well, none of these characters is a problem on the decent filesystems used on Linux. There can be a problem on Android, since it uses FAT on SD cards.
Anyway, simply replacing these with '_' is a bad idea, because it causes data loss. There should be a bidirectional mapping which allows keeping the original filename on one computer and the modified one on another.
(Sent by email in reply to Joran Dirk Greef's comment of 7.5.2015 10:42, quoted above.)
I could possibly see a rename-on-sight as described above as an optional must-be-turned-on-manually feature. We currently do that silently for incorrect unicode normalization, and I could possibly see us doing it by default for trailing space (seldom intended I think, and apparently causes issues in cross compatibility), but the rest is not something we should do by default for sure.
Dropbox has a webservice to check all your files for potential conflicts (https://www.dropbox.com/help/145 and https://www.dropbox.com/bad_files_check) and doesn't rename them by default.
Maybe an empty "illegal filenames present, run the bad_filename_checker" file or something like that could be inserted instead of a file that cannot be created on a certain platform, so the file can be renamed properly. Alternatively name the file temporarily to its sha256 hash or whatever syncthing uses as UUID internally so the content can be synced and only the file name gets lost until it is properly named for your platform and can still be used to seed the contents.
This is not a simple problem. Normally "network path names" consist of a list of directory names and a file name. Each of these components can be any string of eight bit bytes except for the NUL byte which is used to terminate each component. The directory separator can be different on every host. But it is very unusual nowadays for the / character not to be a valid separator. (However, care should be taken for aliases such as the \ character on Windows so that a pathname cannot be constructed to go UP in the directory tree.)
The components can (in theory) be any string including . and .. and the null string. The first two are a problem for most OSes (all?) including Unix (linux). The null string component may be a problem on Windows because \\ is special.
Windows has a very long list of problematic names; it would be very insensitive to export these limitations to other operating systems, especially as it includes case insensitivity.
In addition many of these limitations become security issues, for example translating overlong UTF-8 sequences onto a host that uses UTF-16 can manufacture NUL or \ characters from plain looking byte streams. What's more there's nothing wrong with these "non-UTF-8" file names on a Unix system so they should be transferred unmangled between Linux machines.
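As an illustration of the overlong-sequence concern, the sketch below uses Python's built-in codec, which rejects overlong forms when decoding strictly, and the standard `surrogateescape` error handler, which lets arbitrary bytes survive a round trip unchanged. The byte strings are the two-byte overlong encodings of NUL, / and \; `decodes_strictly` is a hypothetical helper.

```python
# Two-byte overlong encodings of characters that should take one byte.
OVERLONG = {
    "NUL": b"\xc0\x80",
    "slash": b"\xc0\xaf",
    "backslash": b"\xc1\x9c",
}

def decodes_strictly(raw: bytes) -> bool:
    """True if `raw` is well-formed UTF-8 according to a strict decoder."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

for label, raw in OVERLONG.items():
    # A strict decoder refuses these instead of producing NUL, '/' or '\'.
    print(label, decodes_strictly(raw))

# Between byte-oriented hosts the name can still be carried unmangled:
raw = b"report-\xc0\xaf.txt"
carried = raw.decode("utf-8", "surrogateescape")
print(carried.encode("utf-8", "surrogateescape") == raw)
```

A lenient decoder that accepted these sequences would turn a harmless-looking byte string into a path separator or terminator on the receiving host, which is the risk described above.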
It would be nice if a unique "network path name" could be mapped to a unique local path name; this is mostly possible on Unix as it'd be simple enough to add a "someone tried to be evil" prefix to the unwanted file names. But the dumb "sort of case sensitive only not" behaviour of Windows defeats this.
Then, the case insensitivity of Windows is even worse than you think. It isn't the same everywhere. It depends on the localisation that Windows is running under and the version of Windows that you're running (e.g. Unicode characters that don't exist on an earlier version of Windows can't be "case smashed" there but can be on a later one).
Obviously, it's impossible to tell what workarounds may be needed on another peer, so all a particular node can do is look out for itself. The security impacts must be addressed, but it's probably impossible to fix case insensitivity. With luck you'd be able to manufacture a collision, make believe that the file has just been created on the problematic machine and force some sort of rename across the entire "swarm". Making it a unique rename will be the trick.
I personally don't find very attractive the idea of applying Windows' puzzling naming constraints to the other OSes. What if a user is syncing files for an application that relies on file names that would be illegal in Windows? And I don't think we can design an encoding that would be bi-directional (allowing to restore the original name from the encoded one) without inserting a fair amount of confusion for the end user.
We may as well let the Windows nodes deal with the Windows problem, and avoid spreading it across other OSes. The Windows nodes should be able to keep track of global file names that are locally illegal. The files with illegal global names could be renamed locally, inserting a distinctive marker (like "!syncname" or whatever) in the name and replacing/removing any illegal characters or sequences (of course it would have to store the mapping of the local and global names somewhere). At least the files would actually be synced, the user would know there is a naming problem, and might have a cue of the original file name. When encountering such file, the Windows node could simply lookup the global name before proceeding with any network task. It could also keep track of file renaming, only changing the remote file name when the user manually removes the marker from the local name -- making it a valid file name for any OS anyway.
As for storage, instead of storing the local-global name mappings in the Syncthing database, why not simply store them in a local file? We already deal with ".stignore" and ".stfolder" at the root of a shared folder (we could even use the latter?). The mapping could as well be stored in a sibling system file, so the global name follows when the user moves folders around. Windows users are kinda used to see their folders cluttered with hidden system files such as "Desktop.ini" and "thumbs.db", so I don't see this approach as a big turn-off for them. And it would not affect in any way how the other nodes work. The Windows problem would remain a Windows problem.
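A rough sketch of such a sidecar mapping file, under loudly stated assumptions: the file name `.stnames.json`, the JSON layout and the three helper functions are all hypothetical and do not correspond to anything Syncthing actually reads or writes.

```python
import json
from pathlib import Path

MAPPING_FILE = ".stnames.json"  # hypothetical name, not a real Syncthing file

def load_mapping(folder) -> dict:
    """Read the local-name -> global-name mapping stored in `folder`, if any."""
    path = Path(folder) / MAPPING_FILE
    if not path.exists():
        return {}
    return json.loads(path.read_text(encoding="utf-8"))

def record_name(folder, local_name: str, global_name: str) -> None:
    """Remember that `local_name` on disk stands for `global_name` in the cluster."""
    mapping = load_mapping(folder)
    mapping[local_name] = global_name
    text = json.dumps(mapping, indent=2, sort_keys=True)
    (Path(folder) / MAPPING_FILE).write_text(text, encoding="utf-8")

def global_name_for(folder, local_name: str) -> str:
    """Look up the global name; unmapped files keep their local name."""
    return load_mapping(folder).get(local_name, local_name)
```

Keeping the mapping next to the files means it moves with the folder, but it also means the mapping can be lost or edited independently of the database, which is the failure mode raised earlier in this thread.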
Some of these are actually not just Windows-specific, i.e. leading dash in filenames on Linux and Mac which are technically allowed but a security vulnerability.
Replacing characters only on Windows using a mapping would be great, but there is no canonical mapping for this kind of thing (e.g. such as the canonical mapping that Unicode has in NFC or NFD) that users could use to know what Syncthing is doing, so it would break any cross-platform file tree comparison that any applications try and do and lead to data loss (e.g. a program running on Windows tries to access the same file on a Linux server and finds it missing).
It would probably be good to also prompt the user first with a list of pathnames that are invalid, and then offer to automatically fix them so the user does not have to waste time doing that manually if they do want to fix it.
For differences in Unicode form, one would actually NOT want to "fixup" the filename because this would lose data (Linus has several posts on this), and because there is already a canonical mapping designed to allow the same filename to have different canonical forms on different platforms (e.g. having some files in NFC and NFD on Windows/Linux and NFD on Mac).
But for invalid characters there is no such canonical mapping.