zfs normalization=formD fails to allow writing of some valid utf8 filenames

System information

Describe the problem you're observing

There seem to be UTF-8 filenames that can be written when a dataset has been created with normalization=form[K]C (or none), but cannot be written with form[K]D.

A public example of such a filename:

https://www.youtube.com/watch?v=GS1YzVCXttU

Although this seems to be valid UTF-8, the file(name) cannot be written with formD. In addition, normalization should only take place when reading (comparing) a file. Since the filesystem may check for the existence of an equally named file before writing, this could indicate a bug.

With a formD-initialized dataset, people have been able to create filenames using any single Korean character of the above string, but not the whole string. Again, no issues with formC whatsoever.

Describe how to reproduce the problem

Try to download the file above while preserving its filename on a formD and a formC dataset. Or simply copy and paste the name and touch the file on a formC and a formD dataset.

Include any warning/errors/backtraces from the system logs

$ touch "Auf dem Wasser zu Singen 물 위에서 노래 부름' F Schubert 고도희 김경미 김한나 Piano 김보겸 ‘2011년 이주경 교수와 함께하는 제자 봄 연주회"; echo $?

touch: cannot touch '''Auf dem Wasser zu Singen '$'\353\254\274'' '$'\354\234\204\354\227\220\354\204\234'' '$'\353\205\270\353\236\230'' '$'\353\266\200\353\246\204''' F Schubert '$'\352\263\240\353\217\204\355\235\254'' '$'\352\271\200\352\262\275\353\257\270'' '$'\352\271\200\355\225\234\353\202\230'' Piano '$'\352\271\200\353\263\264\352\262\270'' '$'\342\200\230''2011'$'\353\205\204'' '$'\354\235\264\354\243\274\352\262\275'' '$'\352\265\220\354\210\230\354\231\200'' '$'\355\225\250\352\273\230\355\225\230\353\212\224'' '$'\354\240\234\354\236\220'' '$'\353\264\204'' '$'\354\227\260\354\243\274\355\232\214': Operation not supported
1

Feb 07 '20 12:02 EdWoow

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.

Feb 07 '21 01:02 stale[bot]

This still happens on ZoL 2.0.4 on Tumbleweed 20210307.

When cutting off characters at the end, I can create the file:

$ touch "Auf dem Wasser zu Singen 물 위에서 노래 부름 F Schubert 고도희 김경미 김한나 Piano 김보겸 2011년 이주경 교수와 "; echo $?
0

Adding "아" at the end also works:

$ touch "Auf dem Wasser zu Singen 물 위에서 노래 부름 F Schubert 고도희 김경미 김한나 Piano 김보겸 2011년 이주경 교수와 아"; echo $?
0

"하" works as well:

$ touch "Auf dem Wasser zu Singen 물 위에서 노래 부름 F Schubert 고도희 김경미 김한나 Piano 김보겸 2011년 이주경 교수와 하"; echo $?
0

However, the original "함" fails:

$ touch "Auf dem Wasser zu Singen 물 위에서 노래 부름 F Schubert 고도희 김경미 김한나 Piano 김보겸 2011년 이주경 교수와 함"; echo $?
touch: cannot touch 'Auf dem Wasser zu Singen 물 위에서 노래 부름 F Schubert 고도희 김경미 김한나 Piano 김보겸 2011년 이주경 교수와 함': Operation not supported
1

So, longer filenames can't be set with normalization=formD. With normalization=none, all file names listed above can be set.

Mar 14 '21 11:03 clavinet

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.

Mar 17 '22 10:03 stale[bot]

Possibly related: https://github.com/openzfs/zfs/issues/2234

Since it works with formC it may be that the fully-decomposed string is hitting a length limit somewhere, while the fully-composed is collapsed back down to fit before the check.
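This hypothesis can be checked outside ZFS by measuring how many UTF-8 bytes a name occupies under each normalization form. The following is a minimal sketch using Python's standard unicodedata module as a stand-in for the kernel's own normalization code, so edge cases may differ; byte_lengths is a hypothetical helper name, and the sample is the tail of the file name from the original report.

```python
import unicodedata


def byte_lengths(name):
    """Return the UTF-8 byte length of name as given and under each Unicode form."""
    lengths = {"none": len(name.encode("utf-8"))}
    for form in ("NFC", "NFD", "NFKC", "NFKD"):
        lengths[form] = len(unicodedata.normalize(form, name).encode("utf-8"))
    return lengths


# Tail of the file name from the original report (assumed to be precomposed).
sample = "교수와 함께하는 제자 봄 연주회"

for form, size in byte_lengths(sample).items():
    print(f"{form:>4}: {size} bytes")
```

If the decomposed forms come out noticeably longer than the composed ones, a limit applied to the normalized string rather than the stored one would explain why only formD and formKD datasets reject the name.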

May 21 '22 06:05 sevmonster

I can't believe this is still an issue after 9 years (see #2234), and here is how to reproduce:

This works:

$ touch 'のののののののののののののののののののののののののののののののののののののののののののののののののののののののののののののののののののののののののののの'

This doesn't:

$ touch 'ザザザザザザザザザザザザザザザザザザザザザザザザザザザザザザザザザザザザザザザザザザザ'
touch: cannot touch 'ザザザザザザザザザザザザザザザザザザザザザザザザザザザザザザザザザザザザザザザザザザザ': Operation not supported

And it would be nice to allow maximum file name length to be increased (see #13043)

Apr 14 '23 07:04 vizv

I finally found the issue is caused by the current ZAP implementation.

The ZAP doc found in include/sys/zap.h says:

The name is a zero-terminated string of up to ZAP_MAXNAMELEN bytes (including terminating NULL).

And the implementation in module/zfs/zap_micro.c:

static int
zap_normalize(zap_t *zap, const char *name, char *namenorm, int normflags)
{
        ASSERT(!(zap_getflags(zap) & ZAP_FLAG_UINT64_KEY));

        size_t inlen = strlen(name) + 1;
        size_t outlen = ZAP_MAXNAMELEN;

        int err = 0;
        (void) u8_textprep_str((char *)name, &inlen, namenorm, &outlen,
            normflags | U8_TEXTPREP_IGNORE_NULL | U8_TEXTPREP_IGNORE_INVALID,
            U8_UNICODE_LATEST, &err);

        return (err);
}

The function zap_normalize is used to normalize a name (a filename in my case). It calls u8_textprep_str and limits the normalized name to ZAP_MAXNAMELEN bytes.

When formD is set for normalization, it only does decomposing:

の (e3 81 ae) has no decomposition and stays e3 81 ae
ザ (e3 82 b6) decomposed to e3 82 b5 e3 82 99 (e3 82 b5 for サ, and e3 82 99 for ゙)

@clavinet It's even worse for Korean, as 함 (ed 95 a8) decomposes to e1 84 92 e1 85 a1 e1 86 b7, which occupies 3x the space.

When the length of the decomposed string is >= ZAP_MAXNAMELEN, u8_textprep_str returns E2BIG, and zap_name_alloc fails and returns NULL. That's why the message is "Operation not supported" instead of "File name too long": the original filename doesn't hit the length limit, since, as the doc says, "File names are always stored unmodified, names are normalized as part of any comparison process."
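The buffer arithmetic can be replayed against the three names @clavinet tried. This is only a sketch under stated assumptions: ZAP_MAXNAMELEN is taken to be 256, Python's unicodedata stands in for u8_textprep_str, and the names are rebuilt with single spaces between words, which may not match the reported names byte for byte. normalized_fits is a hypothetical helper.

```python
import unicodedata

ZAP_MAXNAMELEN = 256  # assumed value of the constant from include/sys/zap.h


def normalized_fits(name, form="NFD", maxlen=ZAP_MAXNAMELEN):
    """Return (fits, size), where size is the normalized byte length plus the NUL.

    Mirrors zap_normalize: the input length counts the terminating NUL and the
    output buffer holds maxlen bytes.
    """
    size = len(unicodedata.normalize(form, name).encode("utf-8")) + 1
    return size <= maxlen, size


# Words of the reported name, joined with single spaces (an assumption).
words = ["Auf", "dem", "Wasser", "zu", "Singen", "물", "위에서", "노래", "부름",
         "F", "Schubert", "고도희", "김경미", "김한나", "Piano", "김보겸",
         "2011년", "이주경", "교수와"]
prefix = unicodedata.normalize("NFC", " ".join(words) + " ")

for last in ("아", "하", "함"):
    name = unicodedata.normalize("NFC", prefix + last)
    stored = len(name.encode("utf-8"))
    fits, size = normalized_fits(name)
    print(f"...{last}: {stored} bytes stored, {size} bytes decomposed with NUL, fits={fits}")
```

Under these assumptions the names ending in 아 and 하 decompose to 255 bytes, which just fits together with the NUL, while the name ending in 함 needs 258 bytes. That would be consistent with the touch results reported earlier in the thread.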

Note that this issue affects not only Japanese or Korean users, but any language with glyphs composed of multiple parts; for example, ё (d1 91) in #2234 is decomposed to d0 b5 cc 88.
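The byte sequences quoted above can be reproduced with a few lines of Python. This is a sketch that relies on the standard unicodedata module rather than on ZFS's u8_textprep_str; utf8_hex and show_decomposition are hypothetical helper names.

```python
import unicodedata


def utf8_hex(text):
    """Format the UTF-8 encoding of text as space-separated hex bytes."""
    return " ".join(f"{b:02x}" for b in text.encode("utf-8"))


def show_decomposition(ch):
    """Return (composed_hex, decomposed_hex) for a single character."""
    composed = unicodedata.normalize("NFC", ch)
    decomposed = unicodedata.normalize("NFD", composed)
    return utf8_hex(composed), utf8_hex(decomposed)


for ch in ("の", "ザ", "함", "ё"):
    before, after = show_decomposition(ch)
    print(f"{ch}: {before} -> {after}")
```

A character whose two columns are identical has no canonical decomposition and costs nothing extra under formD; every other character grows by the difference between the two columns.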

A possible solution is to double the size of ZAP_MAXNAMELEN, but I'm not familiar with the code base and this might introduce undesired side effects, and doubling ZAP_MAXNAMELEN still won't resolve some cases (Korean, for example).

So I decided to stop using formD as suggested in many places, and use formC for normalization instead.

Apr 15 '23 02:04 vizv

Distribution Name    | Debian
Distribution Version | 12 (stable)
Kernel Version       | 6.1.0-17
Architecture         | amd64
OpenZFS Version      | zfs-2.1.11-1, zfs-kmod-2.1.11-1

Ran into this issue after setting up a new pool with formD and copying data from a pool with neither normalization nor utf8only enabled. Picking a sample file showed that even though wc -c reported the file name as 149 bytes long (including some Korean characters), file creation (e.g. with touch) failed with "Operation not supported". Dropping a single ASCII character from the end of the name (wc -c now reporting 148) worked. I have recreated the target pool with formC instead and started the copy process; I will edit this message in a few days once that has finished and report whether formC helped.

EDIT: Using formC for the pool has allowed the copy to complete with no issues. Not that this fixes the underlying issue, obviously, but it gives a workaround.
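Before a copy like this, the source tree could be scanned for names that would not fit once normalized. The sketch below is only an approximation: it assumes a limit of 255 bytes for the normalized name, uses Python's unicodedata instead of the ZFS normalization code, and names_at_risk and scan_tree are hypothetical helpers.

```python
import os
import unicodedata


def names_at_risk(names, form="NFD", limit=255):
    """Yield (name, size) for names whose normalized UTF-8 length exceeds limit."""
    for name in names:
        normalized = unicodedata.normalize(form, name)
        # surrogateescape keeps undecodable bytes reported by the OS countable.
        size = len(normalized.encode("utf-8", errors="surrogateescape"))
        if size > limit:
            yield name, size


def scan_tree(root, form="NFD", limit=255):
    """Walk root and yield (path, size) for every entry name that would not fit."""
    for dirpath, dirnames, filenames in os.walk(root):
        for name, size in names_at_risk(dirnames + filenames, form, limit):
            yield os.path.join(dirpath, name), size


# Small demonstration on in-memory names rather than a real directory.
demo = ["short.txt", "ザ" * 43]
for name, size in names_at_risk(demo):
    print(f"{size} bytes after NFD: {name}")
```

Running scan_tree over the source pool's mount point with form="NFC" as well would give a rough idea of whether switching the target to formC avoids every problem name or only most of them.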

Jan 27 '24 18:01 Missingmew

So I decided to stop using formD as suggested in many places, and use formC for normalization instead.

This might work in most cases but NFC can still expand strings.

A possible solution is to double the size of ZAP_MAXNAMELEN, but I'm not familiar with the code base and this might introduce undesired side effects, and doubling ZAP_MAXNAMELEN still won't resolve some cases (Korean, for example).

Changing ZAP_MAXNAMELEN is bad because it controls max name length of almost everything.

We could change the buffer used for normalization, although doubling is not good enough. According to https://www.unicode.org/faq/normalization.html#12 the expansion factor for NFC/NFD is 3. To make it worse, NFKC/NFKD have an expansion factor of 11, and I believe ZFS supports them as an option too. This means that we can't just increase the buffer size and call it a day, because there's both a stack buffer (which can't be very big due to the kernel stack size) and a buffer inside zap_name_t (if we increase that buffer, we waste a lot of memory even though very long file names are rare).
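To put numbers on those factors, the sketch below measures the UTF-8 expansion of two characters and the scratch space a worst-case name would need. It uses Python's unicodedata, not the ZFS code. U+FDFA is my own choice of a character with a long compatibility decomposition, the 255-byte name length is an assumption, and expansion_factor is a hypothetical helper.

```python
import unicodedata


def expansion_factor(ch, form):
    """Return the normalized UTF-8 length divided by the original UTF-8 length."""
    before = len(ch.encode("utf-8"))
    after = len(unicodedata.normalize(form, ch).encode("utf-8"))
    return after / before


# U+D568 is the Hangul syllable discussed in this thread; U+FDFA is an Arabic
# ligature picked here as an example of compatibility decomposition.
print("NFD  U+D568:", expansion_factor("\ud568", "NFD"))
print("NFKD U+FDFA:", expansion_factor("\ufdfa", "NFKD"))

# Scratch space needed for a 255-byte name at each worst-case factor.
for label, factor in (("NFC/NFD", 3), ("NFKC/NFKD", 11)):
    print(f"{label}: up to {255 * factor + 1} bytes including the NUL")
```

The second loop shows why a fixed-size buffer is awkward: covering the compatibility forms takes several kilobytes per name, which is a lot for a kernel stack frame or for every zap_name_t.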

Jun 22 '24 12:06 nbdd0121
