file:list_dir/1 returns wrong directory name for high codepoints on Windows

Open • josevalim opened this issue 4 years ago • 1 comment

Take this filename: "🎠.txt", with the Carousel Horse emoji, code point U+1F3A0 (127904 in decimal).

Calling file:list_dir/1 or file:list_dir_all/1 in the directory containing that file returns the wrong filename, with the emoji split into two UTF-16 surrogate code units (55356 and 57248):

1> file:list_dir(".").
{ok, [[55356,57248,46,116,120,116]]}

While it should return:

1> file:list_dir(".").
{ok, [[127904,46,116,120,116]]}

This happens on Windows, in both werl and erl, with and without the +fnu flag. I was able to reproduce it on OTP 21 and OTP 23.1.

I have noticed that a filename containing the code point U+FF01 (65281 in decimal) works fine, but I could not find a code point with five hex digits that worked (though I haven't tried them all).
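
For illustration only, the difference between the two cases can be reproduced with plain UTF-16 arithmetic. The sketch below is in Python rather than Erlang, and the function name utf16_code_units is hypothetical. It is not taken from OTP. It assumes the input is a valid Unicode scalar value and shows that U+FF01 fits in one 16-bit unit while U+1F3A0 needs a surrogate pair, which matches the numbers in the wrong result above.

```python
from typing import List


def utf16_code_units(code_point: int) -> List[int]:
    """Return the UTF-16 code units for one code point (hypothetical helper)."""
    if code_point < 0x10000:
        # Basic Multilingual Plane: a single unit equal to the code point.
        return [code_point]
    # Supplementary planes: split the 20-bit offset into two 10-bit halves.
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # high (leading) surrogate
    low = 0xDC00 + (offset & 0x3FF)   # low (trailing) surrogate
    return [high, low]


print(utf16_code_units(0xFF01))   # [65281]
print(utf16_code_units(0x1F3A0))  # [55356, 57248]
```

Every code point with five hex digits is at or above U+10000, so it would presumably hit the same surrogate-pair path.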

josevalim • Apr 28 '21 13:04

Thanks for your report!

Windows filenames use a strange UTF-16 variant where unpaired or unordered surrogates are allowed, so we have a special conversion routine to deal with this encoding. Unfortunately it seems to handle the problem by returning all 16-bit code units as they are, making no effort to decode surrogate pairs into code points. :-(

I think the most reasonable way to fix this is to treat filenames as ordinary UTF-16 and fall back to raw filenames whenever they're invalid, much like we do for UTF-8. It's not backwards compatible but I have a hard time seeing anyone rely on this behavior. We'll try to fix it in OTP 25.
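
As a rough sketch of that idea (in Python for illustration, not the actual OTP change), the decoder below combines well-formed surrogate pairs into code points and gives up on the whole name as soon as it sees an unpaired or out-of-order surrogate. The names decode_filename and raw_filename are hypothetical, and representing the raw fallback as little-endian UTF-16 bytes is an assumption made only for this example.

```python
from typing import List, Union


def raw_filename(units: List[int]) -> bytes:
    """Hypothetical raw form: the code units as little-endian 16-bit bytes."""
    return b"".join(u.to_bytes(2, "little") for u in units)


def decode_filename(units: List[int]) -> Union[List[int], bytes]:
    """Decode strict UTF-16; fall back to the raw name if it is invalid."""
    code_points = []
    i = 0
    while i < len(units):
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:
            # A high surrogate is only valid when a low surrogate follows.
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                low = units[i + 1]
                code_points.append(
                    0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00)
                )
                i += 2
                continue
            return raw_filename(units)
        if 0xDC00 <= unit <= 0xDFFF:
            # A low surrogate with no preceding high surrogate is invalid.
            return raw_filename(units)
        code_points.append(unit)
        i += 1
    return code_points


# The name from the report decodes to the expected code points.
print(decode_filename([55356, 57248, 46, 116, 120, 116]))
# [127904, 46, 116, 120, 116]

# An unpaired high surrogate makes the whole name fall back to bytes.
print(isinstance(decode_filename([55356, 46]), bytes))  # True
```

Falling back for the whole name, rather than passing individual surrogates through, keeps the result unambiguous: a caller gets either a list of real code points or a raw name, never a mix.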

jhogberg • May 03 '21 14:05