`ls --hyperlink` doesn't properly percent-escape

Open egmontkob opened this issue 3 weeks ago • 1 comments

I've created two files with UTF-8 encoded filenames: one is called é (U+00E9, i.e. 0xC3 0xA9), the other █ (U+2588, i.e. 0xE2 0x96 0x88).

I'm running `ls --hyperlink`.

For the first file, the URL inside the OSC 8 escape sequence contains the letter é in its unchanged UTF-8 representation. This may open correctly in some circumstances, but the "Hyperlinks in terminal emulators" spec explicitly states that any high byte must be percent-encoded. The raw format is just too fragile: implementations might easily get its charset wrong, a layer of luit or similar would break it, etc.

For the second file, the URL contains %88, i.e. only the last of the three bytes appears percent-encoded and the first two bytes disappear. This clearly does not open the URL correctly in any terminal.

It's expected that even files whose names are invalid in the current locale (e.g. invalid UTF-8) can be located, i.e. the exact byte sequence is always preserved. Only printable ASCII characters can appear as-is, and even there some special characters need to be escaped, such as ? and #, which would otherwise start the query and fragment parts (these two are handled correctly). Every byte above 127 needs to be percent-escaped individually, regardless of charset.
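To illustrate the expected behavior, here is a minimal sketch of byte-wise percent-encoding. The function name `percent_encode_path` is hypothetical (it is not the name used in uutils), the set of bytes left unescaped is a conservative assumption (RFC 3986 unreserved characters plus `/`), and the code is Unix-only because it relies on `OsStrExt::as_bytes`.

```rust
use std::ffi::OsStr;
use std::os::unix::ffi::OsStrExt; // Unix-only: exposes the raw bytes of a filename

/// Hypothetical helper: percent-encode a path one byte at a time.
/// Assumption: only RFC 3986 "unreserved" characters and '/' are kept as-is;
/// everything else, including every byte above 127, becomes %XX.
fn percent_encode_path(path: &OsStr) -> String {
    let mut out = String::new();
    for &b in path.as_bytes() {
        match b {
            b'A'..=b'Z' | b'a'..=b'z' | b'0'..=b'9'
            | b'-' | b'.' | b'_' | b'~' | b'/' => out.push(b as char),
            // Works on raw bytes, so invalid UTF-8 is preserved exactly.
            _ => out.push_str(&format!("%{:02X}", b)),
        }
    }
    out
}
```

With this sketch, é becomes `%C3%A9` and █ becomes `%E2%96%88`, and a filename containing a lone 0xFF byte still round-trips as `%FF`.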

v0.2.2 from Ubuntu 25.10

egmontkob avatar Dec 01 '25 22:12 egmontkob

The observation that only the last UTF-8 byte of U+2588 is percent-escaped was not quite correct. It's the low byte of the Unicode code point that's printed like that. The two often happen to be the same (0x88 in this case), but even more often they are different.

create_hyperlink() first calls to_string_lossy(), which is already a red flag: the URI cannot be sloppy, it needs to point to the very file, even if its name contains invalid UTF-8. This method replaces any invalid UTF-8 with U+FFFD, which in turn appears as %fd in the OSC 8 escape sequence.

Later, .chars() is called on absolute_path, so the code processes each character rather than each byte. It then formats c as u8 with :02x, and that's where only the low byte is kept.
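The truncation can be reproduced in isolation. The following is a simplified sketch, not the actual uutils code: both function names are hypothetical, and for brevity they escape every character, whereas the real code escapes only some.

```rust
/// Mimics the problem described above: iterate by `char`, then cast to `u8`.
/// The cast keeps only the low 8 bits of the Unicode code point.
fn low_byte_escape(s: &str) -> String {
    s.chars().map(|c| format!("%{:02x}", c as u8)).collect()
}

/// Byte-wise variant: every UTF-8 byte is escaped individually.
fn bytewise_escape(s: &str) -> String {
    s.bytes().map(|b| format!("%{:02x}", b)).collect()
}
```

For █ (U+2588) the first function yields `%88` while the second yields `%e2%96%88`. Feeding the invalid byte 0xFF through `String::from_utf8_lossy` first produces U+FFFD, which the first function turns into `%fd`.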

PoC fix, seems to work for me: uutils-coreutils-9538-ls-hyperlink-escaping.patch

This is literally (yes, truly, literally) the very first time I'm ever writing any Rust code in my life, or looking at any for more than a few seconds. So surely I'm doing many things in a sub-ideal way. Feel free to throw away this patch and come up with your own (including unit tests) :)

egmontkob avatar Dec 02 '25 21:12 egmontkob