
decide if and how to support non-UTF-8, single byte field separators/delimiters on non-Unix

Open · jtracey opened this issue 3 years ago • 1 comment

Related: #554

As part of getting GNU's join tests to pass (#2634), we implemented support for non-Unicode field separators on unix-like platforms (#2902). The reason for not supporting other platforms is that only std::os::unix::ffi::OsStrExt provides the as_bytes() method for OsStrs (see the std::ffi docs on conversions). Clap can only provide arguments in one of two forms: as Rust Strings, or as OsStrings. The former can only represent valid Unicode data, and the latter has a platform-dependent representation, which is by default opaque to the consumer. Because of this, most OSs cannot directly represent arbitrary single bytes in arguments.
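For context, the asymmetry looks roughly like this (a minimal sketch; `separator_bytes` is a hypothetical helper name, and only the Unix branch can recover arbitrary bytes):

```rust
use std::ffi::OsStr;

/// Hypothetical helper: recover the raw bytes of a separator argument.
#[cfg(unix)]
fn separator_bytes(s: &OsStr) -> Option<&[u8]> {
    use std::os::unix::ffi::OsStrExt;
    // On Unix an OsStr is just an arbitrary byte sequence, so this is lossless.
    Some(s.as_bytes())
}

#[cfg(not(unix))]
fn separator_bytes(s: &OsStr) -> Option<&[u8]> {
    // Elsewhere the only portable path is through &str, which requires the
    // argument to be valid Unicode; arbitrary non-UTF-8 bytes are unreachable.
    s.to_str().map(str::as_bytes)
}
```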

Using a non-ASCII byte as a field separator is somewhat rare, but far from unheard of. It should be decided if this usage is common enough to warrant supporting it on other, non-unix platforms, and if so, how that support should be implemented. The options I see are:

  • Extend the \0 syntax implemented in #2881. While GNU doesn't support it, we could fairly easily extend this to parse everything following the \ as a u8 (presumably following printf's syntax). This would have the advantage of also making it easier to use non-Unicode values generally, even on Unix platforms, where they currently have to be created using workarounds like $(printf '\247'). The disadvantage is that it contradicts GNU's behavior in a literal sense, though not in any way that is currently tested.
  • Add a new option. Similar to the above, but rather than extending an existing option, it would avoid directly contradicting GNU's behavior by making a new --separator-value option or something. Presumably it shouldn't be hard to pick a name that has a very low probability of ever colliding with a future GNU option. The disadvantage of this is it adds a redundant option to something already exposed, in a non-standard way.
  • OS-specific hacks. This would only be available on Windows, since it's the only other OS that exposes its internal OsString representation in any way, but we could hack around the UTF-16 values to represent any single byte value we want. E.g., my understanding (someone with a Windows dev environment can confirm) is that there are ways to pass invalid UTF-16 arguments from the command line, so we could choose to interpret values from 0xD800 to 0xD8FF as the bytes from 0x00 to 0xFF, respectively. Currently, these values (in isolation) will cause an error, as they can't be turned into UTF-8, and have no obvious alternative meaning, so they are in some sense "safe" to overload (i.e., GNU can't even represent these values, let alone have intended behavior for them). The disadvantages are that this could only be exposed on Windows, and is a pretty unintuitive hack that would need some explaining. (A rough sketch of this mapping follows the list.)
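To make that last option concrete, here is a rough sketch of the surrogate mapping under the assumptions above (nothing like this exists today, and the helper name is made up); it relies on `encode_wide()` from `std::os::windows::ffi::OsStrExt`, which reproduces unpaired surrogates as-is:

```rust
/// Hypothetical: interpret a lone unpaired surrogate in 0xD800..=0xD8FF as the
/// raw byte 0x00..=0xFF. Such a value can never be converted to UTF-8, so no
/// existing GNU behavior is attached to it.
#[cfg(windows)]
fn separator_from_surrogate(s: &std::ffi::OsStr) -> Option<u8> {
    use std::os::windows::ffi::OsStrExt;
    let units: Vec<u16> = s.encode_wide().collect();
    match units.as_slice() {
        // Exactly one code unit, and it falls in the overloaded range.
        &[w] if (0xD800..=0xD8FF).contains(&w) => Some((w - 0xD800) as u8),
        _ => None,
    }
}
```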

The options basically run from "most elegant but least safe" to "safest but least elegant", where "safe" here is "unlikely to conflict with GNU join behavior" (e.g., if someone were to write a fuzzer to compare behavior, how careful would they have to be). My personal preference is for the first option, but I also don't use non-unix platforms, so I don't need much of a say.

jtracey avatar Feb 05 '22 18:02 jtracey

It occurs to me that this isn't actually just a join issue, several utils have very closely related problems that should probably use the same solution.

| util  | option   | GNU escape sequences        |
|-------|----------|-----------------------------|
| cut   | -d DELIM | none                        |
| join  | -t CHAR  | \0                          |
| paste | -d LIST  | \0, \b, \f, \n, \r, \t, \v* |
| sort  | -t SEP   | \0                          |
| split | -t SEP   | \0                          |

Of these, only our join implementation seems to handle non-UTF-8 bytes properly so far. Our paste implementation is also broken when the delimiter list contains what can be interpreted as a multi-byte character (it treats it as a single delimiter, instead of as a list of multiple delimiters). Our split implementation doesn't support -t at all yet (#3192).

Paste is weird, in that its -d takes a list of "characters" that are rotated through, and also in that it supports many of the C escape sequences, but not all. This is just based on experimentation (the GNU paste man page and info page are pretty sparse), but it seems that aside from the escape sequences listed, everything else gets interpreted as the literal next character. E.g., paste interprets \a as a, not the bell character, and \10 as the list of characters '1' and '0', not octal for 8. This all throws a wrench into the first option I gave (supporting printf's escape syntax) if we want to share the same solution for all utilities, since e.g. paste -d "\041" would be interpreted by GNU as [0, '4', '1'] while ours would be ['!'], a much more problematic divergence in behavior than accepting arguments GNU rejects.

I'd like to see our cut support \0 somehow, since without it, there's no way to cut on the null byte delimiter, even on Unix platforms. I'm not sure why GNU never added that functionality.

My current proposal then is a new option for all the above:

-D, --parsed-delimiter=DELIM
         like [-d/-t], but parse \-escape sequences like printf(1) would

and keep the existing options as close to GNU as possible. This would have the added benefit that users no longer have to remember, e.g., "does join use -t like sort, or -d like cut?", and can just use -D for everything, with consistent behavior.
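As a starting point, here is a minimal sketch of the escape parsing such an option could do, assuming a small printf(1)-style subset (the function name, the exact escape set, and the overflow handling are all placeholders, not an existing uutils API):

```rust
/// Hypothetical parser for the value of the proposed -D/--parsed-delimiter:
/// expand a printf(1)-style subset of backslash escapes into raw bytes, so any
/// single byte (UTF-8 or not) can be named in a plain Unicode argument.
fn parse_delimiter(arg: &str) -> Vec<u8> {
    let mut out = Vec::new();
    let mut chars = arg.chars().peekable();
    while let Some(c) = chars.next() {
        if c != '\\' {
            // Ordinary characters pass through as their UTF-8 bytes.
            let mut buf = [0u8; 4];
            out.extend_from_slice(c.encode_utf8(&mut buf).as_bytes());
            continue;
        }
        match chars.next() {
            Some('\\') => out.push(b'\\'),
            Some('a') => out.push(0x07),
            Some('b') => out.push(0x08),
            Some('f') => out.push(0x0c),
            Some('n') => out.push(b'\n'),
            Some('r') => out.push(b'\r'),
            Some('t') => out.push(b'\t'),
            Some('v') => out.push(0x0b),
            // \NNN: one to three octal digits, the printf(1) way; this is what
            // lets a non-UTF-8 byte like \247 be written portably.
            Some(d @ '0'..='7') => {
                let mut value = d.to_digit(8).unwrap();
                for _ in 0..2 {
                    match chars.peek().and_then(|c| c.to_digit(8)) {
                        Some(digit) => {
                            value = value * 8 + digit;
                            chars.next();
                        }
                        None => break,
                    }
                }
                // Truncate to one byte; whether to error on overflow instead
                // is an open detail.
                out.push((value & 0xFF) as u8);
            }
            // Unknown escapes are kept literally here; erroring out instead is
            // another open detail.
            Some(other) => {
                let mut buf = [0u8; 4];
                out.push(b'\\');
                out.extend_from_slice(other.encode_utf8(&mut buf).as_bytes());
            }
            None => out.push(b'\\'),
        }
    }
    out
}
```

Because the input is an ordinary clap String rather than an OsString, this behaves identically on every platform, which is the point of the option.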

jtracey avatar Jun 15 '22 00:06 jtracey