Support for non-valid UTF-8 strings

Open ismaelgv opened this issue 6 years ago • 2 comments

Extend code to support non-valid UTF-8 strings in filenames, paths and arguments:

  • Use OsStr and OsString.
  • Follow OsStr pattern API extension in Rust repository.
  • Check issues with current crates: clap, regex, walkdir and ansi_term.
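As a rough illustration of the first bullet, here is a minimal sketch of a file name being edited purely through `OsStr`/`OsString`, so it never has to be valid UTF-8. The helpers `append_suffix` and `invalid_utf8_name` are hypothetical names made up for this example and are not part of rnr; the invalid name is built with the Unix-only `OsStringExt::from_vec`.

```rust
use std::ffi::{OsStr, OsString};
use std::path::{Path, PathBuf};

/// Hypothetical helper: append `suffix` to the file name of `path`
/// without ever converting the name to `str`.
fn append_suffix(path: &Path, suffix: &OsStr) -> Option<PathBuf> {
    // `file_name` returns None for paths such as ".." or "/".
    let name = path.file_name()?;
    let mut new_name: OsString = name.to_os_string();
    new_name.push(suffix);
    Some(path.with_file_name(new_name))
}

/// Hypothetical helper (Unix only): a file name that is not valid UTF-8.
#[cfg(unix)]
fn invalid_utf8_name() -> OsString {
    use std::os::unix::ffi::OsStringExt;
    // The byte 0xFF can never appear in well-formed UTF-8.
    OsString::from_vec(vec![b'f', b'o', 0xFF, b'o'])
}
```

With this approach `Path::to_str` would return `None` for such a name, but the rename target can still be computed because no `str` conversion is involved.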

ismaelgv avatar Jul 12 '18 16:07 ismaelgv

Right now it is not possible to convert OsStr(ing) to &[u8] on Windows to be used in regex::bytes::Regex::replace without losing information. For example, ripgrep uses a to_string_lossy conversion to obtain a &[u8] on Windows.
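A minimal sketch of the asymmetry described above, using only the standard library. The function name `os_str_to_bytes` is hypothetical and this is not ripgrep's actual code: on Unix the bytes are borrowed losslessly, while on other platforms the sketch falls back to `to_string_lossy`, which replaces anything that is not valid Unicode with U+FFFD.

```rust
use std::borrow::Cow;
use std::ffi::OsStr;

/// Hypothetical helper: view an `OsStr` as bytes suitable for a byte-oriented regex.
/// On Unix this is lossless and borrows the underlying bytes.
#[cfg(unix)]
fn os_str_to_bytes(s: &OsStr) -> Cow<'_, [u8]> {
    use std::os::unix::ffi::OsStrExt;
    Cow::Borrowed(s.as_bytes())
}

/// On other platforms (e.g. Windows) this sketch goes through a lossy
/// conversion, so ill-formed input becomes U+FFFD and cannot be recovered.
#[cfg(not(unix))]
fn os_str_to_bytes(s: &OsStr) -> Cow<'_, [u8]> {
    match s.to_string_lossy() {
        Cow::Borrowed(valid) => Cow::Borrowed(valid.as_bytes()),
        Cow::Owned(lossy) => Cow::Owned(lossy.into_bytes()),
    }
}
```

The resulting bytes could then be handed to something like `regex::bytes::Regex::replace`; turning the replaced bytes back into an `OsString` is only lossless on the Unix branch.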

ismaelgv avatar Aug 01 '18 23:08 ismaelgv

Yeah, this is something I've always wondered about. So far, I haven't had anyone complain about cases where information is lost, i.e., when there's an invalid UTF-16 file path on Windows. One presumes that this might be so infrequent that it may not be a blocking problem in practice.

Getting a real fix for this is tricky. One possibility is to use the underlying representation of an OsStr (which is WTF-8), but this is not part of the public API. Another possibility is to re-create WTF-8 decoding outside of std using the Windows version of the OsStrExt trait. This incurs a second WTF-8 decoding step; however, it's no worse than the lossy UTF-8 decoding that I'm already doing.
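A rough sketch of the second possibility, under the assumption that the UTF-16 code units come from `encode_wide` on the Windows `OsStrExt` trait and are re-encoded as WTF-8 by hand. The names `wtf8_from_wide` and `os_str_to_wtf8` are hypothetical, and this is not how std or ripgrep implement it; lone surrogates are written using the generalized 3-byte form that WTF-8 permits.

```rust
/// Hypothetical helper: encode potentially ill-formed UTF-16 as WTF-8.
/// Well-formed input produces ordinary UTF-8; a lone surrogate is kept
/// as a 3-byte sequence instead of being replaced with U+FFFD.
fn wtf8_from_wide(units: &[u16]) -> Vec<u8> {
    let mut out = Vec::with_capacity(units.len());
    for item in char::decode_utf16(units.iter().copied()) {
        match item {
            Ok(ch) => {
                let mut buf = [0u8; 4];
                out.extend_from_slice(ch.encode_utf8(&mut buf).as_bytes());
            }
            Err(e) => {
                // Surrogates lie in 0xD800..=0xDFFF, so three bytes suffice.
                let s = e.unpaired_surrogate() as u32;
                out.push(0xE0 | (s >> 12) as u8);
                out.push(0x80 | ((s >> 6) & 0x3F) as u8);
                out.push(0x80 | (s & 0x3F) as u8);
            }
        }
    }
    out
}

/// Windows-only wrapper: obtain the code units via `encode_wide`.
#[cfg(windows)]
fn os_str_to_wtf8(s: &std::ffi::OsStr) -> Vec<u8> {
    use std::os::windows::ffi::OsStrExt;
    let wide: Vec<u16> = s.encode_wide().collect();
    wtf8_from_wide(&wide)
}
```

Going back from the edited bytes to an `OsString` would need the inverse step (WTF-8 to UTF-16, then `OsStringExt::from_wide`), which is omitted here.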

BurntSushi avatar Sep 05 '18 00:09 BurntSushi