rnr
Support for non-valid UTF-8 strings
Extend code to support non-valid UTF-8 strings in filenames, paths and arguments:
- Use OsStr and OsString.
- Follow the OsStr pattern API extension in the Rust repository.
- Check issues with current crates: `clap`, `regex`, `walkdir` and `ansi_term`.
Right now it is not possible to convert `OsStr(ing)` to `&[u8]` on Windows to be used in `regex::bytes::Regex::replace` without losing information. For example, `ripgrep` uses a `to_string_lossy` conversion to obtain a `&[u8]` on Windows.
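To make the information loss concrete, here is a minimal sketch using only the standard library. The helper names `invalid_pair` and `lossy_bytes` are made up for illustration and do not come from rnr or ripgrep. It builds two different OS strings that are not valid Unicode and shows that a `to_string_lossy` conversion maps both to the same bytes.

```rust
use std::ffi::{OsStr, OsString};

/// Hypothetical helper: two distinct OS strings that are not valid Unicode.
/// On Unix they contain bytes that are invalid in UTF-8.
#[cfg(unix)]
fn invalid_pair() -> (OsString, OsString) {
    use std::os::unix::ffi::OsStringExt;
    (
        OsString::from_vec(vec![b'a', 0xFF, b'b']),
        OsString::from_vec(vec![b'a', 0xFE, b'b']),
    )
}

/// On Windows they contain unpaired UTF-16 surrogates.
#[cfg(windows)]
fn invalid_pair() -> (OsString, OsString) {
    use std::os::windows::ffi::OsStringExt;
    (
        OsString::from_wide(&[0x61, 0xD800, 0x62]),
        OsString::from_wide(&[0x61, 0xD801, 0x62]),
    )
}

/// Hypothetical helper: lossy conversion to bytes via `to_string_lossy`.
/// Every invalid sequence is replaced by U+FFFD, so the original data
/// cannot be recovered from the result.
fn lossy_bytes(s: &OsStr) -> Vec<u8> {
    s.to_string_lossy().as_bytes().to_vec()
}
```

Calling `lossy_bytes` on both strings from `invalid_pair()` gives identical byte vectors even though the two inputs compare as different, which is why a rename built from the lossy bytes could target the wrong name.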
Yeah, this is something I've always wondered about. So far, I haven't had anyone complain about cases where information is lost, i.e., when there's an invalid UTF-16 file path on Windows. One presumes that this might be so infrequent that it may not be a blocking problem in practice.
Getting a real fix for this is tricky. One possibility is to use the underlying representation of an `OsStr` (which is WTF-8), but this is not part of the public API. Another possibility is to re-create WTF-8 decoding outside of `std` using the Windows version of the `OsStrExt` trait. This incurs a second WTF-8 decoding step; however, it's no worse than the lossy UTF-8 decoding that I'm already doing.
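The second possibility could look roughly like the sketch below. It is an assumption about how such a conversion might be written, not code from ripgrep or rnr, and the function names (`push_code_point`, `wtf8_from_wide`, `os_str_to_bytes`) are hypothetical. On Windows it reads the UTF-16 code units through `OsStrExt::encode_wide` and re-encodes them as WTF-8, keeping unpaired surrogates as three-byte sequences instead of replacing them. On Unix the bytes are taken directly with `OsStrExt::as_bytes`.

```rust
use std::ffi::OsStr;

/// Append the generalized UTF-8 encoding of `cp` to `out`.
/// Unlike strict UTF-8, surrogate code points (U+D800..=U+DFFF) are allowed.
fn push_code_point(out: &mut Vec<u8>, cp: u32) {
    match cp {
        0..=0x7F => out.push(cp as u8),
        0x80..=0x7FF => {
            out.push(0xC0 | (cp >> 6) as u8);
            out.push(0x80 | (cp & 0x3F) as u8);
        }
        0x800..=0xFFFF => {
            out.push(0xE0 | (cp >> 12) as u8);
            out.push(0x80 | ((cp >> 6) & 0x3F) as u8);
            out.push(0x80 | (cp & 0x3F) as u8);
        }
        _ => {
            out.push(0xF0 | (cp >> 18) as u8);
            out.push(0x80 | ((cp >> 12) & 0x3F) as u8);
            out.push(0x80 | ((cp >> 6) & 0x3F) as u8);
            out.push(0x80 | (cp & 0x3F) as u8);
        }
    }
}

/// Encode potentially ill-formed UTF-16 as WTF-8.
/// Valid surrogate pairs become one four-byte sequence; lone surrogates
/// are kept as three-byte sequences, so no information is lost.
fn wtf8_from_wide(units: &[u16]) -> Vec<u8> {
    let mut out = Vec::with_capacity(units.len());
    let mut i = 0;
    while i < units.len() {
        let unit = units[i] as u32;
        if (0xD800..=0xDBFF).contains(&unit) && i + 1 < units.len() {
            let next = units[i + 1] as u32;
            if (0xDC00..=0xDFFF).contains(&next) {
                // High surrogate followed by low surrogate: combine them.
                let cp = 0x10000 + ((unit - 0xD800) << 10) + (next - 0xDC00);
                push_code_point(&mut out, cp);
                i += 2;
                continue;
            }
        }
        // BMP scalar value or unpaired surrogate: encode as-is.
        push_code_point(&mut out, unit);
        i += 1;
    }
    out
}

/// Windows: go through the UTF-16 view exposed by `encode_wide`.
#[cfg(windows)]
fn os_str_to_bytes(s: &OsStr) -> Vec<u8> {
    use std::os::windows::ffi::OsStrExt;
    let wide: Vec<u16> = s.encode_wide().collect();
    wtf8_from_wide(&wide)
}

/// Unix: an `OsStr` is already an arbitrary byte sequence.
#[cfg(unix)]
fn os_str_to_bytes(s: &OsStr) -> Vec<u8> {
    use std::os::unix::ffi::OsStrExt;
    s.as_bytes().to_vec()
}
```

The resulting bytes can be passed to `regex::bytes::Regex`, since it matches on `&[u8]`. Turning the replaced bytes back into an `OsString` on Windows would need the inverse conversion (WTF-8 back to UTF-16 code units), which this sketch does not cover.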