Suggestion: Add an alternative function to output as UTF-16LE `&[u8]` slice

Open ColinFinck opened this issue 5 years ago • 1 comments

In the Windows world UTF-16 strings are not only encountered when interfacing with APIs, but also in a few on-disk structures (e.g. NT registry hives or NTFS filesystems). This complicates interoperability with Rust's UTF-8 world, especially in no_std environments.

My current approach when writing a parser for such an on-disk structure is as follows:

1. I define my own Utf16ByteString type that just wraps a &[u8].
2. All parser functions that output a string just return the byte slice encompassing that string in a Utf16ByteString. This has zero cost.
3. For users with alloc or std, my Utf16ByteString provides a to_string function that uses char::decode_utf16(bytes.chunks_exact(2).map(|two_bytes| u16::from_le_bytes(two_bytes.try_into().unwrap()))) internally. Apart from the required allocations, this function also comes with decoding overhead.
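The approach above could look roughly like the following sketch. The actual type definition is not shown in this issue, so the tuple-struct layout and the lossy handling of unpaired surrogates are assumptions made for illustration only.

```rust
use std::convert::TryInto;

/// Hypothetical reconstruction of the wrapper described above:
/// a zero-cost view over raw UTF-16LE bytes.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Utf16ByteString<'a>(pub &'a [u8]);

impl<'a> Utf16ByteString<'a> {
    /// Decodes the bytes into an owned `String` (requires alloc or std).
    /// Assumption: unpaired surrogates become U+FFFD instead of an error.
    /// A trailing odd byte is ignored by `chunks_exact(2)`.
    pub fn to_string(&self) -> String {
        let units = self
            .0
            .chunks_exact(2)
            .map(|two_bytes| u16::from_le_bytes(two_bytes.try_into().unwrap()));

        char::decode_utf16(units)
            .map(|unit| unit.unwrap_or(char::REPLACEMENT_CHARACTER))
            .collect()
    }
}
```

Constructing the wrapper is just a pointer copy, while to_string walks every code unit and allocates, which is the overhead mentioned in the last step.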

Of course, I would like to avoid to_string wherever I can, and a frequent case where this should be possible is (case-sensitive) comparison. Currently, though, I have to create the comparison byte buffers by hand, e.g. let hello = &[b'H', 0, b'e', 0, b'l', 0, b'l', 0, b'o', 0]. The latest const-utf16 is no help here, as its encode! only outputs a &[u16]. I could transmute my &[u8] to a &[u16], but that would be an unsafe hack and prone to endianness problems.
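To illustrate what such a buffer could look like without writing it by hand or transmuting, here is a minimal sketch of a const fn that turns UTF-16 code units into UTF-16LE bytes at compile time. The helper utf16le_bytes is hypothetical and not part of const-utf16. It also assumes the output length can be spelled out in the constant's type, since stable Rust cannot compute 2 * N in an array type.

```rust
/// Hypothetical helper (not part of const-utf16): serializes UTF-16 code
/// units as little-endian bytes. `M` must be exactly twice `units.len()`;
/// a mismatch fails at compile time when used in a const context.
pub const fn utf16le_bytes<const M: usize>(units: &[u16]) -> [u8; M] {
    assert!(M == units.len() * 2, "output must be twice the input length");

    let mut out = [0u8; M];
    let mut i = 0;
    // `for` loops are not allowed in const fn, so use `while`.
    while i < units.len() {
        let bytes = units[i].to_le_bytes();
        out[2 * i] = bytes[0];
        out[2 * i + 1] = bytes[1];
        i += 1;
    }
    out
}

/// "Hello" as UTF-16 code units, written out by hand for this sketch.
pub const HELLO_UNITS: &[u16] = &[0x48, 0x65, 0x6C, 0x6C, 0x6F];

/// The same string as a UTF-16LE byte buffer, built at compile time.
pub const HELLO: [u8; 10] = utf16le_bytes(HELLO_UNITS);
```

The resulting HELLO can be compared directly against a byte slice taken from disk, independent of the host's endianness.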

Could const-utf16 therefore be extended to alternatively output a UTF-16LE &[u8] slice for such comparisons? Or am I missing a zero-cost alternative here?

Mar 23 '21 18:03 ColinFinck

Hmmmm... I have to think a bit about this. The best possibility would be a safe way to convert &[u16] to &[u8]. Hopefully, Rust will have this capability someday.

I think this could do what you want:

use std::convert::TryInto;

// Returns true if u8_slice, read as UTF-16LE, holds exactly the code units in u16_slice.
fn compare(u16_slice: &[u16], u8_slice: &[u8]) -> bool {
    u16_slice.len() * 2 == u8_slice.len()
        && u16_slice.iter().copied().eq(u8_slice
            .chunks_exact(2)
            .map(|two_bytes| u16::from_le_bytes(two_bytes.try_into().unwrap())))
}

This is fairly low cost: it needs no allocation and no unsafe code.
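As a possible way to use this from the parser side, the same check could be wrapped in a PartialEq implementation. This is only a sketch: Utf16ByteString here is a hypothetical stand-in for the wrapper type mentioned in the original post, whose real definition is not shown.

```rust
use std::convert::TryInto;

/// Hypothetical stand-in for the wrapper type from the original post.
pub struct Utf16ByteString<'a>(pub &'a [u8]);

impl<'a> PartialEq<[u16]> for Utf16ByteString<'a> {
    fn eq(&self, other: &[u16]) -> bool {
        // Same logic as `compare` above: lengths must match exactly,
        // then each little-endian byte pair must equal the code unit.
        other.len() * 2 == self.0.len()
            && other.iter().copied().eq(self
                .0
                .chunks_exact(2)
                .map(|two_bytes| u16::from_le_bytes(two_bytes.try_into().unwrap())))
    }
}
```

With this in place, a parsed name can be compared as `name == *units`, where units is a &[u16] such as the output of encode!. The length check also rejects odd-length byte slices, because an odd length never equals twice the number of code units.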

Mar 24 '21 10:03 rylev