
Add a str::from_utf8_prefix method

Open canndrew opened this issue 4 years ago • 8 comments

I want to take a slice of bytes and get the longest prefix of the slice which is valid utf8. My use case is that I'm reading utf8 data from the network but the read buffer might not be full, so it might contain a partial codepoint at the end. Also, if the remote host sends invalid utf8 it would be useful to be able to process any valid utf8 they send beforehand.

It's trivial to implement this function using Utf8Error::valid_up_to, but it requires unsafe code to avoid performing validation twice:

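use std::str;

fn from_utf8_prefix(bytes: &[u8]) -> &str {
    match str::from_utf8(bytes) {
        Ok(s) => s,
        Err(err) => {
            // SAFETY: `valid_up_to` is the length of the prefix that
            // `from_utf8` has already verified to be valid UTF-8.
            unsafe {
                str::from_utf8_unchecked(&bytes[..err.valid_up_to()])
            }
        },
    }
}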

This is even what the example code for Utf8Error does.

Since the standard library already contains the machinery for attempting utf8 validation and getting the index at which it fails, and since forcing people to write unsafe code for something trivial is generally a bad thing, I think this function should exist in std::str along-side from_utf8 and friends.
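
For comparison, here is a minimal sketch of the safe alternative, which is where the double validation comes from. The name `from_utf8_prefix_safe` is hypothetical and used only for illustration; the second `from_utf8` call re-scans bytes that the first call already verified.

```rust
use std::str;

// Hypothetical name: a safe counterpart that pays for a second validation pass.
fn from_utf8_prefix_safe(bytes: &[u8]) -> &str {
    match str::from_utf8(bytes) {
        Ok(s) => s,
        Err(err) => {
            // Re-validates the prefix. This cannot fail, because
            // `valid_up_to` marks the end of the bytes already verified.
            str::from_utf8(&bytes[..err.valid_up_to()])
                .expect("prefix up to valid_up_to() is valid UTF-8")
        }
    }
}
```

With a truncated code point at the end of the buffer, for example `b"hi\xE2\x82"`, this returns `"hi"`, at the cost of scanning those two bytes twice.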

canndrew avatar Dec 01 '21 07:12 canndrew

I think you could just PR this right?

Diggsey avatar Dec 01 '21 12:12 Diggsey

What about adding a method to Utf8Error that returns the valid prefix or an empty &str (or Option<&str> so that None could indicate that there was no valid utf data)?

let value = str::from_utf8(bytes).unwrap_or_else(|e| e.valid_prefix())

aloucks avatar Dec 01 '21 15:12 aloucks

> adding a method to Utf8Error that returns the valid prefix

Is that backwards-compatible? To do this, Utf8Error would need to gain a lifetime parameter and the signature of from_utf8 would need to change:

// from (manually removing elision)
pub fn from_utf8<'a>(v: &'a [u8]) -> Result<&'a str, Utf8Error>
// to
pub fn from_utf8<'a>(v: &'a [u8]) -> Result<&'a str, Utf8Error<'a>>
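
To make the lifetime concern concrete, here is a rough sketch of what an error carrying the input would look like if built outside std. `BorrowedUtf8Error` and `from_utf8_borrowed` are hypothetical names, not std APIs; the point is only that the error type ends up tied to `'a`.

```rust
use std::str::{self, Utf8Error};

// Hypothetical type, not part of std: pairs the real `Utf8Error` with a
// borrow of the input so that it can hand back the valid prefix.
struct BorrowedUtf8Error<'a> {
    bytes: &'a [u8],
    inner: Utf8Error,
}

impl<'a> BorrowedUtf8Error<'a> {
    fn valid_prefix(&self) -> &'a str {
        let valid = &self.bytes[..self.inner.valid_up_to()];
        // Stays in safe code by validating again; the fallback is never
        // expected to be taken.
        str::from_utf8(valid).unwrap_or("")
    }
}

// Hypothetical wrapper: note the lifetime now appears in the error type.
fn from_utf8_borrowed<'a>(v: &'a [u8]) -> Result<&'a str, BorrowedUtf8Error<'a>> {
    str::from_utf8(v).map_err(|inner| BorrowedUtf8Error { bytes: v, inner })
}
```

With this wrapper, `from_utf8_borrowed(bytes).unwrap_or_else(|e| e.valid_prefix())` gives the one-liner suggested above, but the error can no longer outlive `bytes`.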

shepmaster avatar Dec 02 '21 03:12 shepmaster

> > adding a method to Utf8Error that returns the valid prefix
>
> Is that backwards-compatible? To do this, Utf8Error would need to gain a lifetime parameter and the signature of from_utf8 would need to change:

I don't think that would work, especially since Utf8Error is often stored in an io::Error that is returned to calling functions, potentially all the way to main, likely outliving whatever byte buffer you passed to from_utf8.
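
A small sketch of the pattern in question, assuming a made-up helper named `decoded_len`: the buffer is local to the function, yet the `Utf8Error` wrapped in the `io::Error` is returned to the caller. This compiles today only because `Utf8Error` borrows nothing from the input.

```rust
use std::io;
use std::str;

// Hypothetical helper: `buf` is dropped when this function returns,
// but the error built from `Utf8Error` travels on to the caller.
fn decoded_len() -> io::Result<usize> {
    let buf: Vec<u8> = vec![b'h', b'i', 0xFF];
    let text = str::from_utf8(&buf)
        .map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))?;
    Ok(text.len())
}
```

`io::Error::new` needs an error payload that can be boxed as a `'static` trait object, so an error type borrowing from `buf` would be rejected here.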

programmerjake avatar Dec 02 '21 03:12 programmerjake

You could have the Utf8Error store the byte position and hand out that on request, then you can at least do an unchecked conversion.

Lokathor avatar Dec 02 '21 05:12 Lokathor

the byte position is already stored in the Utf8Error, as shown in OP's example code...

kennytm avatar Dec 02 '21 08:12 kennytm

oops, was only thinking about one reply back, my bad.

Lokathor avatar Dec 02 '21 13:12 Lokathor

In Rust 1.79 this is now bytes.utf8_chunks().next().map_or("", std::str::Utf8Chunk::valid) or variations thereof.
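
One such variation, written out as a function with the name from the issue title. This is a sketch assuming Rust 1.79 or later; it uses a closure because `Utf8Chunk::valid` takes `&self`, while `map_or` hands over the chunk by value.

```rust
// Requires Rust 1.79 or later for `<[u8]>::utf8_chunks`.
fn from_utf8_prefix(bytes: &[u8]) -> &str {
    // The first chunk holds the longest valid prefix (possibly empty);
    // an empty input yields no chunks at all, hence the "" default.
    bytes.utf8_chunks().next().map_or("", |chunk| chunk.valid())
}
```

No `unsafe` is needed and the input is validated only once, up to the first invalid sequence.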

LunarLambda avatar Jun 19 '24 10:06 LunarLambda