Add a str::from_utf8_prefix method
I want to take a slice of bytes and get the longest prefix of the slice which is valid utf8. My use case is that I'm reading utf8 data from the network but the read buffer might not be full, so it might contain a partial codepoint at the end. Also, if the remote host sends invalid utf8 it would be useful to be able to process any valid utf8 they send beforehand.
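The two situations described above (a buffer that ends in the middle of a code point versus genuinely invalid bytes) can be told apart with the existing Utf8Error API. The following is only a sketch: the `Tail` enum and the `classify` function are hypothetical names made up for illustration, not part of std.

```rust
use std::str;

/// Hypothetical classification of what follows the valid prefix.
#[derive(Debug, PartialEq)]
enum Tail {
    /// The whole buffer was valid UTF-8.
    Complete,
    /// The buffer ended in the middle of a code point; more bytes may complete it.
    Incomplete,
    /// An invalid byte sequence was found after the valid prefix.
    Invalid,
}

/// Returns the length of the valid prefix and what kind of data follows it.
fn classify(bytes: &[u8]) -> (usize, Tail) {
    match str::from_utf8(bytes) {
        Ok(_) => (bytes.len(), Tail::Complete),
        Err(err) => {
            // error_len() is None when the input ended unexpectedly,
            // and Some(n) when n invalid bytes were encountered.
            let tail = match err.error_len() {
                None => Tail::Incomplete,
                Some(_) => Tail::Invalid,
            };
            (err.valid_up_to(), tail)
        }
    }
}
```

A reader loop could keep the bytes after the prefix and retry once more data arrives when the tail is `Incomplete`, and report an error when it is `Invalid`.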
It's trivial to implement this function using Utf8Error::valid_up_to, but it requires unsafe code to avoid performing validation twice:
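use std::str;

fn from_utf8_prefix(bytes: &[u8]) -> &str {
    match str::from_utf8(bytes) {
        Ok(s) => s,
        Err(err) => {
            // SAFETY: valid_up_to() is the index up to which the input
            // was verified to be valid UTF-8.
            unsafe { str::from_utf8_unchecked(&bytes[..err.valid_up_to()]) }
        }
    }
}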
This is even what the example code for Utf8Error does.
Since the standard library already contains the machinery for attempting utf8 validation and getting the index at which it fails, and since forcing people to write unsafe code for something trivial is generally a bad thing, I think this function should exist in std::str alongside from_utf8 and friends.
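For comparison, here is a sketch of what the fully safe version looks like today. The name `from_utf8_prefix_safe` is made up for illustration. It avoids unsafe, but at the cost of validating the prefix a second time, which is the trade-off described above.

```rust
use std::str;

/// Safe variant: no `unsafe`, but the valid prefix is validated twice.
fn from_utf8_prefix_safe(bytes: &[u8]) -> &str {
    match str::from_utf8(bytes) {
        Ok(s) => s,
        Err(err) => {
            // Second validation pass over bytes that the first pass
            // already reported as valid.
            str::from_utf8(&bytes[..err.valid_up_to()])
                .expect("prefix up to valid_up_to() is valid UTF-8")
        }
    }
}
```

The `expect` should never fire, since valid_up_to() is documented as the index up to which valid UTF-8 was verified; the extra pass is pure overhead.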
I think you could just PR this right?
What about adding a method to Utf8Error that returns the valid prefix or an empty &str (or Option<&str> so that None could indicate that there was no valid UTF-8 data)?
let value = str::from_utf8(bytes).unwrap_or_else(|e| e.valid_prefix());
> adding a method to Utf8Error that returns the valid prefix
Is that backwards-compatible? To do this, Utf8Error would need to gain a lifetime parameter and the signature of from_utf8 would need to change:
// from (manually removing elision)
pub fn from_utf8<'a>(v: &'a [u8]) -> Result<&'a str, Utf8Error>
// to
pub fn from_utf8<'a>(v: &'a [u8]) -> Result<&'a str, Utf8Error<'a>>
> Is that backwards-compatible? To do this, Utf8Error would need to gain a lifetime parameter and the signature of from_utf8 would need to change:
I don't think that would work, especially since Utf8Error is often stored in an io::Error that is returned to calling functions, potentially all the way to main, likely outliving whatever byte buffer you passed to from_utf8.
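To illustrate the point, here is a sketch of the pattern in question. The `decode` function is a hypothetical helper. It relies on Utf8Error carrying no borrow of the input, so the error can be boxed into an io::Error and returned after the buffer has been dropped.

```rust
use std::io;
use std::str;

/// Hypothetical helper: takes ownership of the buffer, so the buffer is
/// dropped when this function returns.
fn decode(buf: Vec<u8>) -> io::Result<String> {
    let text = str::from_utf8(&buf)
        // io::Error::new requires a 'static error payload. This works
        // today because Utf8Error does not borrow from `buf`.
        .map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))?;
    Ok(text.to_owned())
}
```

If Utf8Error gained a lifetime parameter tied to the input slice, the error produced here would borrow from `buf` and could not be returned this way.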
You could have the Utf8Error store the byte position and hand out that on request, then you can at least do an unchecked conversion.
The byte position is already stored in the Utf8Error, as shown in OP's example code...
oops, was only thinking about one reply back, my bad.
In Rust 1.79 this is now bytes.utf8_chunks().next().map_or("", |chunk| chunk.valid()) or variations thereof.
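As a sketch of that approach, wrapped in a function with the name proposed in this issue (assuming Rust 1.79 or later, where the slice method utf8_chunks is stable). A closure is used to call valid, since it takes the chunk by reference.

```rust
/// Longest valid UTF-8 prefix of `bytes`, with no unsafe code and a
/// single validation pass over the prefix.
fn from_utf8_prefix(bytes: &[u8]) -> &str {
    bytes
        .utf8_chunks()
        .next()
        // The first chunk holds the leading valid text (possibly empty)
        // followed by any invalid bytes; an empty input yields no chunk.
        .map_or("", |chunk| chunk.valid())
}
```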