bstr Support UTF8 sequence length

Open thomcc opened this issue 5 years ago • 1 comments

I think it would be good to expose a free function from bstr that exposes some decoding-specific information about what a given byte means in the context of utf8.

Accessing this info is low level, but it has various use cases: finding a place to start parsing from given an index, finding a legal cutoff position when you need to truncate a buffer, and so on. (Let me know if you want more cases; I feel like I run into this a fair bit when working with partially invalid utf8.)

Specifically, something like this:

// If `b` is the leading byte of a utf8 sequence,
// returns `Some(sequence_len)`. Returns `None` for all other cases.
pub fn utf8_sequence_len(b: u8) -> Option<usize>;
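
For illustration, here is a minimal sketch of how such a function might be written. It is not bstr's implementation, and it rests on one assumption: only bytes that can begin a well-formed sequence (ASCII, 0xC2..=0xDF, 0xE0..=0xEF, 0xF0..=0xF4) count as leading bytes.

```rust
/// Hypothetical sketch of the proposed function (not part of bstr's API).
/// Returns the length a UTF-8 sequence would have if it started with `b`,
/// or `None` if `b` cannot start a well-formed sequence.
pub fn utf8_sequence_len(b: u8) -> Option<usize> {
    match b {
        0x00..=0x7F => Some(1), // ASCII
        0xC2..=0xDF => Some(2), // 0xC0 and 0xC1 would only encode overlong forms
        0xE0..=0xEF => Some(3),
        0xF0..=0xF4 => Some(4), // 0xF5 and above would exceed U+10FFFF
        _ => None,              // continuation bytes and never-valid bytes
    }
}
```

Because it looks at a single byte, the function is a pure table lookup and cannot say anything about the bytes that follow.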

Or... Maybe. I'd kinda like to distinguish between valid-but-not-leading and always-invalid bytes. Returning an enum maybe? Thoughts and bikeshedding welcome. I think this would be useful in practice, but I also want to keep things small and simple.
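
A rough sketch of what the enum-returning variant could look like follows. The names `Utf8ByteKind` and `classify_utf8_byte` are made up here purely for illustration; nothing in this thread settles on them.

```rust
/// Hypothetical classification of a single byte with respect to UTF-8.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Utf8ByteKind {
    /// The byte can start a sequence of the given length (1 to 4).
    Leading(usize),
    /// The byte is valid only in the middle or at the end of a sequence.
    Continuation,
    /// The byte never appears in well-formed UTF-8.
    Invalid,
}

/// Hypothetical free function; the name is an assumption, not a bstr API.
pub fn classify_utf8_byte(b: u8) -> Utf8ByteKind {
    match b {
        0x00..=0x7F => Utf8ByteKind::Leading(1),
        0x80..=0xBF => Utf8ByteKind::Continuation,
        0xC2..=0xDF => Utf8ByteKind::Leading(2),
        0xE0..=0xEF => Utf8ByteKind::Leading(3),
        0xF0..=0xF4 => Utf8ByteKind::Leading(4),
        // 0xC0, 0xC1 and 0xF5..=0xFF
        _ => Utf8ByteKind::Invalid,
    }
}
```

The trade-off is a slightly larger API surface in exchange for letting callers tell "keep scanning backwards" (continuation) apart from "this byte is garbage" (invalid).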

That said, I do feel strongly that these should not be methods on ByteSlice like ByteSlice::is_char_boundary(&self, index: usize) -> bool and ByteSlice::utf8_sequence_len(&self, index: usize) -> Option<usize> (mentioning mostly because I suggested these in #42) -- I think those two would be very confusing in practice:

ByteSlice::is_char_boundary would have to return different results from str::is_char_boundary even for a fully valid utf8 byte slice (example: index == len). Having the caller get the byte in question avoids this issue. (Renaming it doesn't even really solve this problem -- it still seems like it could cause confusion if 0/len are not considered boundaries).
ByteSlice::utf8_sequence_len(&self, idx) could behave too many ways -- specifically IDK if it only reads self[idx] or if it considers other bytes nearby (e.g. if it's not a leading byte). Making it a top level function only taking a u8 removes this ambiguity -- there is reasonably only one thing it could do.
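
To make the index == len point concrete, here is a small sketch. `starts_sequence_at` is a hypothetical helper, not a bstr method, and it uses the simple "not a continuation byte" test as a stand-in for a boundary check.

```rust
/// Hypothetical helper: the caller-visible `Option` makes it explicit that
/// there is no byte to inspect at `index == bytes.len()`.
/// `Some(true)` means the byte at `index` is not a continuation byte.
pub fn starts_sequence_at(bytes: &[u8], index: usize) -> Option<bool> {
    bytes.get(index).map(|&b| (b & 0xC0) != 0x80)
}

/// Compares the std answer with the byte-based answer for the same index.
pub fn compare(s: &str, index: usize) -> (bool, Option<bool>) {
    (s.is_char_boundary(index), starts_sequence_at(s.as_bytes(), index))
}
```

For the string "a\u{e9}" (three bytes), `str::is_char_boundary(3)` is true, while the byte-based helper has no byte to look at and yields `None`; a method on a byte slice would have to pick one of those answers.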

Mar 10 '20 09:03 thomcc

This seems reasonable to me. Although I think this does come with a pretty big caveat: in the context of bstr, the value returned by this function is merely a hint. There is of course no guarantee that there are actually that number of bytes following b in the original slice (unlike in the case of &str).
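
A short sketch of that caveat, assuming the hypothetical `utf8_sequence_len` from the proposal above: the per-byte answer has to be checked against the slice before it can be trusted. `checked_sequence_len` is an illustrative name only.

```rust
/// Hypothetical sketch of the proposed function (not part of bstr's API).
pub fn utf8_sequence_len(b: u8) -> Option<usize> {
    match b {
        0x00..=0x7F => Some(1),
        0xC2..=0xDF => Some(2),
        0xE0..=0xEF => Some(3),
        0xF0..=0xF4 => Some(4),
        _ => None,
    }
}

/// Illustrative helper: returns the hinted length only if the slice really
/// contains that many bytes at `index` and the trailing ones are
/// continuation bytes. This is still not full validation (it does not
/// reject surrogates or overlong three- and four-byte forms).
pub fn checked_sequence_len(bytes: &[u8], index: usize) -> Option<usize> {
    let len = utf8_sequence_len(*bytes.get(index)?)?;
    let tail = bytes.get(index + 1..index + len)?;
    if tail.iter().all(|&b| (b & 0xC0) == 0x80) {
        Some(len)
    } else {
        None
    }
}
```

With the truncated input `b"\xE2\x82"`, the hint for the first byte is still `Some(3)`, but the checked version returns `None` because only two bytes are present.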

I think it would help a lot if the docs for this contained a condensed example derived from a real use case, in order to help folks understand when they might want to use this.
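
As one possible shape for such an example (a sketch based on the truncation use case mentioned in the issue, not taken from real code), the following finds a cutoff at or before `max` that does not split a sequence. Both function names are hypothetical.

```rust
/// Hypothetical sketch of the proposed function (not part of bstr's API).
pub fn utf8_sequence_len(b: u8) -> Option<usize> {
    match b {
        0x00..=0x7F => Some(1),
        0xC2..=0xDF => Some(2),
        0xE0..=0xEF => Some(3),
        0xF0..=0xF4 => Some(4),
        _ => None,
    }
}

/// Returns a cutoff `<= max` that does not fall inside a sequence whose
/// leading byte sits within the last three bytes before `max`.
/// Invalid data is left alone: the cutoff is simply `max` in that case.
pub fn truncation_point(bytes: &[u8], max: usize) -> usize {
    if max >= bytes.len() {
        return bytes.len();
    }
    // A sequence is at most 4 bytes, so looking back 3 bytes is enough.
    let lower = max.saturating_sub(3);
    for i in (lower..max).rev() {
        let b = bytes[i];
        if let Some(len) = utf8_sequence_len(b) {
            // Cut before the leading byte if its sequence would cross `max`.
            return if i + len > max { i } else { max };
        }
        if (b & 0xC0) != 0x80 {
            break; // never-valid byte: nothing sensible to back up to
        }
    }
    max
}
```

For `b"a\xE2\x82\xACb"` (the letter a, the euro sign, the letter b), asking for at most 2 or 3 bytes yields a cutoff of 1, while asking for 4 keeps the whole euro sign.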

Mar 10 '20 10:03 BurntSushi
