bstr
Support UTF8 sequence length
I think it would be good for bstr to expose a free function that gives some decoding-specific information about what a given byte means in the context of utf8.
Accessing this info is low level, but has various use cases -- some examples include finding a place to start parsing from given an index, or finding a legal cutoff position if you need to truncate a buffer. (Let me know if you want more cases, I feel like I run into it a fair bit when working with partially invalid utf8).
Specifically, something like this:
// If `b` is the leading byte of a utf8 sequence,
// returns `Some(sequence_len)`. Returns `None` for all other cases.
pub fn utf8_sequence_len(b: u8) -> Option<usize>;
Or... Maybe. I'd kinda like to distinguish between valid-but-not-leading and always-invalid bytes. Returning an enum maybe? Thoughts and bikeshedding welcome. I think in practice this would be useful, but I also wanted to keep things small and simple.
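To make the enum idea concrete, here is a rough sketch of what I have in mind. The names `Utf8ByteKind` and `classify_utf8_byte` are placeholders made up for this comment, not anything bstr has today. The byte ranges follow the well-formed UTF-8 table: `0xC0`, `0xC1` and `0xF5..=0xFF` can never appear in valid UTF-8, while `0x80..=0xBF` are only valid as continuation bytes.

```rust
/// Hypothetical classification of a single byte (names are illustrative only).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Utf8ByteKind {
    /// Starts a sequence with the given total length in bytes (1 to 4).
    Leading(usize),
    /// 0x80..=0xBF: valid, but only after a leading byte.
    Continuation,
    /// 0xC0, 0xC1, 0xF5..=0xFF: never appears in well-formed UTF-8.
    Invalid,
}

/// Classifies `b` by looking at this one byte only.
pub fn classify_utf8_byte(b: u8) -> Utf8ByteKind {
    match b {
        0x00..=0x7F => Utf8ByteKind::Leading(1),
        0x80..=0xBF => Utf8ByteKind::Continuation,
        0xC2..=0xDF => Utf8ByteKind::Leading(2),
        0xE0..=0xEF => Utf8ByteKind::Leading(3),
        0xF0..=0xF4 => Utf8ByteKind::Leading(4),
        // 0xC0, 0xC1 and 0xF5..=0xFF fall through to here.
        _ => Utf8ByteKind::Invalid,
    }
}

/// The simpler `Option` form from above, expressed in terms of the enum.
pub fn utf8_sequence_len(b: u8) -> Option<usize> {
    match classify_utf8_byte(b) {
        Utf8ByteKind::Leading(n) => Some(n),
        _ => None,
    }
}
```

With this shape the `Option` version stays available for callers who only care about leading bytes, and the enum is there for those who need the extra distinction.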
That said, I do feel strongly that this should not be methods on byteslice like ByteSlice::is_char_boundary(&self, index: usize) -> bool and ByteSlice::utf8_sequence_len(&self, index: usize) -> Option<usize> (mentioning mostly because I suggested these in #42) -- I think those two would be very confusing in practice:
- ByteSlice::is_char_boundary would have to return different results from str::is_char_boundary even for a fully utf8 byte slice (example: index == len). Having the caller get the byte in question avoids this issue. (Renaming it doesn't even really solve this problem -- still seems like it could cause confusion if 0/len are not considered boundaries.)
- ByteSlice::utf8_sequence_len(&self, idx) could behave too many ways -- specifically IDK if it only reads self[idx] or if it considers other bytes nearby (e.g. if it's not a leading byte). Making it a top level function only taking a u8 removes this ambiguity -- reasonably only one thing it could do.
This seems reasonable to me. Although I think this does come with a pretty big caveat: in the context of bstr, the value returned by this function is merely a hint. There is of course no guarantee that there are actually that number of bytes following b in the original slice (unlike in the case of &str).
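To illustrate the caveat, here is a small sketch. It assumes a `utf8_sequence_len` with the proposed signature (re-implemented locally here, since it doesn't exist in bstr yet), and `char_at` is a made-up helper. The returned length only tells you how many bytes to look at; the caller still has to check that those bytes exist and actually form a valid sequence, which this sketch delegates to `std::str::from_utf8`.

```rust
/// Local stand-in for the proposed function (an assumption, not a bstr API).
fn utf8_sequence_len(b: u8) -> Option<usize> {
    match b {
        0x00..=0x7F => Some(1),
        0xC2..=0xDF => Some(2),
        0xE0..=0xEF => Some(3),
        0xF0..=0xF4 => Some(4),
        _ => None,
    }
}

/// Hypothetical helper: decodes the char starting at `i`, treating the
/// sequence length purely as a hint that must be validated.
fn char_at(bytes: &[u8], i: usize) -> Option<char> {
    // `None` if `i` is out of bounds or `bytes[i]` is not a leading byte.
    let len = utf8_sequence_len(*bytes.get(i)?)?;
    // The hint may point past the end of the slice, so use a checked slice.
    let seq = bytes.get(i..i + len)?;
    // Even if enough bytes exist, they may not be valid continuation bytes.
    std::str::from_utf8(seq).ok()?.chars().next()
}
```

For example, `char_at(b"a\xE2\x82", 1)` gets a hint of 3 from the `0xE2` byte, but only two bytes remain, so it returns `None`.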
I think it would help a lot if the docs for this contained a condensed example derived from a real use case, in order to help folks understand when they might want to use this.
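As a starting point for such an example, here is a sketch of the "legal cutoff when truncating" case mentioned in the issue. Everything here is illustrative: `utf8_sequence_len` is a local stand-in for the proposed function, and `truncate_at_boundary` is a made-up name. The idea is to look back at most three bytes from the cutoff for a leading byte whose sequence would straddle it, and cut before that byte if one is found.

```rust
/// Local stand-in for the proposed function (an assumption, not a bstr API).
fn utf8_sequence_len(b: u8) -> Option<usize> {
    match b {
        0x00..=0x7F => Some(1),
        0xC2..=0xDF => Some(2),
        0xE0..=0xEF => Some(3),
        0xF0..=0xF4 => Some(4),
        _ => None,
    }
}

/// Hypothetical helper: returns a prefix of `buf` that is at most `max` bytes
/// long and does not end in the middle of a multi-byte sequence.
fn truncate_at_boundary(buf: &[u8], max: usize) -> &[u8] {
    if max >= buf.len() {
        return buf;
    }
    // A sequence is at most 4 bytes, so a leading byte whose sequence
    // crosses `max` must be within the 3 bytes before it.
    let lo = max.saturating_sub(3);
    for start in (lo..max).rev() {
        match utf8_sequence_len(buf[start]) {
            // Leading byte whose (hinted) sequence extends past the cutoff.
            Some(len) if start + len > max => return &buf[..start],
            // Leading byte whose sequence ends at or before the cutoff.
            Some(_) => break,
            // Continuation byte: keep looking for its leading byte.
            None if (0x80..=0xBF).contains(&buf[start]) => continue,
            // Always-invalid byte: nothing to protect, cut at `max`.
            None => break,
        }
    }
    &buf[..max]
}
```

Given `b"a\xE2\x82\xACb"` (an `a`, a euro sign, a `b`), cutting at 2 or 3 yields just `a`, while cutting at 4 keeps the whole euro sign. Since the length is only a hint, a stray leading byte near the cutoff may get dropped even though it never started a valid sequence, which seems acceptable for truncation.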