Add `read_str` method to `JsonReader`'s StringValueReader
Problem solved by the enhancement
The `Read` returned by `JsonReader::next_string_reader` always provides valid UTF-8 bytes (assuming there is no bug in the `JsonReader` implementation). It might therefore be convenient if it were possible to obtain `str` pieces from it directly. Since the data should be valid UTF-8, this could also allow unchecked UTF-8 conversion to further improve performance.
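For illustration, this is roughly what callers have to do today (sketch only; `Cursor` is a stand-in for the reader returned by `JsonReader::next_string_reader`): the bytes must be re-validated even though the reader effectively guarantees they are valid UTF-8.

```rust
use std::io::{Cursor, Read};

// Status-quo sketch: read a piece of a JSON string value and convert it to
// `&str` with a *checked* UTF-8 conversion, paying for validation the
// reader has effectively already performed
fn read_piece(reader: &mut impl Read) -> String {
    let mut buf = [0u8; 64];
    let n = reader.read(&mut buf).unwrap();
    std::str::from_utf8(&buf[..n])
        .expect("reader produced valid UTF-8")
        .to_owned()
}

fn main() {
    // Stand-in for the string value reader (hypothetical setup)
    let mut reader = Cursor::new("valid UTF-8 ✓".as_bytes());
    assert_eq!(read_piece(&mut reader), "valid UTF-8 ✓");
}
```

(Note that this simple sketch also glosses over the case where a single `read` call stops in the middle of a multi-byte char, which the proposed `read_str` would have to handle.)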
Enhancement description
📢 This is only a proposal and not a planned feature yet. If you are interested in this, add a 👍 or feel free to comment (also with any suggestions and feedback).
Add a new `StringValueReader` trait (named similarly to the existing `StringValueWriter` trait), and let `JsonReader::next_string_reader` return it.

Add a `read_str` method to the trait which takes a `&mut [u8]` buffer as input and returns a `&str` backed by it as output. Since `StringValueReader` should already provide valid UTF-8 data through its `read` method, this new `read_str` method could have the following default implementation:
```rust
// TODO doc: Returned str might be shorter than `buf` if EOF has been reached (possibly
// even an empty str), or if the bytes of the last UTF-8 char would not have fit
// completely into `buf`
// TODO: This implementation should probably retry reading in case of `ErrorKind::Interrupted`,
// and document this behavior then, similar to `StringValueWriter::write_str`
fn read_str<'a>(&mut self, buf: &'a mut [u8]) -> Result<&'a str, IoError> {
    let buf_len = buf.len();
    if buf_len < utf8::MAX_BYTES_PER_CHAR {
        panic!("Buffer is too small");
    }
    // Reserve space at end of buf to make sure no incomplete UTF-8 data is read, and
    // all missing continuation bytes of last char (if any) still have enough space
    let fill_buf = &mut buf[..(buf_len - (utf8::MAX_BYTES_PER_CHAR - 1))];
    debug_assert!(!fill_buf.is_empty());
    // Read once to determine if previous `read` left incomplete UTF-8 data in buffer;
    // in that case fail fast
    let mut pos = self.read(fill_buf)?;
    if pos == 0 {
        // Reached EOF, return empty string
        return Ok("");
    } else if utf8::is_continuation(fill_buf[0]) {
        return Err(IoError::new(
            ErrorKind::InvalidData,
            "previous read left incomplete UTF-8 data",
        ));
    }
    while pos < fill_buf.len() {
        let read = self.read(&mut fill_buf[pos..])?;
        if read == 0 {
            break;
        }
        pos += read;
    }
    // Find position where the UTF-8 bytes of the last char start, scanning backwards
    // from the last byte which was read (`buf[pos]` itself has not been written yet)
    let mut last_char_start_pos = pos - 1;
    while utf8::is_continuation(buf[last_char_start_pos]) {
        // This should be overflow safe, because StringValueReader should always provide
        // valid UTF-8 bytes, and due to `is_continuation` check at beginning of method
        last_char_start_pos -= 1;
    }
    let last_char_start_byte = buf[last_char_start_pos];
    let expected_continuation_bytes = if utf8::is_1byte(last_char_start_byte) {
        0
    } else if utf8::is_2byte_start(last_char_start_byte) {
        1
    } else if utf8::is_3byte_start(last_char_start_byte) {
        2
    } else if utf8::is_4byte_start(last_char_start_byte) {
        3
    } else {
        // StringValueReader should always provide valid UTF-8 bytes
        panic!("Unexpected: Malformed UTF-8 bytes in buffer: {buf:02X?}");
    };
    // Subtract the number of continuation bytes which are already in the buffer
    // (the bytes after the leading byte of the last char)
    let missing_bytes =
        usize::checked_sub(expected_continuation_bytes, pos - 1 - last_char_start_pos)
            // StringValueReader should always provide valid UTF-8 bytes, but apparently
            // it provided too many continuation bytes
            .expect("Unexpected: Malformed UTF-8 bytes in buffer");
    // If necessary, read remaining bytes of last char
    if missing_bytes > 0 {
        self.read_exact(&mut buf[pos..(pos + missing_bytes)])?;
        pos += missing_bytes;
    }
    // This should be safe because StringValueReader should only return valid UTF-8 bytes
    // TODO: Maybe this is too risky for custom JsonReader implementations, where it cannot
    // be guaranteed that they are bug free and only return valid UTF-8 data
    let string = utf8::to_str_unchecked(&buf[..pos]);
    Ok(string)
}
```
:warning: This code has not been tested yet!
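The back-scanning over the last char's bytes can be illustrated with only `std` (the `utf8::*` helpers are crate-internal, so the bit tests below are assumed equivalents of their behavior):

```rust
// Assumed equivalent of `utf8::is_continuation`: continuation bytes
// have the bit pattern 10xxxxxx
fn is_continuation(b: u8) -> bool {
    b & 0b1100_0000 == 0b1000_0000
}

/// Returns how many bytes are still missing to complete the last UTF-8 char
/// in `buf` (0 if the buffer already ends on a char boundary)
fn missing_continuation_bytes(buf: &[u8]) -> usize {
    let pos = buf.len();
    // Scan backwards from the last byte to the leading byte of the last char
    let mut start = pos - 1;
    while is_continuation(buf[start]) {
        start -= 1;
    }
    // The leading byte encodes the char length: 0xxxxxxx => 1 byte,
    // 110xxxxx => 2 bytes, 1110xxxx => 3 bytes, 11110xxx => 4 bytes
    let expected = match buf[start].leading_ones() {
        0 => 0,
        n => n as usize - 1,
    };
    expected - (pos - 1 - start)
}

fn main() {
    // "é" is 0xC3 0xA9; a read that stopped after 0xC3 left it incomplete
    assert_eq!(missing_continuation_bytes(&[b'a', 0xC3]), 1);
    // Complete string: nothing is missing
    assert_eq!(missing_continuation_bytes("aé".as_bytes()), 0);
}
```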
It might then also be possible to override `Read::read_to_string` to use `read_str` for the string pieces, avoiding UTF-8 validation overhead. However, this might not be worth it, especially since the default `read_to_string` implementation already seems to be highly optimized.
Alternatives / workarounds
- Don't add any new method and let users handle the UTF-8 conversion themselves
  - at the risk that users would have to re-implement this logic and might have bugs in their implementation
  - requiring users to perform the unchecked UTF-8 conversion in their own code, if desired
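The second point can be sketched as follows (hypothetical setup; `Cursor` stands in for the string value reader). The user performs the unchecked conversion themselves, which is `unsafe` and only sound if the `JsonReader` implementation really yields valid UTF-8:

```rust
use std::io::Read;

// User-side workaround: read all bytes of the string value and convert
// them without validation
fn read_value_unchecked(reader: &mut impl Read) -> String {
    let mut bytes = Vec::new();
    reader.read_to_end(&mut bytes).unwrap();
    // SAFETY: relies entirely on the reader's UTF-8 guarantee; a buggy
    // custom JsonReader implementation would make this undefined behavior
    let s = unsafe { std::str::from_utf8_unchecked(&bytes) };
    s.to_owned()
}

fn main() {
    let mut reader = std::io::Cursor::new("héllo".as_bytes());
    assert_eq!(read_value_unchecked(&mut reader), "héllo");
}
```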