Add `read_str` method to `JsonReader`'s StringValueReader
Problem solved by the enhancement
The `Read` returned by `JsonReader::next_string_reader` always provides valid UTF-8 bytes (assuming there is no bug in the `JsonReader` implementation). It might therefore be convenient if it were possible to obtain `str` pieces from it directly. Since the data should be valid UTF-8, this could also allow unchecked UTF-8 conversion to further improve performance.
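For illustration, this is roughly what callers have to do today (sketch only; `Cursor` is a stand-in for the reader returned by `JsonReader::next_string_reader`): the bytes must be re-validated even though the reader effectively guarantees they are valid UTF-8.

```rust
use std::io::{Cursor, Read};

// Status-quo sketch: read a piece of a JSON string value and convert it to
// `&str` with a *checked* UTF-8 conversion, paying for validation the
// reader has effectively already performed
fn read_piece(reader: &mut impl Read) -> String {
    let mut buf = [0u8; 64];
    let n = reader.read(&mut buf).unwrap();
    std::str::from_utf8(&buf[..n])
        .expect("reader produced valid UTF-8")
        .to_owned()
}

fn main() {
    // Stand-in for the string value reader (hypothetical setup)
    let mut reader = Cursor::new("valid UTF-8 ✓".as_bytes());
    assert_eq!(read_piece(&mut reader), "valid UTF-8 ✓");
}
```

(Note that this simple sketch also glosses over the case where a single `read` call stops in the middle of a multi-byte char, which the proposed `read_str` would have to handle.)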
Enhancement description
📢 This is only a proposal and not a planned feature yet. If you are interested in this, add a 👍 or feel free to comment (also with any suggestions and feedback).
Add a new `StringValueReader` trait (named similarly to the existing `StringValueWriter` trait), and let `JsonReader::next_string_reader` return it.

Add a `read_str` method to the trait which takes a `&mut [u8]` buffer as input and returns a `&str` backed by it as output. Since `StringValueReader` should already provide valid UTF-8 data through its `read` method, this new `read_str` method could have the following default implementation:
```rust
// TODO doc: Returned str might be shorter than `buf` if EOF has been reached (possibly
// even an empty str), or if the bytes of the last UTF-8 char would not have fit
// completely into `buf`
// TODO: This implementation should probably retry reading in case of `ErrorKind::Interrupted`,
// and document this behavior then, similar to `StringValueWriter::write_str`
fn read_str<'a>(&mut self, buf: &'a mut [u8]) -> Result<&'a str, IoError> {
    let buf_len = buf.len();
    if buf_len < utf8::MAX_BYTES_PER_CHAR {
        panic!("Buffer is too small");
    }
    // Reserve space at end of buf to make sure no incomplete UTF-8 data is read, and
    // all missing continuation bytes of last char (if any) still have enough space
    let fill_buf = &mut buf[..(buf_len - (utf8::MAX_BYTES_PER_CHAR - 1))];
    debug_assert!(!fill_buf.is_empty());
    // Read once to determine if previous `read` left incomplete UTF-8 data in buffer;
    // in that case fail fast
    let mut pos = self.read(fill_buf)?;
    if pos == 0 {
        // Reached EOF, return empty string
        return Ok("");
    } else if utf8::is_continuation(fill_buf[0]) {
        return Err(IoError::new(
            ErrorKind::InvalidData,
            "previous read left incomplete UTF-8 data",
        ));
    }
    while pos < fill_buf.len() {
        let read = self.read(&mut fill_buf[pos..])?;
        if read == 0 {
            break;
        }
        pos += read;
    }
    // Find position where the UTF-8 bytes of the last char start, scanning backwards
    // from the last byte which was read (`buf[pos]` itself has not been written yet)
    let mut last_char_start_pos = pos - 1;
    while utf8::is_continuation(buf[last_char_start_pos]) {
        // This should be overflow safe, because StringValueReader should always provide
        // valid UTF-8 bytes, and due to `is_continuation` check at beginning of method
        last_char_start_pos -= 1;
    }
    let last_char_start_byte = buf[last_char_start_pos];
    let expected_continuation_bytes = if utf8::is_1byte(last_char_start_byte) {
        0
    } else if utf8::is_2byte_start(last_char_start_byte) {
        1
    } else if utf8::is_3byte_start(last_char_start_byte) {
        2
    } else if utf8::is_4byte_start(last_char_start_byte) {
        3
    } else {
        // StringValueReader should always provide valid UTF-8 bytes
        panic!("Unexpected: Malformed UTF-8 bytes in buffer: {buf:02X?}");
    };
    // Subtract the number of continuation bytes which are already in the buffer
    // (the bytes after the leading byte of the last char)
    let missing_bytes =
        usize::checked_sub(expected_continuation_bytes, pos - 1 - last_char_start_pos)
            // StringValueReader should always provide valid UTF-8 bytes, but apparently
            // it provided too many continuation bytes
            .expect("Unexpected: Malformed UTF-8 bytes in buffer");
    // If necessary, read remaining bytes of last char
    if missing_bytes > 0 {
        self.read_exact(&mut buf[pos..(pos + missing_bytes)])?;
        pos += missing_bytes;
    }
    // This should be safe because StringValueReader should only return valid UTF-8 bytes
    // TODO: Maybe this is too risky for custom JsonReader implementations, where it cannot
    // be guaranteed that they are bug free and only return valid UTF-8 data
    let string = utf8::to_str_unchecked(&buf[..pos]);
    Ok(string)
}
```
:warning: This code has not been tested yet!
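The back-scanning over the last char's bytes can be illustrated with only `std` (the `utf8::*` helpers are crate-internal, so the bit tests below are assumed equivalents of their behavior):

```rust
// Assumed equivalent of `utf8::is_continuation`: continuation bytes
// have the bit pattern 10xxxxxx
fn is_continuation(b: u8) -> bool {
    b & 0b1100_0000 == 0b1000_0000
}

/// Returns how many bytes are still missing to complete the last UTF-8 char
/// in `buf` (0 if the buffer already ends on a char boundary)
fn missing_continuation_bytes(buf: &[u8]) -> usize {
    let pos = buf.len();
    // Scan backwards from the last byte to the leading byte of the last char
    let mut start = pos - 1;
    while is_continuation(buf[start]) {
        start -= 1;
    }
    // The leading byte encodes the char length: 0xxxxxxx => 1 byte,
    // 110xxxxx => 2 bytes, 1110xxxx => 3 bytes, 11110xxx => 4 bytes
    let expected = match buf[start].leading_ones() {
        0 => 0,
        n => n as usize - 1,
    };
    expected - (pos - 1 - start)
}

fn main() {
    // "é" is 0xC3 0xA9; a read that stopped after 0xC3 left it incomplete
    assert_eq!(missing_continuation_bytes(&[b'a', 0xC3]), 1);
    // Complete string: nothing is missing
    assert_eq!(missing_continuation_bytes("aé".as_bytes()), 0);
}
```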
It might then also be possible to override `Read::read_to_string` to use `read_str` for the string pieces, avoiding UTF-8 validation overhead. However, this might not be worth it, especially since the default `read_to_string` implementation already seems to be highly optimized.
Alternatives / workarounds
- Don't add any new method and let users handle the UTF-8 conversion themselves
  - at the risk that users would have to re-implement this logic and might have bugs in their implementation
  - requiring users to perform the unchecked UTF-8 conversion in their own code, if desired
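The second point can be sketched as follows (hypothetical setup; `Cursor` stands in for the string value reader). The user performs the unchecked conversion themselves, which is `unsafe` and only sound if the `JsonReader` implementation really yields valid UTF-8:

```rust
use std::io::Read;

// User-side workaround: read all bytes of the string value and convert
// them without validation
fn read_value_unchecked(reader: &mut impl Read) -> String {
    let mut bytes = Vec::new();
    reader.read_to_end(&mut bytes).unwrap();
    // SAFETY: relies entirely on the reader's UTF-8 guarantee; a buggy
    // custom JsonReader implementation would make this undefined behavior
    let s = unsafe { std::str::from_utf8_unchecked(&bytes) };
    s.to_owned()
}

fn main() {
    let mut reader = std::io::Cursor::new("héllo".as_bytes());
    assert_eq!(read_value_unchecked(&mut reader), "héllo");
}
```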