std::io::BufReader read partial data even though EOF was not reached

WojciechMula opened this issue 4 months ago · 6 comments

I tried this code:

use std::collections::HashSet;
use std::fs::write;
use std::fs::File;
use std::io::BufReader;
use std::io::Read;

const RECORD_SIZE: usize = 12;

fn test_unbuffered() -> (usize, HashSet<usize>) {
    // 1. create large file
    let record = "x".repeat(RECORD_SIZE);
    let tmp = record.repeat(1024 * 32);

    let path = "tmp.bin";
    write(path, tmp).unwrap();

    // 2. read
    let mut f = File::open(path).unwrap();
    let mut buffer = [0; RECORD_SIZE];

    let mut reads = HashSet::<usize>::new();
    let mut total_read = 0;
    loop {
        let n = f.read(&mut buffer).unwrap();
        total_read += n;
        reads.insert(n);
        if n == 0 {
            return (total_read, reads);
        }
    }
}

fn test_buffered() -> (usize, HashSet<usize>) {
    // 1. create large file
    let record = "x".repeat(RECORD_SIZE);
    let tmp = record.repeat(1024 * 32);

    let path = "tmp.bin";
    write(path, tmp).unwrap();

    // 2. read
    let f = File::open(path).unwrap();
    let mut f = BufReader::new(f);
    let mut buffer = [0; RECORD_SIZE];

    let mut reads = HashSet::<usize>::new();
    let mut total_read = 0;
    loop {
        let n = f.read(&mut buffer).unwrap();
        total_read += n;
        reads.insert(n);
        if n == 0 {
            return (total_read, reads);
        }
    }
}

fn main() {
    println!("test_unbuffered = {:?}", test_unbuffered());
    println!("test_buffered = {:?}", test_buffered());
}

I expected to see this happen:

Regardless of whether buffering is used, I expected each read invocation to return exactly RECORD_SIZE bytes.

Instead, this happened:

When buffering is used, records that cross the internal buffer boundary are not read whole.

rustc --version --verbose:

rustc 1.75.0 (82e1608df 2023-12-21)
binary: rustc
commit-hash: 82e1608dfa6e0b5569232559e3d385fea5a93112
commit-date: 2023-12-21
host: x86_64-unknown-linux-gnu
release: 1.75.0
LLVM version: 17.0.6

WojciechMula avatar Feb 01 '24 09:02 WojciechMula

the read documentation says

It is not an error if the returned value n is smaller than the buffer size, even when the reader is not at the end of the stream yet. This may happen for example because fewer bytes are actually available right now (e. g. being close to end-of-file) or because read() was interrupted by a signal.
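Given that contract, callers that need a full buffer are expected to retry. A minimal sketch of such a retry loop (the helper name `read_full` is hypothetical, not part of std; `io::Cursor` stands in for a file):

```rust
use std::io::{self, Read};

/// Hypothetical helper: keep calling `read` until `buf` is full or EOF
/// is reached, absorbing short reads along the way.
fn read_full<R: Read>(reader: &mut R, buf: &mut [u8]) -> io::Result<usize> {
    let mut filled = 0;
    while filled < buf.len() {
        match reader.read(&mut buf[filled..]) {
            Ok(0) => break, // EOF: return what we got
            Ok(n) => filled += n,
            Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        }
    }
    Ok(filled)
}

fn main() {
    let mut cursor = io::Cursor::new(&b"abcdefgh"[..]); // only 8 bytes
    let mut buf = [0u8; 12];
    let n = read_full(&mut cursor, &mut buf).unwrap();
    println!("{}", n); // prints 8: the final read is short because of EOF
}
```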

the8472 avatar Feb 01 '24 10:02 the8472

This would require some left-shifting refill. That could be costly for large buffers if it happened automatically. So I think it'd have to be a new method like buffer.fill_at_least_n_or_end(n) -> Result<> or something like that.

@shepmaster ran into pretty much the same thing recently.

the8472 avatar Feb 01 '24 17:02 the8472

It would be perfectly fine if this particular behavior were clearly documented.

I think you don't need any left-shifting. The algorithm for handling cross-buffer reads could look like this:

    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        // If we don't have any buffered data and we're doing a massive read
        // (larger than our internal buffer), bypass our internal buffer
        // entirely.
        if self.buf.pos() == self.buf.filled() && buf.len() >= self.capacity() {
            self.discard_buffer();
            return self.inner.read(buf);
        }
        /* current implementation:
        let mut rem = self.fill_buf()?;
        let nread = rem.read(buf)?;
        self.consume(nread);
        Ok(nread)
        */

        let mut total_nread = 0;
        while total_nread < buf.len() {
            let mut rem = self.fill_buf()?;
            let nread = rem.read(&mut buf[total_nread..])?;
            self.consume(nread);
            if nread == 0 {
                break; // EOF: stop instead of looping forever
            }
            total_nread += nread;
        }

        Ok(total_nread)
    }

WojciechMula avatar Feb 01 '24 21:02 WojciechMula

There is a difference between

a) filling a BufReader's buffer with at least n bytes
b) filling the passed buf with at least n bytes

b) can currently be achieved by using reader.take(n).read_to_end(&mut vec), which performs multiple read() calls until it has as many bytes as you need. Combining Take and io::copy into a VecDeque can also work if you need a ring buffer. In all those cases you don't need a BufReader to achieve that.
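A minimal sketch of the take + read_to_end workaround (using an in-memory `io::Cursor` in place of a real stream):

```rust
use std::io::{Cursor, Read};

fn main() {
    let data = b"hello world, this is a stream";
    let mut reader = Cursor::new(&data[..]);

    // Read at most 12 bytes. read_to_end loops internally, so short
    // reads from the underlying reader are absorbed; it only stops at
    // the Take limit or EOF.
    let mut chunk = Vec::new();
    let n = reader.by_ref().take(12).read_to_end(&mut chunk).unwrap();
    println!("{} {:?}", n, String::from_utf8_lossy(&chunk));
    // prints: 12 "hello world,"
}
```

`by_ref()` keeps the original reader usable afterwards, so the next `take(12)` continues where the previous one stopped.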

We can't implement this in read() because BufReader may wrap a TcpStream that has some data available, but not enough to fill the buffer. In that case a single call to fill_buf can return bytes, but multiple calls would block the connection. This can lead to a logical deadlock where the other side of the connection is waiting for a response to the data it has already sent. In other words, we should avoid turning a single read call into multiple read calls unless the caller explicitly requests a minimum read length. A read call does not imply such intent; it explicitly allows short reads, which some callers may rely on.

Essentially read is a fairly thin wrapper around the underlying read(2) syscall which has the same properties.

To achieve a) we would need a new API which does the left-shifting inside BufReader's API.

Another option: if you know for certain that those bytes will be available and that an EOF will not be encountered, then you can also use read_exact. The issue with that method is that it may drop a partial read on the floor when an EOF is encountered.
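A small sketch of that caveat: when the source ends early, read_exact returns UnexpectedEof and the partially read bytes cannot be relied on (the buffer contents are unspecified on error).

```rust
use std::io::{Cursor, Read};

fn main() {
    let mut reader = Cursor::new(&b"abc"[..]); // only 3 bytes available
    let mut buf = [0u8; 4];

    // read_exact demands all 4 bytes; hitting EOF first is an error,
    // and the 3 bytes that were read are effectively lost to the caller.
    let err = reader.read_exact(&mut buf).unwrap_err();
    println!("{:?}", err.kind()); // prints UnexpectedEof
}
```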

the8472 avatar Feb 01 '24 22:02 the8472

@the8472 thanks a lot for the explanation, I wasn't aware of the TCP connection case. Extending BufReader would be nice, although there are the workarounds you proposed. In my particular case the solution is simple -- set the inner buffer size to a multiple of the record size using the with_capacity constructor.
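That workaround can be sketched as follows (using an in-memory `Cursor` instead of a file; note it relies on the inner reader itself not returning short reads, which, as noted below, is not guaranteed for every file system):

```rust
use std::io::{BufReader, Cursor, Read};

const RECORD_SIZE: usize = 12;

fn main() {
    // 32 records of RECORD_SIZE bytes each.
    let data = vec![b'x'; RECORD_SIZE * 32];

    // The capacity is a multiple of RECORD_SIZE, so the internal buffer
    // never ends in the middle of a record and each read below returns
    // a whole record.
    let mut f = BufReader::with_capacity(RECORD_SIZE * 8, Cursor::new(data));
    let mut buf = [0u8; RECORD_SIZE];
    loop {
        let n = f.read(&mut buf).unwrap();
        if n == 0 {
            break;
        }
        assert_eq!(n, RECORD_SIZE);
    }
    println!("all reads returned whole records");
}
```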

WojciechMula avatar Feb 02 '24 12:02 WojciechMula

Note that even for regular files it is not guaranteed that you won't get short reads. As I understand it many filesystems will make an effort to fulfill the entire request but some (e.g. network filesystems, FUSE drivers) may not. It might also depend on the platform.

the8472 avatar Feb 02 '24 12:02 the8472