std::io::BufReader read partial data even though EOF was not reached
I tried this code:
```rust
use std::collections::HashSet;
use std::fs::write;
use std::fs::File;
use std::io::BufReader;
use std::io::Read;

const RECORD_SIZE: usize = 12;

fn test_unbuffered() -> (usize, HashSet<usize>) {
    // 1. create large file
    let record = "x".repeat(RECORD_SIZE);
    let tmp = record.repeat(1024 * 32);
    let path = "tmp.bin";
    write(path, tmp).unwrap();
    // 2. read
    let mut f = File::open(path).unwrap();
    let mut buffer = [0; RECORD_SIZE];
    let mut reads = HashSet::<usize>::new();
    let mut total_read = 0;
    loop {
        let n = f.read(&mut buffer).unwrap();
        total_read += n;
        reads.insert(n);
        if n == 0 {
            return (total_read, reads);
        }
    }
}

fn test_buffered() -> (usize, HashSet<usize>) {
    // 1. create large file
    let record = "x".repeat(RECORD_SIZE);
    let tmp = record.repeat(1024 * 32);
    let path = "tmp.bin";
    write(path, tmp).unwrap();
    // 2. read
    let mut f = File::open(path).unwrap();
    let mut f = BufReader::new(f);
    let mut buffer = [0; RECORD_SIZE];
    let mut reads = HashSet::<usize>::new();
    let mut total_read = 0;
    loop {
        let n = f.read(&mut buffer).unwrap();
        total_read += n;
        reads.insert(n);
        if n == 0 {
            return (total_read, reads);
        }
    }
}

fn main() {
    println!("test_unbuffered = {:?}", test_unbuffered());
    println!("test_buffered = {:?}", test_buffered());
}
```
I expected to see this happen:

Regardless of whether buffering is used, I expected each `read` invocation to return a chunk of exactly RECORD_SIZE bytes.

Instead, this happened:

When buffering is used, records that cross the internal buffer boundary are not read whole.
`rustc --version --verbose`:

```
rustc 1.75.0 (82e1608df 2023-12-21)
binary: rustc
commit-hash: 82e1608dfa6e0b5569232559e3d385fea5a93112
commit-date: 2023-12-21
host: x86_64-unknown-linux-gnu
release: 1.75.0
LLVM version: 17.0.6
```
The `read` documentation says:

> It is not an error if the returned value `n` is smaller than the buffer size, even when the reader is not at the end of the stream yet. This may happen for example because fewer bytes are actually available right now (e. g. being close to end-of-file) or because `read()` was interrupted by a signal.
This would require some left-shifting refill, which could be costly for large buffers if it happened automatically. So I think it'd have to be a new method like `buffer.fill_at_least_n_or_end(n) -> Result<>` or something like that.
@shepmaster ran into pretty much the same thing recently.
It would be perfectly fine if this particular behavior was clearly documented.
I think you don't need any left-shifting. The algorithm for handling cross-buffer reads could look like this:

```rust
fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
    // If we don't have any buffered data and we're doing a massive read
    // (larger than our internal buffer), bypass our internal buffer
    // entirely.
    if self.buf.pos() == self.buf.filled() && buf.len() >= self.capacity() {
        self.discard_buffer();
        return self.inner.read(buf);
    }
    /* current
    let mut rem = self.fill_buf()?;
    let nread = rem.read(buf)?;
    self.consume(nread);
    Ok(nread)
    */
    let mut total_nread = 0;
    while total_nread < buf.len() {
        let mut rem = self.fill_buf()?;
        let nread = rem.read(&mut buf[total_nread..])?;
        self.consume(nread);
        if nread == 0 {
            break; // EOF: stop instead of looping forever
        }
        total_nread += nread;
    }
    Ok(total_nread)
}
```
There is a difference between

a) filling a `BufReader`'s buffer with at least n bytes, and
b) filling the passed `buf` with at least n bytes.

b) can currently be achieved by using `reader.take(n).read_to_end(&mut vec)`, or by performing multiple `read()` calls into an on-stack buffer until it has as many bytes as you need. Combining `Take` and `io::copy` into a `VecDeque` can also work if you need a ring buffer. In all those cases you wouldn't need a `BufReader` to achieve that.
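For illustration, a minimal sketch of workaround b) using `Take` plus `read_to_end` (a byte slice stands in for the underlying reader; the data and the cap of 12 bytes are arbitrary):

```rust
use std::io::Read;

fn main() {
    // A byte slice stands in for any reader (file, socket, ...).
    let mut reader: &[u8] = b"hello world, this is a stream";

    // Fill `vec` with up to 12 bytes: `take` caps the read, and
    // `read_to_end` loops over short reads until the cap or EOF is hit.
    let mut vec = Vec::new();
    reader.by_ref().take(12).read_to_end(&mut vec).unwrap();
    assert_eq!(vec, b"hello world,");

    // A result shorter than the cap can only mean EOF was reached.
    let mut rest = Vec::new();
    reader.by_ref().take(100).read_to_end(&mut rest).unwrap();
    assert_eq!(rest.len(), 17);
}
```

`by_ref` keeps ownership of the reader so it can be used again after the capped read.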
We can't implement this in `read()` because `BufReader` may wrap a `TcpStream` that has some data available, but not enough to fill the buffer. In that situation a single call to `fill_buf` can return bytes, but multiple calls would block the connection. This can lead to a logical deadlock if the other side of the connection is waiting for a response to data it has already sent.

In other words, we should avoid turning a single `read` call into multiple `read` calls unless the caller explicitly requests a minimum read length. A `read` call does not imply such intent; it explicitly allows short reads, which some callers may rely on. Essentially, `read` is a fairly thin wrapper around the underlying `read(2)` syscall, which has the same properties.
To achieve a) we would need a new `BufReader` API which does the left-shifting internally.
Another option: if you know for certain that those bytes should be available and that an EOF will not be encountered, then you can also use `read_exact`. The issue with that method is that it may drop a partial read on the floor when an EOF is encountered.
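A small sketch of that caveat: on a premature EOF, `read_exact` fails with `UnexpectedEof`, and per its documentation the buffer contents are then unspecified, so the bytes that were available are lost to the caller:

```rust
use std::io::{ErrorKind, Read};

fn main() {
    // Only 5 bytes are available, but we ask for 8.
    let mut reader: &[u8] = b"short";
    let mut buf = [0u8; 8];

    // read_exact errors out instead of returning the 5 bytes it could get;
    // on error, the contents of `buf` are unspecified.
    let err = reader.read_exact(&mut buf).unwrap_err();
    assert_eq!(err.kind(), ErrorKind::UnexpectedEof);
}
```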
@the8472 thanks a lot for the explanation, I wasn't aware of the TCP connection case. Extending `BufReader` would be nice, although the workarounds you proposed exist. In my particular case the solution is simple -- set the internal buffer size to a multiple of the record size using the `with_capacity` constructor.
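That fix can be sketched as follows (a byte slice stands in for the file; the 8-record capacity and 32-record data size are arbitrary). It relies on the inner reader filling the whole internal buffer on each refill, so a refill boundary never splits a record:

```rust
use std::collections::HashSet;
use std::io::{BufReader, Read};

const RECORD_SIZE: usize = 12;

fn main() {
    // 32 records of 'x'; a byte slice stands in for the file.
    let data = vec![b'x'; RECORD_SIZE * 32];

    // Internal buffer sized to a whole number of records: no record
    // straddles a refill boundary, so every read before EOF is full-sized.
    let mut f = BufReader::with_capacity(RECORD_SIZE * 8, &data[..]);

    let mut buffer = [0u8; RECORD_SIZE];
    let mut reads = HashSet::new();
    loop {
        let n = f.read(&mut buffer).unwrap();
        reads.insert(n);
        if n == 0 {
            break;
        }
    }
    // Only full records and the terminating 0 are observed.
    assert_eq!(reads, HashSet::from([RECORD_SIZE, 0]));
}
```

With the default 8 KiB capacity (8192 is not a multiple of 12), the same loop also observes short reads at each refill boundary.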
Note that even for regular files it is not guaranteed that you won't get short reads. As I understand it, many filesystems will make an effort to fulfill the entire request, but some (e.g. network filesystems, FUSE drivers) may not. It might also depend on the platform.