[C++] Peeking BufferedInputStream past buffered bytes puts the stream in invalid state
Describe the bug, including details regarding any error messages, version, and platform.
Minimal failing unit test:
diff --git a/cpp/src/arrow/io/buffered_test.cc b/cpp/src/arrow/io/buffered_test.cc
index cbf2c2cf0..98cabdaf1 100644
--- a/cpp/src/arrow/io/buffered_test.cc
+++ b/cpp/src/arrow/io/buffered_test.cc
@@ -474,6 +474,13 @@ TEST_F(TestBufferedInputStream, SetBufferSize) {
ASSERT_OK(buffered_->SetBufferSize(5));
}
+TEST_F(TestBufferedInputStream, PeekPastBufferedBytesTwice) {
+ MakeExample1(/*buffer_size=*/10, default_memory_pool(), /*raw_read_bound=*/15);
+ ASSERT_OK(buffered_->Read(9));
+ ASSERT_OK(buffered_->Peek(6));
+ ASSERT_OK(buffered_->Peek(6)); // fails with 'Invalid: Cannot shrink read buffer if buffered data remains'
+}
+
// GH-43060: Internal buffer should not greater than the
// bytes could buffer.
TEST_F(TestBufferedInputStream, BufferSizeLimit) {
It seems that bug was introduced in #43064. Buffer is resized to at most remaining number of bytes regardless of current buffer_pos_, leaving space only for unread data, even though already read data is still in the buffer. I think it should look like this instead:
diff --git a/cpp/src/arrow/io/buffered.cc b/cpp/src/arrow/io/buffered.cc
index b41e4257a..b88cc4819 100644
--- a/cpp/src/arrow/io/buffered.cc
+++ b/cpp/src/arrow/io/buffered.cc
@@ -294,7 +294,7 @@ class BufferedInputStream::Impl : public BufferedBase {
if (raw_read_bound_ >= 0) {
// No need to reserve space for more than the total remaining number of bytes.
new_buffer_size = std::min(new_buffer_size,
- bytes_buffered_ + (raw_read_bound_ - raw_read_total_));
+ buffer_pos_ + bytes_buffered_ + (raw_read_bound_ - raw_read_total_));
}
return ResizeBuffer(new_buffer_size);
}
Component(s)
C++
Sorry for late replying, would you mind fix or let me fix this? @biljazovic
The problem is Peek and Read both calls SetBufferSize, however:
Readimplicit says that, whenSetBufferSizeor read, the previous buffer is not being required. In this scenerio,bytes_buffered_is always 0, sinceReadwill consume all data. Andnew_buffer_size == std::min(new_buffer_size, (raw_read_bound_ - raw_read_total_))Peekstill requires the buffer-size here. So,bytes_buffered_might not 0. WhenPeek, thenew_buffer_sizewould less than expected, which shrinks the buffer
🤔 I'd like to try to fix, thanks so much for reporting!
No problem, thanks for taking care of it
Issue resolved by pull request 44387 https://github.com/apache/arrow/pull/44387