arrow icon indicating copy to clipboard operation
arrow copied to clipboard

[C++] Peeking BufferedInputStream past buffered bytes puts the stream in invalid state

Open biljazovic opened this issue 1 year ago • 3 comments

Describe the bug, including details regarding any error messages, version, and platform.

Minimal failing unit test:

diff --git a/cpp/src/arrow/io/buffered_test.cc b/cpp/src/arrow/io/buffered_test.cc
index cbf2c2cf0..98cabdaf1 100644
--- a/cpp/src/arrow/io/buffered_test.cc
+++ b/cpp/src/arrow/io/buffered_test.cc
@@ -474,6 +474,13 @@ TEST_F(TestBufferedInputStream, SetBufferSize) {
   ASSERT_OK(buffered_->SetBufferSize(5));
 }
 
+TEST_F(TestBufferedInputStream, PeekPastBufferedBytesTwice) {
+  MakeExample1(/*buffer_size=*/10, default_memory_pool(), /*raw_read_bound=*/15);
+  ASSERT_OK(buffered_->Read(9));
+  ASSERT_OK(buffered_->Peek(6));
+  ASSERT_OK(buffered_->Peek(6)); // fails with 'Invalid: Cannot shrink read buffer if buffered data remains'
+}
+
 // GH-43060: Internal buffer should not greater than the
 // bytes could buffer.
 TEST_F(TestBufferedInputStream, BufferSizeLimit) {

It seems that bug was introduced in #43064. Buffer is resized to at most remaining number of bytes regardless of current buffer_pos_, leaving space only for unread data, even though already read data is still in the buffer. I think it should look like this instead:

diff --git a/cpp/src/arrow/io/buffered.cc b/cpp/src/arrow/io/buffered.cc
index b41e4257a..b88cc4819 100644
--- a/cpp/src/arrow/io/buffered.cc
+++ b/cpp/src/arrow/io/buffered.cc
@@ -294,7 +294,7 @@ class BufferedInputStream::Impl : public BufferedBase {
     if (raw_read_bound_ >= 0) {
       // No need to reserve space for more than the total remaining number of bytes.
       new_buffer_size = std::min(new_buffer_size,
-                                 bytes_buffered_ + (raw_read_bound_ - raw_read_total_));
+                                 buffer_pos_ + bytes_buffered_ + (raw_read_bound_ - raw_read_total_));
     }
     return ResizeBuffer(new_buffer_size);
   }

Component(s)

C++

biljazovic avatar Sep 04 '24 12:09 biljazovic

Sorry for late replying, would you mind fix or let me fix this? @biljazovic

mapleFU avatar Oct 12 '24 08:10 mapleFU

The problem is Peek and Read both calls SetBufferSize, however:

  1. Read implicit says that, when SetBufferSize or read, the previous buffer is not being required. In this scenerio, bytes_buffered_ is always 0, since Read will consume all data. And new_buffer_size == std::min(new_buffer_size, (raw_read_bound_ - raw_read_total_))
  2. Peek still requires the buffer-size here. So, bytes_buffered_ might not 0. When Peek, the new_buffer_size would less than expected, which shrinks the buffer

🤔 I'd like to try to fix, thanks so much for reporting!

mapleFU avatar Oct 12 '24 09:10 mapleFU

No problem, thanks for taking care of it

biljazovic avatar Oct 14 '24 07:10 biljazovic

Issue resolved by pull request 44387 https://github.com/apache/arrow/pull/44387

pitrou avatar Oct 24 '24 12:10 pitrou