
[bug] `to_arrow_batch_reader` does not respect the given limit, returning more records than specified

Open kevinjqliu opened this issue 1 year ago • 3 comments

Apache Iceberg version

main (development)

Please describe the bug 🐞

`to_arrow_batch_reader` bug

The bug is in `project_batches`, specifically in the way `yield` interacts with the two for-loops.

Here’s a Jupyter notebook reproducing the issue; see the last cell and compare the number of rows read by `to_arrow` vs `to_arrow_batch_reader`.

This only occurs when there is more than one data file, a case not covered by the existing tests.

For more details, see https://github.com/apache/iceberg-python/issues/1032#issuecomment-2282819711
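
A minimal reproduction sketch, assuming a configured catalog and a table backed by two data files (the catalog and table names below are hypothetical):

```python
from pyiceberg.catalog import load_catalog

# Hypothetical catalog and table; the table must span two data files.
catalog = load_catalog("default")
tbl = catalog.load_table("default.two_file_table")

# to_arrow respects the limit: exactly 1 row comes back.
print(len(tbl.scan(limit=1).to_arrow()))  # 1

# to_arrow_batch_reader over-reads: one sliced batch per data file.
reader = tbl.scan(limit=1).to_arrow_batch_reader()
print(sum(len(batch) for batch in reader))  # 2
```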

kevinjqliu avatar Aug 11 '24 16:08 kevinjqliu

Hi @kevinjqliu, thank you for raising this issue. If I understand it correctly, the bug occurs because we are resetting the limit counter when the specified limit is larger than the size of a single data file. Is that correct?

sungwy avatar Aug 11 '24 23:08 sungwy

The code works correctly for 1 data file with a given limit.

The bug appears when there are 2 data files, which means 2 `FileScanTask`s. For example, given an Iceberg table with 2 data files of 100MB each, let's read with a limit of 1: `tbl.scan(limit=1).to_arrow_batch_reader()`.

In this scenario, here's the order of events (see the sketch after the list):

  • The first task produces batches.
  • The first batch is processed, and the limit is satisfied.
  • A slice of the batch is yielded.
  • The `break` statement exits only the inner loop (the `for batch in batches` loop).
  • The next task is then processed, which goes through the above steps again and yields another slice.
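
A simplified model of that pattern (a sketch, not the actual `project_batches` source; `read_batches` is a stand-in for reading one data file):

```python
import pyarrow as pa

def read_batches(task):
    # Stand-in for reading one data file: each "task" yields 100 rows here.
    yield pa.RecordBatch.from_pydict({"id": list(range(task * 100, task * 100 + 100))})

def project_batches(tasks, limit):
    for task in tasks:                  # one FileScanTask per data file
        current_limit = limit           # the counter restarts for every task
        for batch in read_batches(task):
            if len(batch) >= current_limit:
                yield batch.slice(0, current_limit)
                break                   # exits only the inner loop; the outer
                                        # loop moves on to the next task and
                                        # yields yet another slice
            current_limit -= len(batch)
            yield batch

# limit=1 with two "data files": 2 rows come back instead of 1.
print(sum(len(b) for b in project_batches(tasks=[0, 1], limit=1)))  # 2
```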

kevinjqliu avatar Aug 11 '24 23:08 kevinjqliu

Something like this: #1042
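
For reference, a sketch of one possible fix (not necessarily what #1042 actually does): keep a single counter across all tasks and end the generator with `return` once the limit is satisfied, instead of `break`-ing out of only the inner loop. Reusing the `read_batches` stand-in from the sketch above:

```python
def project_batches_fixed(tasks, limit):
    remaining = limit                   # one counter shared across all tasks
    for task in tasks:
        for batch in read_batches(task):
            if len(batch) >= remaining:
                yield batch.slice(0, remaining)
                return                  # ends the generator entirely
            remaining -= len(batch)
            yield batch

print(sum(len(b) for b in project_batches_fixed(tasks=[0, 1], limit=1)))  # 1
```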

kevinjqliu avatar Aug 11 '24 23:08 kevinjqliu