[bug] `to_arrow_batch_reader` does not respect the given limit, returning more records than specified
Apache Iceberg version
main (development)
Please describe the bug 🐞
`to_arrow_batch_reader` bug
The bug is in `project_batches`, specifically in the way `yield` interacts with the two for-loops.
Here's a Jupyter notebook reproducing the issue; see the last cell, which compares the number of rows read by `to_arrow` vs. `to_arrow_batch_reader` (a minimal sketch of that comparison is included below).
This only occurs when there is more than one data file, a case that is not covered by the tests.
For more details, see https://github.com/apache/iceberg-python/issues/1032#issuecomment-2282819711
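A minimal sketch of that comparison, assuming a table backed by more than one data file (the catalog and table names here are hypothetical):

```python
from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")        # hypothetical catalog name
tbl = catalog.load_table("db.events")    # hypothetical table name

# Materializing the whole scan respects the limit...
assert len(tbl.scan(limit=1).to_arrow()) == 1

# ...but the batch reader yields one "limited" slice per data file,
# so the total row count can exceed the limit.
total = sum(batch.num_rows for batch in tbl.scan(limit=1).to_arrow_batch_reader())
print(total)  # > 1 when the table is backed by more than one data file
```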
Hi @kevinjqliu, thank you for raising this issue. If I understand it correctly, the bug occurs because we reset the limit counter when the specified limit is larger than the size of a single data file, is that correct?
The code works correctly for 1 data file with a given limit.
The bug appears when there are 2 data files, which means 2 `FileScanTask`s. For example, take an Iceberg table with 2 data files of 100MB each, and read it with a limit of 1: `tbl.scan(limit=1).to_arrow_batch_reader()`.
In this scenario, here's the order of events:
- The first task will produce `batches`
- The first `batch` is processed, and the limit is satisfied
- A slice of the `batch` is yielded
- The `break` statement will break out of the inner loop (the `for batch in batches` loop)
- The next task will then be processed, which will go through the above again and yield another slice (illustrated in the sketch below)
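To make that control flow concrete, here is a simplified sketch of the pattern, not the actual `project_batches` implementation: a generator with an outer loop over tasks and an inner loop over batches, where `break` only exits the inner loop.

```python
from typing import Iterable, Iterator

import pyarrow as pa


def buggy_batches(
    tasks: Iterable[Iterable[pa.RecordBatch]], limit: int
) -> Iterator[pa.RecordBatch]:
    """Simplified stand-in for the two-loop pattern described above."""
    for task_batches in tasks:       # one iteration per FileScanTask / data file
        total_rows = 0               # the counter starts over for every task
        for batch in task_batches:
            if total_rows + batch.num_rows >= limit:
                # Yield only the rows needed to satisfy the limit...
                yield batch.slice(0, limit - total_rows)
                # ...but `break` only exits this inner loop; the outer loop
                # moves on to the next task and yields another slice.
                break
            total_rows += batch.num_rows
            yield batch
```

With two tasks and `limit=1`, each task yields a 1-row slice, so the reader returns 2 rows in total.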
Something like this: #1042
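For reference, one possible shape of a fix (a hypothetical sketch, not necessarily what #1042 implements) is to keep a single counter across all tasks and end the generator outright once the limit is reached:

```python
from typing import Iterable, Iterator

import pyarrow as pa


def limited_batches(
    tasks: Iterable[Iterable[pa.RecordBatch]], limit: int
) -> Iterator[pa.RecordBatch]:
    """Same shape as the sketch above, but the limit spans all tasks."""
    total_rows = 0                   # shared across every task
    for task_batches in tasks:
        for batch in task_batches:
            if total_rows + batch.num_rows >= limit:
                yield batch.slice(0, limit - total_rows)
                return               # stop the whole generator, not just the inner loop
            total_rows += batch.num_rows
            yield batch
```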