[bug] `to_arrow_batch_reader` does not respect the given limit, returning more records than specified
Apache Iceberg version
main (development)
Please describe the bug 🐞
`to_arrow_batch_reader` bug
The bug is in `project_batches`, specifically in the way `yield` interacts with the two for-loops.
Here's a Jupyter notebook reproducing the issue; see the last cell, which compares the number of rows read by `to_arrow` vs. `to_arrow_batch_reader` (a minimal sketch of that comparison is included below).
This only occurs when there is more than one data file, a case that is not covered by the tests.
For more details, see https://github.com/apache/iceberg-python/issues/1032#issuecomment-2282819711
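A minimal sketch of that comparison, assuming a table backed by more than one data file (the catalog and table names here are hypothetical):

```python
from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")        # hypothetical catalog name
tbl = catalog.load_table("db.events")    # hypothetical table name

# Materializing the whole scan respects the limit...
assert len(tbl.scan(limit=1).to_arrow()) == 1

# ...but the batch reader yields one "limited" slice per data file,
# so the total row count can exceed the limit.
total = sum(batch.num_rows for batch in tbl.scan(limit=1).to_arrow_batch_reader())
print(total)  # > 1 when the table is backed by more than one data file
```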
Hi @kevinjqliu, thank you for raising this issue. If I understand it correctly, the bug occurs because we reset the limit counter when the specified limit is larger than the size of a single data file, is that correct?
The code works correctly for 1 data file with a given limit.
The bug appears when there are 2 data files, which means 2 `FileScanTask`s. For example, take an Iceberg table with 2 data files of 100MB each, and read it with a limit of 1: `tbl.scan(limit=1).to_arrow_batch_reader()`.
In this scenario, here's the order of events:
- The first task will produce `batches`
- The first `batch` is processed, and the limit is satisfied
- A slice of the `batch` is yielded
- The `break` statement will break out of the inner loop (the `for batch in batches` loop)
- The next task will then be processed, which will go through the above again and yield another slice (illustrated in the sketch below)
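To make that control flow concrete, here is a simplified sketch of the pattern, not the actual `project_batches` implementation: a generator with an outer loop over tasks and an inner loop over batches, where `break` only exits the inner loop.

```python
from typing import Iterable, Iterator

import pyarrow as pa


def buggy_batches(
    tasks: Iterable[Iterable[pa.RecordBatch]], limit: int
) -> Iterator[pa.RecordBatch]:
    """Simplified stand-in for the two-loop pattern described above."""
    for task_batches in tasks:       # one iteration per FileScanTask / data file
        total_rows = 0               # the counter starts over for every task
        for batch in task_batches:
            if total_rows + batch.num_rows >= limit:
                # Yield only the rows needed to satisfy the limit...
                yield batch.slice(0, limit - total_rows)
                # ...but `break` only exits this inner loop; the outer loop
                # moves on to the next task and yields another slice.
                break
            total_rows += batch.num_rows
            yield batch
```

With two tasks and `limit=1`, each task yields a 1-row slice, so the reader returns 2 rows in total.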
Something like this: #1042
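For reference, one possible shape of a fix (a hypothetical sketch, not necessarily what #1042 implements) is to keep a single counter across all tasks and end the generator outright once the limit is reached:

```python
from typing import Iterable, Iterator

import pyarrow as pa


def limited_batches(
    tasks: Iterable[Iterable[pa.RecordBatch]], limit: int
) -> Iterator[pa.RecordBatch]:
    """Same shape as the sketch above, but the limit spans all tasks."""
    total_rows = 0                   # shared across every task
    for task_batches in tasks:
        for batch in task_batches:
            if total_rows + batch.num_rows >= limit:
                yield batch.slice(0, limit - total_rows)
                return               # stop the whole generator, not just the inner loop
            total_rows += batch.num_rows
            yield batch
```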