[SPARK-49547][SQL][PYTHON] Support returning iterator of RecordBatches in applyInArrow
What changes were proposed in this pull request?
Add the option for applyInArrow to take a function that takes an iterator of RecordBatch and returns an iterator of RecordBatch.
Why are the changes needed?
Being limited to returning a single Table requires collecting all results in memory for a single batch. This can require excessive memory in edge cases that need to return a large number of rows. Currently the Python worker immediately converts a Table into its underlying batches, so barely any changes are required to accommodate this.
Does this PR introduce any user-facing change?
Yes, a new function signature supported by applyInArrow
How was this patch tested?
Updated existing unit tests to cover both the Table signatures and the RecordBatch signatures.
Was this patch authored or co-authored using generative AI tooling?
No