
[SPARK-49547][SQL][PYTHON] Support returning iterator of RecordBatches in applyInArrow

Open Kimahriman opened this issue 5 months ago • 14 comments

What changes were proposed in this pull request?

Add an option for applyInArrow to accept a function that takes an iterator of RecordBatch and returns an iterator of RecordBatch.

Why are the changes needed?

Being limited to returning a single Table requires collecting all results in memory for a single batch. This can require excessive memory in edge cases that return a large number of rows. The Python worker currently converts a returned Table into its underlying batches immediately, so very few changes are required to accommodate this.

Does this PR introduce any user-facing change?

Yes, applyInArrow supports a new function signature.

How was this patch tested?

Updated existing unit tests to cover both the Table and RecordBatch signatures.

Was this patch authored or co-authored using generative AI tooling?

No

Kimahriman, Sep 09 '24 16:09