[SPARK-48493][PYTHON] Enhance Python Datasource Reader with direct Arrow Batch support for improved performance
What changes were proposed in this pull request?
This pull request proposes enhancing the Python Datasource Reader by adding an option to yield Arrow batches directly. This change aims to significantly improve performance compared to the existing approach of yielding tuples or Rows. The implementation builds on the existing MapInArrow work (see SPARK-46253).
Why are the changes needed?
The changes are needed to address performance issues in the Python Datasource Reader. The current method of sending data as tuples or Rows is inefficient and slows down data processing. By allowing the Datasource Reader to yield Arrow batches directly, we can use the more efficient Arrow format and speed up data processing significantly. In a preliminary test with a High Energy Physics Datasource reader for the ROOT data format, this approach was up to 8x faster. Use cases involving large datasets benefit the most.
Does this PR introduce any user-facing change?
Yes, this PR introduces a user-facing change by adding an option to the Python Datasource Reader that allows users to yield Arrow batches directly.
How was this patch tested?
A new test was added to the Python Datasource test suite. Additionally, it was manually tested using a custom Python datasource for performance testing.
Was this patch authored or co-authored using generative AI tooling?
No
@allisonwang-db thank you for reviewing, I like your proposal. Please have a look at the latest update.
I will defer to @allisonwang-db
Hi @allisonwang-db I hope you're doing well! When you have a moment, could you please review the latest changes? Your feedback would be greatly appreciated.
@LucaCanali please re-trigger the failed tests (they seem unrelated to this change)
Somehow the test org.apache.spark.sql.streaming.ClientStreamingQuerySuite keeps failing, although it seems unrelated to this change?
Seems unrelated, but would you mind syncing the branch to the latest master and retriggering the tests? The change here isn't minor, so let's make sure the tests pass.
I guess this should be good to go now. Any further comments?
Thanks! Merging to master.
Thank you @allisonwang-db and @HyukjinKwon !