[SPARK-48493][PYTHON] Enhance Python Datasource Reader with direct Arrow Batch support for improved performance

Open LucaCanali opened this issue 1 year ago • 3 comments

What changes were proposed in this pull request?

This pull request proposes enhancing the Python Datasource Reader by adding an option to yield Arrow batches directly. This change aims to significantly improve performance compared to the existing approach of using tuples or Rows. The implementation takes advantage of the existing work with MapInArrow (referenced in SPARK-46253).

Why are the changes needed?

The changes are needed to address performance issues in the Python Datasource Reader. The current method of sending data as tuples or Rows is inefficient and slows down data processing. Allowing the Datasource Reader to yield Arrow batches directly lets it use the more efficient Arrow format, which speeds up data processing significantly. A preliminary test with a High Energy Physics Datasource reader for the ROOT data format showed this approach to be up to 8x faster, and use cases involving large datasets benefit the most.

Does this PR introduce any user-facing change?

Yes, this PR introduces a user-facing change by adding an option to the Python Datasource Reader that allows users to yield Arrow batches directly.

How was this patch tested?

A new test was added to the Python Datasource test suite. Additionally, it was manually tested using a custom Python datasource for performance testing.

Was this patch authored or co-authored using generative AI tooling?

No

LucaCanali avatar May 31 '24 20:05 LucaCanali

@allisonwang-db thank you for reviewing, I like your proposal. Please have a look at the latest update.

LucaCanali avatar Jun 13 '24 13:06 LucaCanali

I will defer to @allisonwang-db

HyukjinKwon avatar Jun 30 '24 23:06 HyukjinKwon

Hi @allisonwang-db I hope you're doing well! When you have a moment, could you please review the latest changes? Your feedback would be greatly appreciated.

LucaCanali avatar Aug 13 '24 12:08 LucaCanali

@LucaCanali please re-trigger the failed tests (they seem unrelated to this change)

allisonwang-db avatar Aug 29 '24 18:08 allisonwang-db

Somehow the test org.apache.spark.sql.streaming.ClientStreamingQuerySuite keeps failing, although it seems unrelated to this change?

LucaCanali avatar Aug 29 '24 20:08 LucaCanali

Seems unrelated, but mind syncing the branch to the latest master and retriggering the tests? The change here isn't minor, so let's make sure the tests pass.

HyukjinKwon avatar Aug 29 '24 23:08 HyukjinKwon

I guess this should be good to go now. Any further comments?

LucaCanali avatar Sep 04 '24 08:09 LucaCanali

Thanks! Merging to master.

allisonwang-db avatar Sep 04 '24 20:09 allisonwang-db

Thank you @allisonwang-db and @HyukjinKwon !

LucaCanali avatar Sep 05 '24 07:09 LucaCanali