[SPARK-40259][SQL] Support Parquet DSv2 in subquery plan merge
What changes were proposed in this pull request?
After https://github.com/apache/spark/pull/32298 we were able to merge scalar subquery plans, but DSv2 sources couldn't benefit from that improvement.
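As an illustration (not taken from the PR itself), this is the kind of query where scalar subquery merging helps: both subqueries scan the same table, so after merging they can be answered by a single scan. The table and column names here are hypothetical.

```sql
-- Both scalar subqueries read the same table `items`; once merged,
-- MAX(price) and AVG(price) can be computed in one pass over the data.
SELECT
  (SELECT MAX(price) FROM items) AS max_price,
  (SELECT AVG(price) FROM items) AS avg_price;
```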
The reason DSv2 sources are not supported by default is that SparkOptimizer.earlyScanPushDownRules builds different Scans in the logical plans before MergeScalarSubqueries is executed. Those Scans can have different pushed-down filters, pushed-down aggregates, and column pruning, which prevents merging the plans.
I would not alter the order of optimization rules, as MergeScalarSubqueries works better on well-optimized logical plans (plans closer to their final logical form, e.g. after InjectRuntimeFilter has already executed). Instead, I propose a new interface that a Scan can implement to indicate whether it can be merged with another Scan, and to perform the merge when it makes sense given the Scan's actual parameters.
This PR:
- adds a new interface `SupportsMerge` that `Scan`s can implement to indicate whether 2 `Scan`s can be merged, and
- adds an implementation of `SupportsMerge` to `ParquetScan` as the first DSv2 source. The merge only happens if the pushed-down data and partition filters and the pushed-down aggregates match.
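To make the idea concrete, here is a minimal, self-contained sketch of what a `SupportsMerge`-style mix-in could look like. All names and signatures below (`Scan`, `SupportsMerge`, `ToyParquetScan`, `mergeWith`) are illustrative stand-ins, not the actual Spark API; in particular, the real DSv2 `Scan` interface carries schema and batch-reading methods that are omitted here.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Optional;
import java.util.Set;

// Stub standing in for the DSv2 Scan interface (hypothetical, simplified).
interface Scan {
    List<String> readColumns();
}

// Hypothetical mix-in: a scan that can decide whether it is mergeable
// with another scan and, if so, produce the merged scan.
interface SupportsMerge extends Scan {
    Optional<Scan> mergeWith(Scan other);
}

// Toy Parquet-like scan: merging succeeds only when the pushed-down
// filters match; the merged scan reads the union of the column sets.
final class ToyParquetScan implements SupportsMerge {
    private final List<String> columns;
    private final Set<String> pushedFilters;

    ToyParquetScan(List<String> columns, Set<String> pushedFilters) {
        this.columns = columns;
        this.pushedFilters = pushedFilters;
    }

    @Override
    public List<String> readColumns() {
        return columns;
    }

    @Override
    public Optional<Scan> mergeWith(Scan other) {
        if (!(other instanceof ToyParquetScan)) {
            return Optional.empty();
        }
        ToyParquetScan o = (ToyParquetScan) other;
        // Different pushed-down filters would read different rows,
        // so the scans cannot share one underlying read.
        if (!pushedFilters.equals(o.pushedFilters)) {
            return Optional.empty();
        }
        // Union the projected columns, preserving insertion order.
        LinkedHashSet<String> merged = new LinkedHashSet<>(columns);
        merged.addAll(o.columns);
        return Optional.of(new ToyParquetScan(new ArrayList<>(merged), pushedFilters));
    }
}
```

Under this sketch, a rule like MergeScalarSubqueries could ask any two `SupportsMerge` scans whether they combine, and fall back to keeping them separate when `mergeWith` returns empty (e.g. when pushed-down filters differ).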
Why are the changes needed?
Scalar subquery merge can bring considerable performance improvement (see the original https://github.com/apache/spark/pull/32298 for the benchmarks) so DSv2 sources should also benefit from that feature.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Added new unit tests.
cc @cloud-fan, @sigmod, @singhpk234
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!
Thanks for the comments @singhpk234!
Unfortunately this PR got closed due to lack of reviews and can't be reopened. I'm happy to open a new one and take your suggestions into account, but first it would be great if a Spark committer could confirm that the proposed SupportsMerge scan interface makes sense and someone is willing to give feedback on the change. Any feedback is much appreciated, really.
@cloud-fan or @gengliangwang, would you perhaps be interested in this PR?