[SPARK-40259][SQL] Support Parquet DSv2 in subquery plan merge
What changes were proposed in this pull request?
After https://github.com/apache/spark/pull/32298 we were able to merge scalar subquery plans, but DSv2 sources couldn't benefit from that improvement.
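As an illustration (not taken from the PR itself), this is the kind of query where scalar subquery merging helps: both subqueries scan the same table, so after merging they can be answered by a single scan. The table and column names here are hypothetical.

```sql
-- Both scalar subqueries read the same table `items`; once merged,
-- MAX(price) and AVG(price) can be computed in one pass over the data.
SELECT
  (SELECT MAX(price) FROM items) AS max_price,
  (SELECT AVG(price) FROM items) AS avg_price;
```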
The reason DSv2 sources are not supported by default is that SparkOptimizer.earlyScanPushDownRules builds different Scans in the logical plans before MergeScalarSubqueries is executed. Those Scans can have different pushed-down filters, pushed-down aggregates, and column pruning, which prevents merging the plans.
I would not alter the order of optimization rules, as MergeScalarSubqueries works better on well-optimized logical plans (plans closer to their final logical form, e.g. after InjectRuntimeFilter has already executed). Instead, I propose a new interface that a Scan can implement to indicate whether it can be merged with another Scan, and to perform the merge when it makes sense given the Scan's actual parameters.
This PR:
- adds a new interface `SupportsMerge` that `Scan`s can implement to indicate whether 2 `Scan`s can be merged, and
- adds an implementation of `SupportsMerge` to `ParquetScan` as the first DSv2 source. The merge only happens if the pushed-down data and partition filters and the pushed-down aggregates match.
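To make the idea concrete, here is a minimal, self-contained sketch of what a `SupportsMerge`-style mix-in could look like. All names and signatures below (`Scan`, `SupportsMerge`, `ToyParquetScan`, `mergeWith`) are illustrative stand-ins, not the actual Spark API; in particular, the real DSv2 `Scan` interface carries schema and batch-reading methods that are omitted here.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Optional;
import java.util.Set;

// Stub standing in for the DSv2 Scan interface (hypothetical, simplified).
interface Scan {
    List<String> readColumns();
}

// Hypothetical mix-in: a scan that can decide whether it is mergeable
// with another scan and, if so, produce the merged scan.
interface SupportsMerge extends Scan {
    Optional<Scan> mergeWith(Scan other);
}

// Toy Parquet-like scan: merging succeeds only when the pushed-down
// filters match; the merged scan reads the union of the column sets.
final class ToyParquetScan implements SupportsMerge {
    private final List<String> columns;
    private final Set<String> pushedFilters;

    ToyParquetScan(List<String> columns, Set<String> pushedFilters) {
        this.columns = columns;
        this.pushedFilters = pushedFilters;
    }

    @Override
    public List<String> readColumns() {
        return columns;
    }

    @Override
    public Optional<Scan> mergeWith(Scan other) {
        if (!(other instanceof ToyParquetScan)) {
            return Optional.empty();
        }
        ToyParquetScan o = (ToyParquetScan) other;
        // Different pushed-down filters would read different rows,
        // so the scans cannot share one underlying read.
        if (!pushedFilters.equals(o.pushedFilters)) {
            return Optional.empty();
        }
        // Union the projected columns, preserving insertion order.
        LinkedHashSet<String> merged = new LinkedHashSet<>(columns);
        merged.addAll(o.columns);
        return Optional.of(new ToyParquetScan(new ArrayList<>(merged), pushedFilters));
    }
}
```

Under this sketch, a rule like MergeScalarSubqueries could ask any two `SupportsMerge` scans whether they combine, and fall back to keeping them separate when `mergeWith` returns empty (e.g. when pushed-down filters differ).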
Why are the changes needed?
Scalar subquery merge can bring considerable performance improvement (see the original https://github.com/apache/spark/pull/32298 for the benchmarks) so DSv2 sources should also benefit from that feature.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Added new unit tests.
cc @cloud-fan, @sigmod, @singhpk234
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!
Thanks for the comments @singhpk234!
Unfortunately this PR got closed due to lack of reviews and can't be reopened. I'm happy to open a new one and take your suggestions into account, but first it would be great if a Spark committer could confirm that the proposed SupportsMerge scan interface makes sense and someone is willing to give feedback on the change. Any feedback is much appreciated, really.
@cloud-fan or @gengliangwang, would you perhaps be interested in this PR?