spark
spark copied to clipboard
[SPARK-40086][SQL] Improve AliasAwareOutputPartitioning to take all aliases into account
What changes were proposed in this pull request?
Currently AliasAwareOutputPartitioning takes only the last alias by aliased expressions into account. We could avoid more shuffles with better alias handling.
Why are the changes needed?
Performance improvement.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Added new UT.
cc @cloud-fan
@cloud-fan, @imback82, can you please help to review this PR?
@ulysses-you, @cloud-fan, I've rebased this PR on top of master, that now includes multiTransform():
- The 1st commit of this PR is a cherry-pick of the first commit from https://github.com/apache/spark/pull/39556.
- The 2nd commit is from the original content of this PR and switches the implementation to
multiTransform(). - 3rd is the aliasMap split as distussed here: https://github.com/apache/spark/pull/37525#discussion_r1070970038 and required as
autoComtinuefeature was removed. - 4th adds early pruning, but this is a bit different to what we have in https://github.com/apache/spark/pull/39556. It is actually more efficient and we don't need "after transformation pruning".
- 5th (https://github.com/apache/spark/pull/37525/commits/98b7a15f1f190374884ea031373e90af14f92e35) contains test fixes in
PlannerSuiteas some of the previous expected results seems to be wrong. @ulysses-you please take a look as it fixes some of the tests you added in https://github.com/apache/spark/pull/39556.
Let me know your thougths.
I've rebased the PR on https://github.com/apache/spark/pull/39652, that is not yet merged, so there is an extra commit (https://github.com/apache/spark/pull/37525/commits/59646bbc26476ec957fd7bff8cbae317791dc228) in this PR that doesn't belong to here, but it will disappear once https://github.com/apache/spark/pull/39652 gets merged
I've rebased the PR on #39652, that is not yet merged, so there is an extra commit (59646bb) in this PR that doesn't belong to here, but it will disappear once #39652 gets merged
https://github.com/apache/spark/pull/39652 got merged, so rebased this PR on latest master.
@peter-toth can you retrigger the tests? The pyspark failures may be flaky.
thanks, merging to master/3.4 (as it fixes a bug in planned write)!
Thanks for the review!