spark [SPARK-40086][SQL] Improve AliasAwareOutputPartitioning to take all aliases into account

[SPARK-40086][SQL] Improve AliasAwareOutputPartitioning to take all aliases into account

Open peter-toth opened this issue 3 years ago • 2 comments

What changes were proposed in this pull request?

Currently AliasAwareOutputPartitioning takes only the last alias by aliased expressions into account. We could avoid more shuffles with better alias handling.

Why are the changes needed?

Performance improvement.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Added new UT.

Aug 15 '22 17:08 peter-toth

cc @cloud-fan

Aug 16 '22 08:08 peter-toth

@cloud-fan, @imback82, can you please help to review this PR?

Sep 07 '22 08:09 peter-toth

@ulysses-you, @cloud-fan, I've rebased this PR on top of master, that now includes multiTransform():

The 1st commit of this PR is a cherry-pick of the first commit from https://github.com/apache/spark/pull/39556.
The 2nd commit is from the original content of this PR and switches the implementation to multiTransform().
3rd is the aliasMap split as distussed here: https://github.com/apache/spark/pull/37525#discussion_r1070970038 and required as autoComtinue feature was removed.
4th adds early pruning, but this is a bit different to what we have in https://github.com/apache/spark/pull/39556. It is actually more efficient and we don't need "after transformation pruning".
5th (https://github.com/apache/spark/pull/37525/commits/98b7a15f1f190374884ea031373e90af14f92e35) contains test fixes in PlannerSuite as some of the previous expected results seems to be wrong. @ulysses-you please take a look as it fixes some of the tests you added in https://github.com/apache/spark/pull/39556.

Let me know your thougths.

Jan 17 '23 13:01 peter-toth

I've rebased the PR on https://github.com/apache/spark/pull/39652, that is not yet merged, so there is an extra commit (https://github.com/apache/spark/pull/37525/commits/59646bbc26476ec957fd7bff8cbae317791dc228) in this PR that doesn't belong to here, but it will disappear once https://github.com/apache/spark/pull/39652 gets merged

Jan 19 '23 15:01 peter-toth

I've rebased the PR on #39652, that is not yet merged, so there is an extra commit (59646bb) in this PR that doesn't belong to here, but it will disappear once #39652 gets merged

https://github.com/apache/spark/pull/39652 got merged, so rebased this PR on latest master.

Jan 19 '23 15:01 peter-toth

@peter-toth can you retrigger the tests? The pyspark failures may be flaky.

Jan 31 '23 01:01 cloud-fan

thanks, merging to master/3.4 (as it fixes a bug in planned write)!

Jan 31 '23 15:01 cloud-fan

Thanks for the review!

Jan 31 '23 15:01 peter-toth

spark spark copied to clipboard

[SPARK-40086][SQL] Improve AliasAwareOutputPartitioning to take all aliases into account

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

spark
spark copied to clipboard