spark icon indicating copy to clipboard operation
spark copied to clipboard

[SPARK-39991][SQL][AQE] Use available column statistics from completed query stages

Open andygrove opened this issue 3 years ago • 1 comments

What changes were proposed in this pull request?

AQE uses statistics from completed query stages and feeds them back into the logical optimizer. AQE currently only uses dataSize and numOutputRows and ignores any available attributeMap (column statistics).

This PR updates AQE to also populate attributeMap in the statistics that it uses for re-optimizing the plan.

Why are the changes needed?

These changes are needed so that Spark plugins that provide custom implementations of the ShuffleExchangeLike trait can leverage column statistics for better plan optimization during AQE execution.

Does this PR introduce any user-facing change?

No. The current Spark implementation of ShuffleExchangeLike (ShuffleExchangeExec) does not populate attributeMap, so this PR is a no-op for regular Spark.

How was this patch tested?

New unit test added.

andygrove avatar Aug 05 '22 21:08 andygrove

Can one of the admins verify this patch?

AmplabJenkins avatar Aug 06 '22 18:08 AmplabJenkins