hive icon indicating copy to clipboard operation
hive copied to clipboard

HIVE-28489: Partition the input data of GroupBy with GroupingSet

Open ngsg opened this issue 1 year ago • 2 comments

What changes were proposed in this pull request?

This patch introduces a new optimization for GroupBy with GroupingSet. GroupingSet often emits too much rows and becomes a bottleneck of query execution. The proposed optimization tries mitigating this situation by reducing the cardinality of GroupBy key. For the detailed explanation, please checkout the slides attached in the JIRA page (HIVE-28489).

This patch introduces a new configuration key: hive.optimize.grouping.set.threshold. The default value of 1,000,000,000 is chosen because it showed best performance in 10TB TPC-DS experiment. Setting this value to negative number disables the optimization.

Why are the changes needed?

To improve query execution time of GroupBy with GroupingSet.

Does this PR introduce any user-facing change?

No

Is the change a dependency upgrade?

No

How was this patch tested?

We tested this patch using 10TB TPC-DS experiment. This patch contains a qfile test to verify optimized query plan.

ngsg avatar Aug 30 '24 05:08 ngsg

The intention is admirable, but I am not 100% confident that this is the best approach because I need to know how GROUPING SETS works. It still needs time for me

okumin avatar Nov 13 '24 10:11 okumin

The intention is admirable, but I am not 100% confident that this is the best approach because I need to know how GROUPING SETS works. It still needs time for me

Hello. @okumin Trino seems to have adopted a similar scheme to optimize grouping sets. It appears that the modifications in this submission are intended to port the optimization rules from Trino to HIVE. Since Trino has been using this rule for a long time, and we have received positive feedback from users after introducing this patch, I personally believe that the approach used in this submission is a viable solution. After all, if we find a better solution, it would be quite easy for us to replace the changes introduced by this PR.

BsoBird avatar Nov 15 '24 03:11 BsoBird

@BsoBird Thanks. Could you give us a document or equivalent implementation of Trino?

okumin avatar Nov 18 '24 11:11 okumin