HIVE-28489: Partition the input data of GroupBy with GroupingSet
What changes were proposed in this pull request?
This patch introduces a new optimization for GroupBy with GroupingSet. GroupingSet often emits too much rows and becomes a bottleneck of query execution. The proposed optimization tries mitigating this situation by reducing the cardinality of GroupBy key. For the detailed explanation, please checkout the slides attached in the JIRA page (HIVE-28489).
This patch introduces a new configuration key: hive.optimize.grouping.set.threshold. The default value of 1,000,000,000 is chosen because it showed best performance in 10TB TPC-DS experiment. Setting this value to negative number disables the optimization.
Why are the changes needed?
To improve query execution time of GroupBy with GroupingSet.
Does this PR introduce any user-facing change?
No
Is the change a dependency upgrade?
No
How was this patch tested?
We tested this patch using 10TB TPC-DS experiment. This patch contains a qfile test to verify optimized query plan.
Quality Gate passed
Issues
13 New issues
0 Accepted issues
Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code
The intention is admirable, but I am not 100% confident that this is the best approach because I need to know how GROUPING SETS works. It still needs time for me
The intention is admirable, but I am not 100% confident that this is the best approach because I need to know how GROUPING SETS works. It still needs time for me
Hello. @okumin Trino seems to have adopted a similar scheme to optimize grouping sets. It appears that the modifications in this submission are intended to port the optimization rules from Trino to HIVE. Since Trino has been using this rule for a long time, and we have received positive feedback from users after introducing this patch, I personally believe that the approach used in this submission is a viable solution. After all, if we find a better solution, it would be quite easy for us to replace the changes introduced by this PR.
@BsoBird Thanks. Could you give us a document or equivalent implementation of Trino?
Quality Gate passed
Issues
9 New issues
0 Accepted issues
Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code