[Enhancement] Use more sophisticated statistics to infer skew for `GroupByCountDistinctDataSkewEliminateRule`
Why I'm doing:
Given a query like
select
`TABLE`.`GROUP`,
count(distinct `TABLE`.`DISTINCT`) as cnt
from
`TABLE`
group by
`TABLE`.`GROUP`
where GROUP may be skewed heavily, the actual aggregation of the skewed group is executed single-threaded. For this reason there is the GroupByCountDistinctDataSkewEliminateRule, which actually did not consider sophisticated column statistics. This would yield to the rule only being activated very seldomly.
For example, for a case with 100M rows and the GROUP column being skewed heavily:
select `TABLE`.`GROUP`, count(`TABLE`.`GROUP`) from `TABLE` group by 1 order by 2 desc limit 5;
+-------+--------------------+
| GROUP | count(TABLE.GROUP) |
+-------+--------------------+
| 0 | 98175725 |
| 1 | 68181 |
| 2 | 40059 |
| 3 | 28474 |
| 4 | 22114 |
+-------+--------------------+
executing the above query without the change takes >50s (profile), while with the change it takes <12s (profile) due to the rewrite.
Note the whole rule is behind enable_distinct_column_bucketization (which is default false).
What I'm doing:
- Weighting cost of
GroupByCountDistinctDataSkewEliminateRulewas previously only done by looking at the cardinality of the group by columns; now it is also done based on histograms and null fractions. The old logic is still there, I just added a new condition in which the rule may trigger. - Created
DataSkewutility which can be re-used to check a column for data skew instead of implementing it in every rule itself.
What type of PR is this:
- [ ] BugFix
- [ ] Feature
- [x] Enhancement
- [ ] Refactor
- [ ] UT
- [ ] Doc
- [ ] Tool
Does this PR entail a change in behavior?
- [ ] Yes, this PR will result in a change in behavior.
- [x] No, this PR will not result in a change in behavior.
If yes, please specify the type of change:
- [ ] Interface/UI changes: syntax, type conversion, expression evaluation, display information
- [ ] Parameter changes: default values, similar parameters but with different default values
- [ ] Policy changes: use new policy to replace old one, functionality automatically enabled
- [ ] Feature removed
- [ ] Miscellaneous: upgrade & downgrade compatibility, etc.
Checklist:
- [x] I have added test cases for my bug fix or my new feature
- [ ] This pr needs user documentation (for new or modified features or behaviors)
- [ ] I have added documentation for my new feature or new function
- [ ] This is a backport pr
Bugfix cherry-pick branch check:
- [x] I have checked the version labels which the pr will be auto-backported to the target branch
- [x] 4.0
- [ ] 3.5
- [ ] 3.4
- [ ] 3.3
[!NOTE] Improves skew handling in count-distinct aggregations by leveraging histograms/null fractions in the cost model, introduces a reusable
DataSkewutility, relocatesDataSkewInfotoskew, and adds targeted tests.
- Optimizer / Cost Model (
CostModel):
- Enhance penalty calculation for
GROUP BY COUNT(DISTINCT ...)by detecting skew via histograms and null fractions usingDataSkew(returns factors0.2/0.5/1.5).- Add helpers:
getGroupBysWithoutDistinctColumn,isGroupBySkewed,isGroupByLowCardinality.- Rule (
GroupByCountDistinctDataSkewEliminateRule):
- Refactor
checklogic for readability; behavior unchanged.- Operators:
- Update to use
com.starrocks.sql.optimizer.skew.DataSkewInfoinLogicalAggregationOperatorandPhysicalHashAggregateOperator.- Skew Utilities:
- New
com.starrocks.sql.optimizer.skew.DataSkewwith thresholdedisColumnSkewed(...)API.- Move
DataSkewInfotocom.starrocks.sql.optimizer.skewand simplify fields/constructor.- Tests:
- Add
DataSkewTest(null/histogram skew, thresholds) andDistinctAggSkewTest(plan rewrites with/without skew).Written by Cursor Bugbot for commit 1299736197c9ed6d4bab6df60cc964e562a031b6. This will update automatically on new commits. Configure here.
๐งช CI Insights
Here's what we observed from your CI run for 15224e78.
๐ข All jobs passed!
But CI Insights is watching ๐
@cursor review
please fix the code style first
@cursor review
Quality Gate passed
Issues
11 New issues
0 Accepted issues
Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code
The profiles in the pr description are the same. all take 56s
The profiles in the pr description are the same. all take 56s
Sorry, linked the wrong gist, should now be the correct one
[Java-Extensions Incremental Coverage Report]
:white_check_mark: pass : 0 / 0 (0%)
[BE Incremental Coverage Report]
:white_check_mark: pass : 0 / 0 (0%)
[FE Incremental Coverage Report]
:white_check_mark: pass : 75 / 86 (87.21%)
file detail
| path | covered_line | new_line | coverage | not_covered_line_detail | |
|---|---|---|---|---|---|
| :large_blue_circle: | com/starrocks/sql/optimizer/skew/DataSkewInfo.java | 8 | 14 | 57.14% | [33, 34, 41, 42, 45, 46] |
| :large_blue_circle: | com/starrocks/sql/optimizer/cost/CostModel.java | 34 | 37 | 91.89% | [297, 313, 324] |
| :large_blue_circle: | com/starrocks/sql/optimizer/skew/DataSkew.java | 25 | 27 | 92.59% | [26, 37] |
| :large_blue_circle: | com/starrocks/qe/SessionVariable.java | 4 | 4 | 100.00% | [] |
| :large_blue_circle: | com/starrocks/sql/optimizer/rule/transformation/GroupByCountDistinctDataSkewEliminateRule.java | 4 | 4 | 100.00% | [] |
Hey @satanson , can you have another look? I had to dismiss your approval because I incorporated some of Murphys suggested changes.
@Mergifyio backport branch-4.0
backport branch-4.0
โ Backports have been created
- #66955 [Enhancement] Use more sophisticated statistics to infer skew for
GroupByCountDistinctDataSkewEliminateRule(backport #66640) has been created for branchbranch-4.0but encountered conflicts