starrocks icon indicating copy to clipboard operation
starrocks copied to clipboard

[Enhancement] Use more sophisticated statistics to infer skew for `GroupByCountDistinctDataSkewEliminateRule`

Open OliLay opened this issue 2 weeks ago โ€ข 9 comments

Why I'm doing:

Given a query like

select
    `TABLE`.`GROUP`,
    count(distinct `TABLE`.`DISTINCT`) as cnt
from
    `TABLE`
group by
    `TABLE`.`GROUP`

where GROUP may be skewed heavily, the actual aggregation of the skewed group is executed single-threaded. For this reason there is the GroupByCountDistinctDataSkewEliminateRule, which actually did not consider sophisticated column statistics. This would yield to the rule only being activated very seldomly. For example, for a case with 100M rows and the GROUP column being skewed heavily:

select `TABLE`.`GROUP`, count(`TABLE`.`GROUP`) from `TABLE` group by 1 order by 2 desc limit 5;
+-------+--------------------+
| GROUP | count(TABLE.GROUP) |
+-------+--------------------+
|     0 |           98175725 |
|     1 |              68181 |
|     2 |              40059 |
|     3 |              28474 |
|     4 |              22114 |
+-------+--------------------+

executing the above query without the change takes >50s (profile), while with the change it takes <12s (profile) due to the rewrite.

Note the whole rule is behind enable_distinct_column_bucketization (which is default false).

What I'm doing:

  • Weighting cost of GroupByCountDistinctDataSkewEliminateRule was previously only done by looking at the cardinality of the group by columns; now it is also done based on histograms and null fractions. The old logic is still there, I just added a new condition in which the rule may trigger.
  • Created DataSkew utility which can be re-used to check a column for data skew instead of implementing it in every rule itself.

What type of PR is this:

  • [ ] BugFix
  • [ ] Feature
  • [x] Enhancement
  • [ ] Refactor
  • [ ] UT
  • [ ] Doc
  • [ ] Tool

Does this PR entail a change in behavior?

  • [ ] Yes, this PR will result in a change in behavior.
  • [x] No, this PR will not result in a change in behavior.

If yes, please specify the type of change:

  • [ ] Interface/UI changes: syntax, type conversion, expression evaluation, display information
  • [ ] Parameter changes: default values, similar parameters but with different default values
  • [ ] Policy changes: use new policy to replace old one, functionality automatically enabled
  • [ ] Feature removed
  • [ ] Miscellaneous: upgrade & downgrade compatibility, etc.

Checklist:

  • [x] I have added test cases for my bug fix or my new feature
  • [ ] This pr needs user documentation (for new or modified features or behaviors)
    • [ ] I have added documentation for my new feature or new function
  • [ ] This is a backport pr

Bugfix cherry-pick branch check:

  • [x] I have checked the version labels which the pr will be auto-backported to the target branch
    • [x] 4.0
    • [ ] 3.5
    • [ ] 3.4
    • [ ] 3.3

[!NOTE] Improves skew handling in count-distinct aggregations by leveraging histograms/null fractions in the cost model, introduces a reusable DataSkew utility, relocates DataSkewInfo to skew, and adds targeted tests.

  • Optimizer / Cost Model (CostModel):
    • Enhance penalty calculation for GROUP BY COUNT(DISTINCT ...) by detecting skew via histograms and null fractions using DataSkew (returns factors 0.2/0.5/1.5).
    • Add helpers: getGroupBysWithoutDistinctColumn, isGroupBySkewed, isGroupByLowCardinality.
  • Rule (GroupByCountDistinctDataSkewEliminateRule):
    • Refactor check logic for readability; behavior unchanged.
  • Operators:
    • Update to use com.starrocks.sql.optimizer.skew.DataSkewInfo in LogicalAggregationOperator and PhysicalHashAggregateOperator.
  • Skew Utilities:
    • New com.starrocks.sql.optimizer.skew.DataSkew with thresholded isColumnSkewed(...) API.
    • Move DataSkewInfo to com.starrocks.sql.optimizer.skew and simplify fields/constructor.
  • Tests:
    • Add DataSkewTest (null/histogram skew, thresholds) and DistinctAggSkewTest (plan rewrites with/without skew).

Written by Cursor Bugbot for commit 1299736197c9ed6d4bab6df60cc964e562a031b6. This will update automatically on new commits. Configure here.

OliLay avatar Dec 11 '25 10:12 OliLay

CLA assistant check
All committers have signed the CLA.

CLAassistant avatar Dec 11 '25 10:12 CLAassistant

๐Ÿงช CI Insights

Here's what we observed from your CI run for 15224e78.

๐ŸŸข All jobs passed!

But CI Insights is watching ๐Ÿ‘€

mergify[bot] avatar Dec 11 '25 10:12 mergify[bot]

@cursor review

alvin-celerdata avatar Dec 11 '25 17:12 alvin-celerdata

please fix the code style first

stephen-shelby avatar Dec 12 '25 08:12 stephen-shelby

please fix the code style first

Done 458a2b6

OliLay avatar Dec 12 '25 08:12 OliLay

@cursor review

alvin-celerdata avatar Dec 12 '25 17:12 alvin-celerdata

The profiles in the pr description are the same. all take 56s

stephen-shelby avatar Dec 14 '25 11:12 stephen-shelby

The profiles in the pr description are the same. all take 56s

Sorry, linked the wrong gist, should now be the correct one

OliLay avatar Dec 14 '25 12:12 OliLay

[Java-Extensions Incremental Coverage Report]

:white_check_mark: pass : 0 / 0 (0%)

github-actions[bot] avatar Dec 15 '25 12:12 github-actions[bot]

[BE Incremental Coverage Report]

:white_check_mark: pass : 0 / 0 (0%)

github-actions[bot] avatar Dec 15 '25 12:12 github-actions[bot]

[FE Incremental Coverage Report]

:white_check_mark: pass : 75 / 86 (87.21%)

file detail

path covered_line new_line coverage not_covered_line_detail
:large_blue_circle: com/starrocks/sql/optimizer/skew/DataSkewInfo.java 8 14 57.14% [33, 34, 41, 42, 45, 46]
:large_blue_circle: com/starrocks/sql/optimizer/cost/CostModel.java 34 37 91.89% [297, 313, 324]
:large_blue_circle: com/starrocks/sql/optimizer/skew/DataSkew.java 25 27 92.59% [26, 37]
:large_blue_circle: com/starrocks/qe/SessionVariable.java 4 4 100.00% []
:large_blue_circle: com/starrocks/sql/optimizer/rule/transformation/GroupByCountDistinctDataSkewEliminateRule.java 4 4 100.00% []

github-actions[bot] avatar Dec 15 '25 12:12 github-actions[bot]

Hey @satanson , can you have another look? I had to dismiss your approval because I incorporated some of Murphys suggested changes.

OliLay avatar Dec 16 '25 07:12 OliLay

@Mergifyio backport branch-4.0

github-actions[bot] avatar Dec 18 '25 09:12 github-actions[bot]

backport branch-4.0

โœ… Backports have been created

mergify[bot] avatar Dec 18 '25 09:12 mergify[bot]