Why I'm doing:

Given a query like

select
    `TABLE`.`GROUP`,
    count(distinct `TABLE`.`DISTINCT`) as cnt
from
    `TABLE`
group by
    `TABLE`.`GROUP`

where GROUP may be skewed heavily, the actual aggregation of the skewed group is executed single-threaded. For this reason there is the GroupByCountDistinctDataSkewEliminateRule, which actually did not consider sophisticated column statistics. This would yield to the rule only being activated very seldomly. For example, for a case with 100M rows and the GROUP column being skewed heavily:

select `TABLE`.`GROUP`, count(`TABLE`.`GROUP`) from `TABLE` group by 1 order by 2 desc limit 5;
+-------+--------------------+
| GROUP | count(TABLE.GROUP) |
+-------+--------------------+
|     0 |           98175725 |
|     1 |              68181 |
|     2 |              40059 |
|     3 |              28474 |
|     4 |              22114 |
+-------+--------------------+

executing the above query without the change takes >50s (profile), while with the change it takes <12s (profile) due to the rewrite.

Note the whole rule is behind enable_distinct_column_bucketization (which is default false).

What I'm doing:

Weighting cost of GroupByCountDistinctDataSkewEliminateRule was previously only done by looking at the cardinality of the group by columns; now it is also done based on histograms and null fractions. The old logic is still there, I just added a new condition in which the rule may trigger.
Created DataSkew utility which can be re-used to check a column for data skew instead of implementing it in every rule itself.

What type of PR is this:

[ ] BugFix
[ ] Feature
[x] Enhancement
[ ] Refactor
[ ] UT
[ ] Doc
[ ] Tool

Does this PR entail a change in behavior?

[ ] Yes, this PR will result in a change in behavior.
[x] No, this PR will not result in a change in behavior.

If yes, please specify the type of change:

[ ] Interface/UI changes: syntax, type conversion, expression evaluation, display information
[ ] Parameter changes: default values, similar parameters but with different default values
[ ] Policy changes: use new policy to replace old one, functionality automatically enabled
[ ] Feature removed
[ ] Miscellaneous: upgrade & downgrade compatibility, etc.

Checklist:

[x] I have added test cases for my bug fix or my new feature
[ ] This pr needs user documentation (for new or modified features or behaviors)
- [ ] I have added documentation for my new feature or new function
[ ] This is a backport pr

Bugfix cherry-pick branch check:

[x] I have checked the version labels which the pr will be auto-backported to the target branch
- [x] 4.0
- [ ] 3.5
- [ ] 3.4
- [ ] 3.3

[!NOTE] Improves skew handling in count-distinct aggregations by leveraging histograms/null fractions in the cost model, introduces a reusable DataSkew utility, relocates DataSkewInfo to skew, and adds targeted tests.

Optimizer / Cost Model (CostModel):

Enhance penalty calculation for GROUP BY COUNT(DISTINCT ...) by detecting skew via histograms and null fractions using DataSkew (returns factors 0.2/0.5/1.5).

Add helpers: getGroupBysWithoutDistinctColumn, isGroupBySkewed, isGroupByLowCardinality.

Rule (GroupByCountDistinctDataSkewEliminateRule):

Refactor check logic for readability; behavior unchanged.

Operators:

Update to use com.starrocks.sql.optimizer.skew.DataSkewInfo in LogicalAggregationOperator and PhysicalHashAggregateOperator.

Skew Utilities:

New com.starrocks.sql.optimizer.skew.DataSkew with thresholded isColumnSkewed(...) API.

Move DataSkewInfo to com.starrocks.sql.optimizer.skew and simplify fields/constructor.

Tests:

Add DataSkewTest (null/histogram skew, thresholds) and DistinctAggSkewTest (plan rewrites with/without skew).

^{Written by Cursor Bugbot for commit 1299736197c9ed6d4bab6df60cc964e562a031b6. This will update automatically on new commits. Configure here.}

Dec 11 '25 10:12 OliLay

All committers have signed the CLA.

Dec 11 '25 10:12 CLAassistant

🧪 CI Insights

Here's what we observed from your CI run for 15224e78.

🟢 All jobs passed!

But CI Insights is watching 👀

Dec 11 '25 10:12 mergify[bot]

@cursor review

Dec 11 '25 17:12 alvin-celerdata

please fix the code style first

Dec 12 '25 08:12 stephen-shelby

please fix the code style first

Done 458a2b6

Dec 12 '25 08:12 OliLay

@cursor review

Dec 12 '25 17:12 alvin-celerdata

Quality Gate passed

Issues
11 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

Dec 13 '25 01:12 sonarqubecloud[bot]

The profiles in the pr description are the same. all take 56s

Dec 14 '25 11:12 stephen-shelby

The profiles in the pr description are the same. all take 56s

Sorry, linked the wrong gist, should now be the correct one

Dec 14 '25 12:12 OliLay

[Java-Extensions Incremental Coverage Report]

:white_check_mark: pass : 0 / 0 (0%)

Dec 15 '25 12:12 github-actions[bot]

[BE Incremental Coverage Report]

:white_check_mark: pass : 0 / 0 (0%)

Dec 15 '25 12:12 github-actions[bot]

[FE Incremental Coverage Report]

:white_check_mark: pass : 75 / 86 (87.21%)

file detail

	path	covered_line	new_line	coverage	not_covered_line_detail
:large_blue_circle:	com/starrocks/sql/optimizer/skew/DataSkewInfo.java	8	14	57.14%	[33, 34, 41, 42, 45, 46]
:large_blue_circle:	com/starrocks/sql/optimizer/cost/CostModel.java	34	37	91.89%	[297, 313, 324]
:large_blue_circle:	com/starrocks/sql/optimizer/skew/DataSkew.java	25	27	92.59%	[26, 37]
:large_blue_circle:	com/starrocks/qe/SessionVariable.java	4	4	100.00%	[]
:large_blue_circle:	com/starrocks/sql/optimizer/rule/transformation/GroupByCountDistinctDataSkewEliminateRule.java	4	4	100.00%	[]

Dec 15 '25 12:12 github-actions[bot]

Hey @satanson , can you have another look? I had to dismiss your approval because I incorporated some of Murphys suggested changes.

Dec 16 '25 07:12 OliLay

@Mergifyio backport branch-4.0

Dec 18 '25 09:12 github-actions[bot]

backport branch-4.0

✅ Backports have been created

#66955 [Enhancement] Use more sophisticated statistics to infer skew for GroupByCountDistinctDataSkewEliminateRule (backport #66640) has been created for branch branch-4.0 but encountered conflicts

Dec 18 '25 09:12 mergify[bot]

[Enhancement] Use more sophisticated statistics to infer skew for `GroupByCountDistinctDataSkewEliminateRule`

Why I'm doing:

What I'm doing:

What type of PR is this:

Checklist:

Bugfix cherry-pick branch check:

🧪 CI Insights

🟢 All jobs passed!

Quality Gate passed

[Java-Extensions Incremental Coverage Report]

[BE Incremental Coverage Report]

[FE Incremental Coverage Report]

file detail

✅ Backports have been created