[Enhancement] Optimize deltaRows with lazy evaluation for large partition tables (backport #66381)
Why I'm doing:
When a table has a large number of partitions (e.g., 36,000), the deltaRows method is called during statistics calculation for every DeriveStatsTask. Previously, deltaRows was invoked unconditionally in getPartitionRows, even when no partitions actually needed delta compensation. This caused severe performance issues:
- Unnecessary computation:
deltaRowsiterates through all partitions and queries statistics, even whendeltaRowsresult is never used - Expensive operations: Each
deltaRowscall involves:- Querying statistics for all 36,000 partitions
- Iterating through all partitions
- Multiple HashMap operations
The root cause is that deltaRows was computed eagerly at the beginning of getPartitionRows, but only used conditionally when needDelta is true.
What I'm doing:
Implement lazy evaluation of deltaRows in getPartitionRows:
Fixes #issue
What type of PR is this:
- [ ] BugFix
- [ ] Feature
- [x] Enhancement
- [ ] Refactor
- [ ] UT
- [ ] Doc
- [ ] Tool
Does this PR entail a change in behavior?
- [ ] Yes, this PR will result in a change in behavior.
- [x] No, this PR will not result in a change in behavior.
If yes, please specify the type of change:
- [ ] Interface/UI changes: syntax, type conversion, expression evaluation, display information
- [ ] Parameter changes: default values, similar parameters but with different default values
- [ ] Policy changes: use new policy to replace old one, functionality automatically enabled
- [ ] Feature removed
- [ ] Miscellaneous: upgrade & downgrade compatibility, etc.
Checklist:
- [ ] I have added test cases for my bug fix or my new feature
- [ ] This pr needs user documentation (for new or modified features or behaviors)
- [ ] I have added documentation for my new feature or new function
- [x] This is a backport pr
Bugfix cherry-pick branch check:
- [x] I have checked the version labels which the pr will be auto-backported to the target branch
- [x] 4.0
- [x] 3.5
- [x] 3.4
- [x] 3.3
[!NOTE] Lazily computes deltaRows in getPartitionRows and only when needed, reducing overhead on large partition tables.
- Optimizer/Statistics (
StatisticsCalcUtils#getPartitionRows):
- Implement lazy evaluation of
deltaRows(compute once, only if any partition needs delta compensation).- Introduce
statsUpdateTimeandneedDeltato streamline conditions for applyingdeltaRows.- Preserve existing behavior while avoiding unnecessary full-partition scans when stats are up-to-date.
Written by Cursor Bugbot for commit b05ff64772558f0f3be1d4ae0cfb0eeb16d26898. This will update automatically on new commits. Configure here.
This is an automatic backport of pull request #66381 done by [Mergify](https://mergify.com).
[!NOTE] Computes
deltaRowsonly when any partition needs delta compensation, reducing unnecessary full-partition scans during stats calculation.
- Optimizer/Statistics (
StatisticsCalcUtils#getPartitionRows):
- Implement lazy evaluation for
deltaRows(compute once, only if needed per-partition).- Introduce
statsUpdateTimeandneedDeltato simplify delta application conditions.- Preserve row count logic while avoiding costly iteration over all partitions when stats are current.
Written by Cursor Bugbot for commit e9180752fa9e889982c32be041a41205f090820c. This will update automatically on new commits. Configure here.
๐งช CI Insights
Here's what we observed from your CI run for e9180752.
โ Passed Jobs With Interesting Signals
| Pipeline | Job | Signal | Health on branch-3.5 |
Retries | ๐ CI Insights | ๐ Logs |
|---|---|---|---|---|---|---|
CI PIPELINE - BRANCH |
FE UT |
Base branch is broken, but retries were needed. Could be early signs of flakiness ๐ | 1 |
View | View | |
RUN CHECKER |
Base branch is healthy, but retries were needed. Could be early signs of flakiness ๐ | 1 |
View | View | ||
PR CHECKER |
automerge-check |
Base branch is broken, but the job passed. Looks like this might be a real fix ๐ช | 0 |
View | View |
@cursor review