[Enhancement] Optimize deltaRows with lazy evaluation for large partition tables (backport #66381)
Why I'm doing:
When a table has a large number of partitions (e.g., 36,000), the deltaRows method is called during statistics calculation for every DeriveStatsTask. Previously, deltaRows was invoked unconditionally in getPartitionRows, even when no partitions actually needed delta compensation. This caused severe performance issues:
- Unnecessary computation:
deltaRowsiterates through all partitions and queries statistics, even whendeltaRowsresult is never used - Expensive operations: Each
deltaRowscall involves:- Querying statistics for all 36,000 partitions
- Iterating through all partitions
- Multiple HashMap operations
The root cause is that deltaRows was computed eagerly at the beginning of getPartitionRows, but only used conditionally when needDelta is true.
What I'm doing:
Implement lazy evaluation of deltaRows in getPartitionRows:
Fixes #issue
What type of PR is this:
- [ ] BugFix
- [ ] Feature
- [x] Enhancement
- [ ] Refactor
- [ ] UT
- [ ] Doc
- [ ] Tool
Does this PR entail a change in behavior?
- [ ] Yes, this PR will result in a change in behavior.
- [x] No, this PR will not result in a change in behavior.
If yes, please specify the type of change:
- [ ] Interface/UI changes: syntax, type conversion, expression evaluation, display information
- [ ] Parameter changes: default values, similar parameters but with different default values
- [ ] Policy changes: use new policy to replace old one, functionality automatically enabled
- [ ] Feature removed
- [ ] Miscellaneous: upgrade & downgrade compatibility, etc.
Checklist:
- [ ] I have added test cases for my bug fix or my new feature
- [ ] This pr needs user documentation (for new or modified features or behaviors)
- [ ] I have added documentation for my new feature or new function
- [x] This is a backport pr
Bugfix cherry-pick branch check:
- [x] I have checked the version labels which the pr will be auto-backported to the target branch
- [x] 4.0
- [x] 3.5
- [x] 3.4
- [x] 3.3
[!NOTE] Lazily computes deltaRows in getPartitionRows and only when needed, reducing overhead on large partition tables.
- Optimizer/Statistics (
StatisticsCalcUtils#getPartitionRows):
- Implement lazy evaluation of
deltaRows(compute once, only if any partition needs delta compensation).- Introduce
statsUpdateTimeandneedDeltato streamline conditions for applyingdeltaRows.- Preserve existing behavior while avoiding unnecessary full-partition scans when stats are up-to-date.
Written by Cursor Bugbot for commit b05ff64772558f0f3be1d4ae0cfb0eeb16d26898. This will update automatically on new commits. Configure here.
This is an automatic backport of pull request #66381 done by [Mergify](https://mergify.com).
[!NOTE] Lazily computes
deltaRowsonly when at least one partition needs it, reducing overhead for large-partition tables while preserving behavior.
- Optimizer/Statistics:
StatisticsCalcUtils#getPartitionRows:
- Implement lazy evaluation of
deltaRows(compute once, only if any partition needs compensation).- Introduce
statsUpdateTimeandneedDeltato streamline conditions usinglastWorkTimestamp/updateTime.- Avoids unnecessary full-partition scans when stats are up-to-date.
Written by Cursor Bugbot for commit 70868b27f2a26cfda95863ba8c8e3176db11cd3c. This will update automatically on new commits. Configure here.
🧪 CI Insights
Here's what we observed from your CI run for 70868b27.
🟢 All jobs passed!
But CI Insights is watching 👀
@cursor review