statistics: use another way to merge topn
What problem does this PR solve?
Issue Number: ref #50761
Problem Summary:
What is changed and how it works?
We use the property that items inside both the TopN and the histogram are ordered to speed up the process.
And use a heuristic cutting:
- We record the avg num per distinct value of each histogram
notNullCount / ndv_in_hist. - Then for a topn item, we can use the mentioned property to know the TopNs that contain the item.
- Then we can easily know in which partition the item occurs in the histogram.
- The maximum possible occurrence of this item is
the sum occurrence of the affected TopNs + the avg num per distinct value of each affected histogram - If the maximum possible occurrence is still smaller than the currently maintained smallest global TopN item, we don't need to insert this one into the heap and test it.
In this way, the speed is improved while the CPU is saved.
The CPU usage. Previous VS This pull.
Running tool: /DATA/disk4/yiding/go/bin/go test -benchmem -run=^$ -tags intest,deadlock -bench ^BenchmarkMergePartTopN2GlobalTopNWithHists$ github.com/pingcap/tidb/pkg/statistics/handle/globalstats
goos: linux
goarch: amd64
pkg: github.com/pingcap/tidb/pkg/statistics/handle/globalstats
cpu: Intel(R) Xeon(R) Gold 6240 CPU @ 2.60GHz
BenchmarkMergePartTopN2GlobalTopNWithHists/Size100-72 159 7249785 ns/op 34714 B/op 113 allocs/op
BenchmarkMergePartTopN2GlobalTopNWithHists/Size1000-72 9 117386869 ns/op 100136 B/op 1013 allocs/op
BenchmarkMergePartTopN2GlobalTopNWithHists/Size2000-72 4 272247218 ns/op 173416 B/op 2013 allocs/op
BenchmarkMergePartTopN2GlobalTopNWithHists/Size5000-72 2 787035728 ns/op 393256 B/op 5013 allocs/op
BenchmarkMergePartTopN2GlobalTopNWithHists/Size10000-72 1 1919716271 ns/op 759656 B/op 10013 allocs/op
PASS
ok github.com/pingcap/tidb/pkg/statistics/handle/globalstats 17.839s
Running tool: /DATA/disk4/yiding/go/bin/go test -benchmem -run=^$ -tags intest,deadlock -bench ^BenchmarkMergePartTopN2GlobalTopNWithHists$ github.com/pingcap/tidb/pkg/statistics/handle/globalstats
goos: linux
goarch: amd64
pkg: github.com/pingcap/tidb/pkg/statistics/handle/globalstats
cpu: Intel(R) Xeon(R) Gold 6240 CPU @ 2.60GHz
BenchmarkMergePartTopN2GlobalTopNWithHists/Size100-72 122 10071203 ns/op 177982 B/op 30 allocs/op
BenchmarkMergePartTopN2GlobalTopNWithHists/Size1000-72 6 186292872 ns/op 176418 B/op 28 allocs/op
BenchmarkMergePartTopN2GlobalTopNWithHists/Size2000-72 3 398591599 ns/op 178269 B/op 31 allocs/op
BenchmarkMergePartTopN2GlobalTopNWithHists/Size5000-72 1 1436609512 ns/op 179432 B/op 33 allocs/op
BenchmarkMergePartTopN2GlobalTopNWithHists/Size10000-72 1 4363627038 ns/op 177336 B/op 31 allocs/op
PASS
ok github.com/pingcap/tidb/pkg/statistics/handle/globalstats 18.587s
n=100, 7249785/10071203=72%
n=1000 117386869/186292872=63%
n=2000 272247218/398591599=68%
n=5000 787035728/1436609512=55%
n=10000 1919716271/4363627038=44%
The mem usage is also in an acceptable range.
Check List
Tests
- [x] Unit test
- [x] Integration test
- [ ] Manual test (add detailed scripts or steps below)
- [ ] No need to test
- [ ] I checked and no code files have been changed.
Side effects
- [ ] Performance regression: Consumes more CPU
- [ ] Performance regression: Consumes more Memory
- [ ] Breaking backward compatibility
Documentation
- [ ] Affects user behaviors
- [ ] Contains syntax changes
- [ ] Contains variable changes
- [ ] Contains experimental features
- [ ] Changes MySQL compatibility
Release note
Please refer to Release Notes Language Style Guide to write a quality release note.
None
PR needs rebase.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Codecov Report
Attention: Patch coverage is 83.85650% with 36 lines in your changes missing coverage. Please review.
Project coverage is 61.1317%. Comparing base (
a8a9a04) to head (b819a09). Report is 724 commits behind head on master.
Additional details and impacted files
@@ Coverage Diff @@
## master #47765 +/- ##
================================================
- Coverage 70.8030% 61.1317% -9.6714%
================================================
Files 1464 1792 +328
Lines 436234 712622 +276388
================================================
+ Hits 308867 435638 +126771
- Misses 108065 251746 +143681
- Partials 19302 25238 +5936
| Flag | Coverage Δ | |
|---|---|---|
| integration | 44.2336% <71.7488%> (?) |
|
| unit | 74.8765% <82.5112%> (+4.2746%) |
:arrow_up: |
Flags with carried forward coverage won't be shown. Click here to find out more.
| Components | Coverage Δ | |
|---|---|---|
| dumpling | 57.8032% <ø> (+3.8075%) |
:arrow_up: |
| parser | ∅ <ø> (∅) |
|
| br | 57.5065% <ø> (+11.6041%) |
:arrow_up: |
And some codes like the in-place updates for the heap are optimized for the memory.
/retest
/retest
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: hi-rustin
The full list of commands accepted by this bot can be found here.
The pull request process is described here
- ~~OWNERS~~ [hi-rustin]
Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
[LGTM Timeline notifier]
Timeline:
2024-02-22 09:58:29.271545344 +0000 UTC m=+523998.019168506: :ballot_box_with_check: agreed by hi-rustin.
/test all