merge region reach limit while there is no any operator
Bug Report
What did you do?
merge empty region
What did you expect to see?
What did you see instead?
there has been no operator check, but it still reaches limit
What version of PD are you using (pd-server -V)?
I am not sure whether it is related to evict-slow-store-scheduler
I try to adjust merge-schedule-limit and remove evict-slow-store-scheduler, it works again.
any metrics on TIKV side can be checked and reviewed ?
It occurs again, there is no any merge operator, but it shows schedulers reach limit until merge-scheduler-limit is set to 512
I guess there maybe a statistic problem with the operator count. There are some retain operators in PD.
I try add log, and found operator count is error
if c.mergeChecker != nil {
allowed := opController.OperatorCount(operator.OpMerge) < c.conf.GetMergeScheduleLimit()
if !allowed {
log.Debug("merge check allowed", zap.Int("operators", len(opController.GetOperators())),
zap.Uint64("count", opController.OperatorCount(operator.OpMerge)), zap.Uint64("limit", c.conf.GetMergeScheduleLimit()))
operator.IncOperatorLimitCounter(c.mergeChecker.GetType(), operator.OpMerge)
} else if ops := c.mergeChecker.Check(region); ops != nil {
// It makes sure that two operators can be added successfully altogether.
return ops
}
}
The problem is that promoting the operator may happen at the same time. The load can pass, but when storing the operator, there will be an override. See https://github.com/tikv/pd/blob/8a19f46f377ad8e88d7b80b7c8de8056a3188cc6/pkg/schedule/operator/operator_controller.go#L511-L538
Line 520 might pass, but line 538 will cause the operator to be overridden. Then the counter will be wrong. Finally, it will result in the operator reaching the limit.
[2025/05/05 21:31:38.044 +08:00] [INFO] [operator_controller.go:510] ["add operator"] [region-id=1287780297] [operator="\"merge-region {merge: region 1287780297 to 1287780428} (kind:merge, region:1287780297(46914674, 35375), createAt:2025-05-05 21:31:38.044473516 +0800 CST m=+4497437.839604985, startAt:0001-01-01 00:00:00 +0000 UTC, currentStep:0, size:1, steps:[0:{merge region 1287780297 into region 1287780428}], timeout:[10m0s])\""] [additional-info="{\"related merge region\":\"1287780428\"}"]
[2025/05/05 21:31:38.044 +08:00] [INFO] [operator_controller.go:510] ["add operator"] [region-id=1287780428] [operator="\"merge-region {merge: region 1287780297 to 1287780428} (kind:merge, region:1287780428(46914695, 35375), createAt:2025-05-05 21:31:38.044475644 +0800 CST m=+4497437.839607114, startAt:0001-01-01 00:00:00 +0000 UTC, currentStep:0, size:1, steps:[0:{merge region 1287780297 into region 1287780428}], timeout:[10m0s])\""] [additional-info="{\"related merge region\":\"1287780297\"}"]
[2025/05/05 21:31:38.044 +08:00] [INFO] [operator_controller.go:510] ["add operator"] [region-id=1287780428] [operator="\"merge-region {merge: region 1287780428 to 1287780432} (kind:merge, region:1287780428(46914695, 35375), createAt:2025-05-05 21:31:38.044549893 +0800 CST m=+4497437.839681364, startAt:0001-01-01 00:00:00 +0000 UTC, currentStep:0, size:1, steps:[0:{merge region 1287780428 into region 1287780432}], timeout:[10m0s])\""] [additional-info="{\"related merge region\":\"1287780432\"}"]
[2025/05/05 21:31:38.044 +08:00] [INFO] [operator_controller.go:510] ["add operator"] [region-id=1287780432] [operator="\"merge-region {merge: region 1287780428 to 1287780432} (kind:merge, region:1287780432(46914695, 35375), createAt:2025-05-05 21:31:38.044550709 +0800 CST m=+4497437.839682179, startAt:0001-01-01 00:00:00 +0000 UTC, currentStep:0, size:1, steps:[0:{merge region 1287780428 into region 1287780432}], timeout:[10m0s])\""] [additional-info="{\"related merge region\":\"1287780428\"}"]
[2025/05/05 21:31:38.047 +08:00] [INFO] [operator_controller.go:667] ["operator canceled related merge region"] [region-id=1287780428] [additional-info="{\"related merge region\":\"1287780297\"}"] [takes=2.89834ms]
[2025/05/05 21:31:38.047 +08:00] [INFO] [operator_controller.go:667] ["operator canceled related merge region"] [region-id=1287780297] [additional-info="{\"related merge region\":\"1287780428\"}"] [takes=2.982699ms]
[2025/05/05 21:31:38.047 +08:00] [INFO] [operator_controller.go:725] ["operator canceled"] [region-id=1287780297] [takes=3.000114ms] [operator="\"merge-region {merge: region 1287780297 to 1287780428} (kind:merge, region:1287780297(46914674, 35375), createAt:2025-05-05 21:31:38.044473516 +0800 CST m=+4497437.839604985, startAt:2025-05-05 21:31:38.044515398 +0800 CST m=+4497437.839646868, currentStep:0, size:1, steps:[0:{merge region 1287780297 into region 1287780428}], timeout:[10m0s])\""] [additional-info="{\"cancel-reason\":\"related merge region\",\"related merge region\":\"1287780428\"}"]
[2025/05/05 21:31:38.047 +08:00] [INFO] [operator_controller.go:725] ["operator canceled"] [region-id=1287780428] [takes=2.98589ms] [operator="\"merge-region {merge: region 1287780297 to 1287780428} (kind:merge, region:1287780428(46914695, 35375), createAt:2025-05-05 21:31:38.044475644 +0800 CST m=+4497437.839607114, startAt:2025-05-05 21:31:38.044559017 +0800 CST m=+4497437.839690487, currentStep:0, size:1, steps:[0:{merge region 1287780297 into region 1287780428}], timeout:[10m0s])\""] [additional-info="{\"cancel-reason\":\"related merge region\",\"related merge region\":\"1287780297\"}"]
[2025/05/05 21:31:38.047 +08:00] [INFO] [operator_controller.go:694] ["operator finish"] [region-id=1287780432] [takes=2.917532ms] [operator="\"merge-region {merge: region 1287780428 to 1287780432} (kind:merge, region:1287780432(46914695, 35375), createAt:2025-05-05 21:31:38.044550709 +0800 CST m=+4497437.839682179, startAt:2025-05-05 21:31:38.044647754 +0800 CST m=+4497437.839779224, currentStep:1, size:1, steps:[0:{merge region 1287780428 into region 1287780432}], timeout:[10m0s]) finished\""] [additional-info="{\"cancel-reason\":\"\",\"related merge region\":\"1287780428\"}"]
For example, 4 operators were added, and only 3 of them were canceled or finished. The issue is introduced by https://github.com/tikv/pd/pull/8032. It will be triggered if A is going to be merged to B and at the same time B is going to be merged to C.
Regression Analysis PR caused this regression: https://github.com/tikv/pd/pull/8032 Related to issue/feature: https://github.com/tikv/pd/issues/7897