pd icon indicating copy to clipboard operation
pd copied to clipboard

merge region reach limit while there is no any operator

Open lhy1024 opened this issue 1 year ago • 4 comments

Bug Report

What did you do?

merge empty region

What did you expect to see?

What did you see instead?

there has been no operator check, but it still reaches limit

image

image

What version of PD are you using (pd-server -V)?

lhy1024 avatar Jul 26 '24 11:07 lhy1024

I am not sure whether it is related to evict-slow-store-scheduler

I try to adjust merge-schedule-limit and remove evict-slow-store-scheduler, it works again.

lhy1024 avatar Jul 26 '24 11:07 lhy1024

any metrics on TIKV side can be checked and reviewed ?

LykxSassinator avatar Aug 12 '24 09:08 LykxSassinator

It occurs again, there is no any merge operator, but it shows schedulers reach limit until merge-scheduler-limit is set to 512

img_v3_02e6_38502603-0e43-4763-bb40-dc3dc273457g

I guess there maybe a statistic problem with the operator count. There are some retain operators in PD.

lhy1024 avatar Aug 28 '24 06:08 lhy1024

Image Image

I try add log, and found operator count is error

if c.mergeChecker != nil {
		allowed := opController.OperatorCount(operator.OpMerge) < c.conf.GetMergeScheduleLimit()
		if !allowed {
			log.Debug("merge check allowed", zap.Int("operators", len(opController.GetOperators())),
				zap.Uint64("count", opController.OperatorCount(operator.OpMerge)), zap.Uint64("limit", c.conf.GetMergeScheduleLimit()))
			operator.IncOperatorLimitCounter(c.mergeChecker.GetType(), operator.OpMerge)
		} else if ops := c.mergeChecker.Check(region); ops != nil {
			// It makes sure that two operators can be added successfully altogether.
			return ops
		}
	}

Image

lhy1024 avatar Dec 03 '24 15:12 lhy1024

The problem is that promoting the operator may happen at the same time. The load can pass, but when storing the operator, there will be an override. See https://github.com/tikv/pd/blob/8a19f46f377ad8e88d7b80b7c8de8056a3188cc6/pkg/schedule/operator/operator_controller.go#L511-L538

Line 520 might pass, but line 538 will cause the operator to be overridden. Then the counter will be wrong. Finally, it will result in the operator reaching the limit.

rleungx avatar May 06 '25 08:05 rleungx

[2025/05/05 21:31:38.044 +08:00] [INFO] [operator_controller.go:510] ["add operator"] [region-id=1287780297] [operator="\"merge-region {merge: region 1287780297 to 1287780428} (kind:merge, region:1287780297(46914674, 35375), createAt:2025-05-05 21:31:38.044473516 +0800 CST m=+4497437.839604985, startAt:0001-01-01 00:00:00 +0000 UTC, currentStep:0, size:1, steps:[0:{merge region 1287780297 into region 1287780428}], timeout:[10m0s])\""] [additional-info="{\"related merge region\":\"1287780428\"}"]
[2025/05/05 21:31:38.044 +08:00] [INFO] [operator_controller.go:510] ["add operator"] [region-id=1287780428] [operator="\"merge-region {merge: region 1287780297 to 1287780428} (kind:merge, region:1287780428(46914695, 35375), createAt:2025-05-05 21:31:38.044475644 +0800 CST m=+4497437.839607114, startAt:0001-01-01 00:00:00 +0000 UTC, currentStep:0, size:1, steps:[0:{merge region 1287780297 into region 1287780428}], timeout:[10m0s])\""] [additional-info="{\"related merge region\":\"1287780297\"}"]
[2025/05/05 21:31:38.044 +08:00] [INFO] [operator_controller.go:510] ["add operator"] [region-id=1287780428] [operator="\"merge-region {merge: region 1287780428 to 1287780432} (kind:merge, region:1287780428(46914695, 35375), createAt:2025-05-05 21:31:38.044549893 +0800 CST m=+4497437.839681364, startAt:0001-01-01 00:00:00 +0000 UTC, currentStep:0, size:1, steps:[0:{merge region 1287780428 into region 1287780432}], timeout:[10m0s])\""] [additional-info="{\"related merge region\":\"1287780432\"}"]
[2025/05/05 21:31:38.044 +08:00] [INFO] [operator_controller.go:510] ["add operator"] [region-id=1287780432] [operator="\"merge-region {merge: region 1287780428 to 1287780432} (kind:merge, region:1287780432(46914695, 35375), createAt:2025-05-05 21:31:38.044550709 +0800 CST m=+4497437.839682179, startAt:0001-01-01 00:00:00 +0000 UTC, currentStep:0, size:1, steps:[0:{merge region 1287780428 into region 1287780432}], timeout:[10m0s])\""] [additional-info="{\"related merge region\":\"1287780428\"}"]
[2025/05/05 21:31:38.047 +08:00] [INFO] [operator_controller.go:667] ["operator canceled related merge region"] [region-id=1287780428] [additional-info="{\"related merge region\":\"1287780297\"}"] [takes=2.89834ms]
[2025/05/05 21:31:38.047 +08:00] [INFO] [operator_controller.go:667] ["operator canceled related merge region"] [region-id=1287780297] [additional-info="{\"related merge region\":\"1287780428\"}"] [takes=2.982699ms]
[2025/05/05 21:31:38.047 +08:00] [INFO] [operator_controller.go:725] ["operator canceled"] [region-id=1287780297] [takes=3.000114ms] [operator="\"merge-region {merge: region 1287780297 to 1287780428} (kind:merge, region:1287780297(46914674, 35375), createAt:2025-05-05 21:31:38.044473516 +0800 CST m=+4497437.839604985, startAt:2025-05-05 21:31:38.044515398 +0800 CST m=+4497437.839646868, currentStep:0, size:1, steps:[0:{merge region 1287780297 into region 1287780428}], timeout:[10m0s])\""] [additional-info="{\"cancel-reason\":\"related merge region\",\"related merge region\":\"1287780428\"}"]
[2025/05/05 21:31:38.047 +08:00] [INFO] [operator_controller.go:725] ["operator canceled"] [region-id=1287780428] [takes=2.98589ms] [operator="\"merge-region {merge: region 1287780297 to 1287780428} (kind:merge, region:1287780428(46914695, 35375), createAt:2025-05-05 21:31:38.044475644 +0800 CST m=+4497437.839607114, startAt:2025-05-05 21:31:38.044559017 +0800 CST m=+4497437.839690487, currentStep:0, size:1, steps:[0:{merge region 1287780297 into region 1287780428}], timeout:[10m0s])\""] [additional-info="{\"cancel-reason\":\"related merge region\",\"related merge region\":\"1287780297\"}"]
[2025/05/05 21:31:38.047 +08:00] [INFO] [operator_controller.go:694] ["operator finish"] [region-id=1287780432] [takes=2.917532ms] [operator="\"merge-region {merge: region 1287780428 to 1287780432} (kind:merge, region:1287780432(46914695, 35375), createAt:2025-05-05 21:31:38.044550709 +0800 CST m=+4497437.839682179, startAt:2025-05-05 21:31:38.044647754 +0800 CST m=+4497437.839779224, currentStep:1, size:1, steps:[0:{merge region 1287780428 into region 1287780432}], timeout:[10m0s]) finished\""] [additional-info="{\"cancel-reason\":\"\",\"related merge region\":\"1287780428\"}"]

For example, 4 operators were added, and only 3 of them were canceled or finished. The issue is introduced by https://github.com/tikv/pd/pull/8032. It will be triggered if A is going to be merged to B and at the same time B is going to be merged to C.

rleungx avatar May 06 '25 08:05 rleungx

Regression Analysis PR caused this regression: https://github.com/tikv/pd/pull/8032 Related to issue/feature: https://github.com/tikv/pd/issues/7897

jebter avatar Jul 04 '25 09:07 jebter