Operator might end with the wrong status
Enhancement Task
https://github.com/tikv/pd/blob/d6d9feab3e2a5180acbcc7095723d43e97798686/pkg/schedule/operator/operator_controller.go#L107-L235
If line 232 is executed first, the operator may be moved to the timeout status: when heartbeats are pending, the operator has indeed exceeded its max execution time by the time it is checked. The later call at line 116 is then skipped because of the timeout status, even though the operator may actually have been executed successfully on the TiKV side.
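Here is an example. The ConfVer change that promotes the learner is logged 6 ms before the operator is reported as timed out: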
Feb 21, 2024 @ 12:14:17.011 [operator_controller.go:443] ["add operator"] [region-id=1199980] [operator="\"fix-peer-role {promote peer: store [2803711]} (kind:unknown, region:1199980(5218, 3189), createAt:2024-02-21 12:14:17.00619947 +0800 CST m=+70594.374713221, startAt:0001-01-01 00:00:00 +0000 UTC, currentStep:0, size:56, steps:[0:{promote learner peer 430111713 on store 2803711 to voter}],timeout:[1m0s])\""] [additional-info=]
Feb 21, 2024 @ 12:15:19.462 [region.go:645] ["region ConfVer changed"] [region-id=1199980] [detail="Remove peer:{id:430111713 store_id:2803711 role:Learner },Add peer:{id:430111713 store_id:2803711 }"] [old-confver=3189] [new-confver=3190]
Feb 21, 2024 @ 12:15:19.468 [operator_controller.go:580] ["operator timeout"] [region-id=1199980] [takes=1m2.45689524s] [operator="\"fix-peer-role {promote peer: store [2803711]} (kind:unknown, region:1199980(5218, 3189), createAt:2024-02-21 12:14:17.00619947 +0800 CST m=+70594.374713221, startAt:2024-02-21 12:14:17.01159741 +0800 CST m=+70594.380111161, currentStep:0, size:56, steps:[0:{promote learner peer 430111713 on store 2803711 to voter}],timeout:[1m0s]) timeout\""] [additional-info=]
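The ordering problem can be illustrated with a minimal Go sketch. This is a simplified model, not PD's actual code: `Operator`, `Heartbeat`, `checkTimeout`, `checkFinished`, and `resolve` are hypothetical names, and the values (1m timeout, about 62s elapsed, ConfVer 3190) are taken from the log above. It assumes the timeout check looks only at elapsed time, and that the completion check is skipped once the operator has left the running state.

```go
package main

import (
	"fmt"
	"time"
)

// Status is a hypothetical, simplified operator status.
type Status int

const (
	Running Status = iota
	Success
	Timeout
)

// Operator is a hypothetical stand-in for a single-step operator.
type Operator struct {
	StartAt       time.Time
	MaxDuration   time.Duration
	TargetConfVer uint64 // ConfVer the region reaches once the step is applied
	Status        Status
}

// Heartbeat is a hypothetical region heartbeat carrying the observed ConfVer.
type Heartbeat struct {
	ConfVer uint64
}

// checkTimeout models the timeout path: it only looks at the clock.
func (op *Operator) checkTimeout(now time.Time) {
	if op.Status == Running && now.Sub(op.StartAt) > op.MaxDuration {
		op.Status = Timeout
	}
}

// checkFinished models the heartbeat path: it is skipped unless the
// operator is still running.
func (op *Operator) checkFinished(hb Heartbeat) {
	if op.Status != Running {
		return
	}
	if hb.ConfVer >= op.TargetConfVer {
		op.Status = Success
	}
}

// resolve applies both checks in the requested order and returns the
// final status of the operator.
func resolve(timeoutFirst bool) Status {
	start := time.Date(2024, 2, 21, 12, 14, 17, 0, time.UTC)
	now := start.Add(62 * time.Second) // just past the 1m timeout
	op := &Operator{StartAt: start, MaxDuration: time.Minute, TargetConfVer: 3190}
	hb := Heartbeat{ConfVer: 3190} // pending heartbeat: the promotion already applied

	if timeoutFirst {
		op.checkTimeout(now)
		op.checkFinished(hb)
	} else {
		op.checkFinished(hb)
		op.checkTimeout(now)
	}
	return op.Status
}

func main() {
	fmt.Println(resolve(true) == Timeout)  // timeout check wins the race
	fmt.Println(resolve(false) == Success) // heartbeat processed first
}
```

In this model the same operator and the same heartbeat give two different final statuses depending only on which check runs first. One possible direction, offered as a suggestion only, is to consult any pending heartbeat for the region before marking the operator as timed out.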