
Operator might end with the wrong status

Open rleungx opened this issue 1 year ago • 0 comments

Enhancement Task

https://github.com/tikv/pd/blob/d6d9feab3e2a5180acbcc7095723d43e97798686/pkg/schedule/operator/operator_controller.go#L107-L235

If line 232 is executed first, the operator may be moved to the timeout status, because with pending heartbeats it has in fact exceeded its max execution time. The check at line 116 is then skipped because of the timeout status, even though the operator may have been executed successfully on the TiKV side.
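The ordering problem can be reproduced with a toy model. This is only a sketch and not PD's actual code: `Operator`, `CheckTimeout`, `Dispatch`, and `finalStatus` are hypothetical names, with `CheckTimeout` standing in for the timeout path around line 232 and `Dispatch` for the heartbeat-driven check around line 116.

```go
package main

import (
	"fmt"
	"time"
)

// Status is a hypothetical, simplified operator status (PD's real set is larger).
type Status int

const (
	Started Status = iota
	Success
	Timeout
)

func (s Status) String() string {
	switch s {
	case Started:
		return "Started"
	case Success:
		return "Success"
	default:
		return "Timeout"
	}
}

// Operator is a toy stand-in for a scheduling operator.
type Operator struct {
	startAt     time.Time
	maxDuration time.Duration
	status      Status
}

// CheckTimeout models the timeout path: a still-running operator that has
// exceeded its max execution time is marked Timeout.
func (o *Operator) CheckTimeout(now time.Time) {
	if o.status == Started && now.Sub(o.startAt) > o.maxDuration {
		o.status = Timeout
	}
}

// Dispatch models heartbeat handling: an operator already in an end status
// is skipped, otherwise a finished step marks it Success.
func (o *Operator) Dispatch(stepDone bool) {
	if o.status != Started {
		return // skipped because of the end status
	}
	if stepDone {
		o.status = Success
	}
}

// finalStatus runs both checks for a step that TiKV finished, observed
// 62s after start (past the 1m limit), in the given order.
func finalStatus(timeoutFirst bool) Status {
	start := time.Date(2024, 2, 21, 12, 14, 17, 0, time.UTC)
	op := &Operator{startAt: start, maxDuration: time.Minute, status: Started}
	now := start.Add(62 * time.Second)
	if timeoutFirst {
		op.CheckTimeout(now)
		op.Dispatch(true)
	} else {
		op.Dispatch(true)
		op.CheckTimeout(now)
	}
	return op.status
}

func main() {
	fmt.Println("timeout check first:", finalStatus(true))
	fmt.Println("heartbeat first:", finalStatus(false))
}
```

In this model the same completed step ends as `Timeout` when the timeout check runs first and as `Success` when the heartbeat is handled first, which is the mismatch described above.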

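Here is an example in which the promotion is applied on the region (ConfVer 3189 to 3190), yet the operator is reported as timed out: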

Feb 21, 2024 @ 12:14:17.011 [operator_controller.go:443] ["add operator"] [region-id=1199980] [operator="\"fix-peer-role {promote peer: store [2803711]} (kind:unknown, region:1199980(5218, 3189), createAt:2024-02-21 12:14:17.00619947 +0800 CST m=+70594.374713221, startAt:0001-01-01 00:00:00 +0000 UTC, currentStep:0, size:56, steps:[0:{promote learner peer 430111713 on store 2803711 to voter}],timeout:[1m0s])\""] [additional-info=]
Feb 21, 2024 @ 12:15:19.462 [region.go:645] ["region ConfVer changed"] [region-id=1199980] [detail="Remove peer:{id:430111713 store_id:2803711 role:Learner },Add peer:{id:430111713 store_id:2803711 }"] [old-confver=3189] [new-confver=3190]
Feb 21, 2024 @ 12:15:19.468 [operator_controller.go:580] ["operator timeout"] [region-id=1199980] [takes=1m2.45689524s] [operator="\"fix-peer-role {promote peer: store [2803711]} (kind:unknown, region:1199980(5218, 3189), createAt:2024-02-21 12:14:17.00619947 +0800 CST m=+70594.374713221, startAt:2024-02-21 12:14:17.01159741 +0800 CST m=+70594.380111161, currentStep:0, size:56, steps:[0:{promote learner peer 430111713 on store 2803711 to voter}],timeout:[1m0s]) timeout\""] [additional-info=]

rleungx, Feb 22 '24 07:02