nats-server
nats-server copied to clipboard
NRG (2.11): Remove stepdown channel, handle inline
The stepdown channel interleaves with other channels such as the apply queue, leader change notifications etc in the runAs goroutines in an unpredictable order, so processing a stepdown request might be delayed behind other work. Doing this inline should be safer with stronger guarantees.
We originally raised this in #4990 (in v2.10.11) and then reverted it in #5200 (in v2.10.12) as it was slowing down some asset moves. This was also visible in the TestJetStreamSuperClusterDoubleStreamMove and TestJetStreamSuperClusterMovingStreamAndMoveBack unit tests.
We believe that the election timer being too eagerly reset was responsible for a slowdown in establishing a new leader and therefore think that the changes in #5315 will have helped here. Both unit tests don't show any particularly different variance in runtimes compared to main.
Signed-off-by: Neil Twigg [email protected]
This is a correctness change. Right now, because different types of work are queued up in IP queues and handled by the run goroutine in an indeterminate order, it's possible that either the processing of the stepdown is delayed, or the run goroutine doesn't switch state away from running as a leader when it should. I am not 100% sure how to unit test this yet but we should avoid having more asynchronous surfaces than necessary (particularly for state changes) in the NRG code.
My point is I think we need a test here to show the bad behavior and that this change fixes it..
I've added a unit test that proves that once StepDown() has returned, that the state is no longer a leader state. This currently fails usually on the first try on main.
Fixed merge conflicts after other PRs merged.