nats-server icon indicating copy to clipboard operation
nats-server copied to clipboard

NRG (2.11): Remove stepdown channel, handle inline

Open neilalexander opened this issue 1 year ago • 2 comments

The stepdown channel interleaves with other channels such as the apply queue, leader change notifications etc in the runAs goroutines in an unpredictable order, so processing a stepdown request might be delayed behind other work. Doing this inline should be safer with stronger guarantees.


We originally raised this in #4990 (in v2.10.11) and then reverted it in #5200 (in v2.10.12) as it was slowing down some asset moves. This was also visible in the TestJetStreamSuperClusterDoubleStreamMove and TestJetStreamSuperClusterMovingStreamAndMoveBack unit tests.

We believe that the election timer being too eagerly reset was responsible for a slowdown in establishing a new leader and therefore think that the changes in #5315 will have helped here. Both unit tests don't show any particularly different variance in runtimes compared to main.

Signed-off-by: Neil Twigg [email protected]

neilalexander avatar Apr 24 '24 10:04 neilalexander

This is a correctness change. Right now, because different types of work are queued up in IP queues and handled by the run goroutine in an indeterminate order, it's possible that either the processing of the stepdown is delayed, or the run goroutine doesn't switch state away from running as a leader when it should. I am not 100% sure how to unit test this yet but we should avoid having more asynchronous surfaces than necessary (particularly for state changes) in the NRG code.

neilalexander avatar Jun 28 '24 17:06 neilalexander

My point is I think we need a test here to show the bad behavior and that this change fixes it..

derekcollison avatar Jun 28 '24 17:06 derekcollison

I've added a unit test that proves that once StepDown() has returned, that the state is no longer a leader state. This currently fails usually on the first try on main.

neilalexander avatar Jul 16 '24 14:07 neilalexander

Fixed merge conflicts after other PRs merged.

neilalexander avatar Jul 22 '24 15:07 neilalexander