nats-server NRG (2.11): Remove stepdown channel, handle inline

NRG (2.11): Remove stepdown channel, handle inline

Open neilalexander opened this issue 1 year ago • 2 comments

The stepdown channel interleaves with other channels such as the apply queue, leader change notifications etc in the runAs goroutines in an unpredictable order, so processing a stepdown request might be delayed behind other work. Doing this inline should be safer with stronger guarantees.

We originally raised this in #4990 (in v2.10.11) and then reverted it in #5200 (in v2.10.12) as it was slowing down some asset moves. This was also visible in the TestJetStreamSuperClusterDoubleStreamMove and TestJetStreamSuperClusterMovingStreamAndMoveBack unit tests.

We believe that the election timer being too eagerly reset was responsible for a slowdown in establishing a new leader and therefore think that the changes in #5315 will have helped here. Both unit tests don't show any particularly different variance in runtimes compared to main.

Signed-off-by: Neil Twigg [email protected]

Apr 24 '24 10:04 neilalexander

This is a correctness change. Right now, because different types of work are queued up in IP queues and handled by the run goroutine in an indeterminate order, it's possible that either the processing of the stepdown is delayed, or the run goroutine doesn't switch state away from running as a leader when it should. I am not 100% sure how to unit test this yet but we should avoid having more asynchronous surfaces than necessary (particularly for state changes) in the NRG code.

Jun 28 '24 17:06 neilalexander

My point is I think we need a test here to show the bad behavior and that this change fixes it..

Jun 28 '24 17:06 derekcollison

I've added a unit test that proves that once StepDown() has returned, that the state is no longer a leader state. This currently fails usually on the first try on main.

Jul 16 '24 14:07 neilalexander

Fixed merge conflicts after other PRs merged.

Jul 22 '24 15:07 neilalexander

nats-server nats-server copied to clipboard

NRG (2.11): Remove stepdown channel, handle inline

nats-server
nats-server copied to clipboard