liftbridge Progressive shutdown

Progressive shutdown - shedding

Open • Jmgr opened this issue 4 years ago • 1 comment

Currently, when a Liftbridge server shuts down, it stops being a leader for its partitions. If many partitions exist, that will result in a flurry of Raft events. Would it be possible to trigger a progressive shutdown to prevent this? Have you given this some thought, @tylertreat?

Jan 20 '21 14:01 Jmgr

Yes, this is something I've thought a bit about, especially as it relates to rolling cluster upgrades. I think a graceful shutdown would make sense. There would be a few components to this:

1. If the server is leader for any partitions, transfer leadership to another replica (invoke a ChangeLeaderOp in Raft) and remove self from the ISR (ShrinkISROp). This should be done gradually to avoid a flood of Raft ops. Also interrupt any clients currently subscribed.
2. If the server is a follower for any partitions, remove self from the ISR (ShrinkISROp). This should be done gradually to avoid a flood of Raft ops. Also interrupt any clients currently subscribed.
3. At this point, probably reject any client requests, e.g. publish or subscribe.
4. If the server shutting down is the metadata leader, transfer leadership to another node. Perform a Raft barrier to ensure all preceding Raft ops have been applied.
5. Remove self from the Raft group. Need to think through how this works when rejoining, e.g. in the case of restarting/upgrading a node.
6. Shut down the server.

Jan 20 '21 21:01 tylertreat
