liftbridge
Progressive shutdown - shedding
Currently, when a Liftbridge server shuts down, it stops being leader for its partitions. If many partitions exist, this results in a flurry of Raft events. Would it be possible to trigger a progressive shutdown to prevent this? Have you given this any thought, @tylertreat?
Yes, this is something I've thought a bit about, especially as it relates to rolling cluster upgrades. I think a graceful shutdown would make sense. There would be a few components to this:
- If the server is leader for any partitions, transfer leadership to another replica (invoke a `ChangeLeaderOp` in Raft) and remove self from ISR (`ShrinkISROp`). This should be done gradually to avoid a flood of Raft ops. Also interrupt any clients currently subscribed.
- If the server is follower for any partitions, remove self from ISR (`ShrinkISROp`). This should be done gradually to avoid a flood of Raft ops. Also interrupt any clients currently subscribed.
- At this point, probably reject any client requests, e.g. publish or subscribe.
- If the server shutting down is the metadata leader, transfer leadership to another node. Perform a Raft barrier to ensure all preceding Raft ops have been applied.
- Remove self from Raft group. Need to think through how this works when rejoining, e.g. in the case of restarting/upgrading a node.
- Shut down the server.