chore: force shutdown connections on takeover
TakeOver algorithm as is
Imagine the simple setup: node 1 is the master, node 2 is a replica, and node 2 attempts a takeover of node 1.
For node 1:
Steps that affect the TakeOver result/state (a sketch follows the lists below):
- Switch state to TAKEN_OVER -> connections will get their commands rejected
- Wait for all connections to finish dispatches -> send a Checkpoint message on each connection and wait for it to be processed
- New step from this PR: optionally force shutdown all open connections. This does not break data parity. A connection stuck on send will break with an error, but the transaction data gets replicated anyway (the change was already written to the journal).
- Wait for the replica to catch up (reach the same LSN as the master) -> data parity
- Reply with OK. At this point the takeover is considered complete and node 2 is the master.
Optional steps that improve other flows:
- Optionally: if there are more nodes registered as replicas, wait for those to catch up. This does not affect the takeover but allows waiting for all of the replicas to fully sync. (Also needed for partial-sync data integrity.)
- Optionally: take a snapshot.
- Optionally: in cluster mode do not shut down, so that the node can redirect requests.
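
To make the flow above concrete, here is a minimal C++ sketch of the takeover sequence on node 1. All names (`GlobalState`, `Connection`, `Master::Takeover`, the helpers) are illustrative placeholders rather than Dragonfly's actual symbols; the force-shutdown branch corresponds to the new optional step from this PR.

```cpp
#include <chrono>
#include <vector>

// Illustrative sketch only: these are not Dragonfly's real types or functions.
enum class GlobalState { ACTIVE, TAKEN_OVER };

struct Connection {
  bool checkpoint_processed = false;
  // Stand-ins for the real dispatch-queue machinery.
  void SendCheckpoint() { checkpoint_processed = true; }
  void ForceShutdown() {}  // the new optional step from this PR
};

struct Master {
  GlobalState state = GlobalState::ACTIVE;
  std::vector<Connection> connections;

  bool Takeover(bool force, std::chrono::milliseconds timeout) {
    // (1) Switch state: from now on commands are rejected ("busy loading").
    state = GlobalState::TAKEN_OVER;

    // (2) Drain in-flight dispatches: each connection must process a Checkpoint.
    for (auto& conn : connections) conn.SendCheckpoint();

    // (2.5) New in this PR: if draining does not finish in time and force mode
    // is enabled, break stuck connections instead of failing the takeover.
    if (force && !AllCheckpointsProcessedWithin(timeout)) {
      for (auto& conn : connections) conn.ForceShutdown();
    }

    // (3) Wait for the replica to reach the master's LSN -> data parity.
    WaitForReplicaCatchUp();

    // (4) Reply OK; node 2 is now the master.
    return true;
  }

  // Trivial stubs so the sketch is self-contained; the real checks differ.
  bool AllCheckpointsProcessedWithin(std::chrono::milliseconds) { return true; }
  void WaitForReplicaCatchUp() {}
};
```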
Notes
- Dragonfly is still accepting new connections during the takeover. However, once Dragonfly switches to the TAKEN_OVER state, connections won't be able to execute commands and will get a "busy loading" error.
- Checkpoints are needed for data integrity. Once all the connections have processed the checkpoints, any other command in their dispatch queues will fail with "busy loading". Hence, after step (2) there won't be any change in the state (storage), and after step (3) the taking-over node has data parity.
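
As an illustration of the checkpoint note, here is a hedged sketch of how a per-connection dispatch queue can use a Checkpoint marker to separate in-flight commands from commands that will only ever see the rejection. The types and the exact error string are made up for the example and are not Dragonfly's real ones.

```cpp
#include <deque>
#include <string>
#include <utility>
#include <variant>

// Illustrative types only.
struct Command { std::string name; };
struct Checkpoint {};  // processing it tells the takeover flow this connection is drained
using DispatchItem = std::variant<Command, Checkpoint>;

bool g_taken_over = false;  // set in step (1) of the takeover

// Drains the queue; every command dispatched after the state switch is
// rejected, so nothing behind the checkpoint can mutate storage.
std::string Dispatch(std::deque<DispatchItem>& queue) {
  std::string last_reply;
  while (!queue.empty()) {
    DispatchItem item = std::move(queue.front());
    queue.pop_front();
    if (std::holds_alternative<Checkpoint>(item)) {
      // Here the real code would signal the takeover fiber and keep going.
      continue;
    }
    last_reply = g_taken_over ? "-LOADING ..." /* "busy loading" style error */
                              : "+OK";
  }
  return last_reply;
}
```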
This PR:
A step between (2) and (3) that handles non-responding connections and forces the takeover by forcefully shutting them down.
Follows up on #6135. Adds a flag `experimental_force_takeover`. If the takeover times out and this flag is set, Dragonfly will force shutdown open connections so that the takeover does not fail.
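
A minimal sketch of how such a gate could look, assuming an Abseil-style flag as used elsewhere in Dragonfly; only the flag name comes from this PR, while the surrounding function and helper names are hypothetical.

```cpp
#include <chrono>

#include "absl/flags/flag.h"

// The flag name is from this PR; the default value and description are illustrative.
ABSL_FLAG(bool, experimental_force_takeover, false,
          "If set, force shutdown open connections when a takeover times out "
          "instead of failing the takeover.");

// Hypothetical helpers, stubbed so the sketch is self-contained.
bool DrainConnections(std::chrono::milliseconds /*deadline*/) { return true; }
void ForceShutdownAllConnections() {}

bool FinishTakeover(std::chrono::milliseconds deadline) {
  if (!DrainConnections(deadline)) {  // not all connections processed their Checkpoint
    if (!absl::GetFlag(FLAGS_experimental_force_takeover))
      return false;                   // previous behavior: the takeover fails
    ForceShutdownAllConnections();    // new behavior: break stuck connections and continue
  }
  // ... continue with replica catch-up and the OK reply.
  return true;
}
```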
Resolves: #6114
It will work and it will be more reliable than what we currently have; I just have a feeling that we overcomplicated it by not changing the message approach. Please see the small comments.
However, this case will be fixed once we switch to asynchronous execution of commands
Maybe this will be a good moment to transition to a better approach
It was easiest to base this mechanism just on a new message type, because we already had the queue set up with different types.
Are you suggesting this, or was it informational? I mean, we have Checkpoint, which we don't process because we are stuck. Am I missing something?
@kostasrim can you please write in the PR description the existing algorithm behind takeover? Specifically, is it a stateful flow where we tell the listener to stop accepting new connections, mark all connections as shutting down, and try to break the existing flows, or do we just try to break the existing connections while new ones may be added during this time? Maybe it's time to revisit the algorithm and implement something sound?
The rationale a long time ago was really simple: once we set the TAKEN_OVER state, not all connections may have finished their running transactions. So to keep data integrity/operation atomicity, we have to wait for all of them to finish. Nothing more.
I read it after I wrote my comment and I approve this message :)
What is the process now and what should it actually be? Let's start with high-level English.
@kostasrim can you please write in the PR description the existing algorithm behind takeover? Specifically, is it a stateful flow where we tell the listener to stop accepting new connections, mark all connections as shutting down, and try to break the existing flows, or do we just try to break the existing connections while new ones may be added during this time?
I updated the description with the algorithm + some small explanations.
Maybe it's time to revisit the algorithm and to implement something sound?
Yes, but please read what I wrote in "thinking out loud" https://github.com/dragonflydb/dragonfly/issues/6114#issuecomment-3574709795. I would like to revisit this, and I am happy to write a small doc internally on what to do, but I would first wait for the changes in the connection fiber/pipelining we are implementing. Once that's in place, we will adjust/redesign accordingly. When we become fully asynchronous, for example, a Checkpoint message will be processed immediately, since the connection fiber then becomes a multiplexer of different events. However, there might be multiple in-flight commands. Obviously the interactions there are slightly different, and I don't want to rush any decisions without having the asynchronicity first set in stone.
My rationale for this PR is to have, for now, a temporary solution that allows Dragonfly to complete the takeover in an intrusive/forceful way until we slowly transition to a new approach.
I added additional logs which I hope will shed more light on why takeover fails BESIDES the stuck-send explanation. Btw, stuck send is a relatively rare phenomenon for us, and we have a solution for it which should decrease the impact on takeover drastically. All in all, I suggest we take it offline as it's not super urgent to solve now.
@dranikpg hope I didn't forget anything