dragonfly icon indicating copy to clipboard operation
dragonfly copied to clipboard

chore: force shutdown connections on takeover

Open kostasrim opened this issue 1 month ago • 7 comments

TakeOver algorithm as is

Imagine the simple setup: node 1 master, node 2 replica and node 2 attempts takeover on node 1.

For node 1:

Steps that affect TakeOver result/state:

  1. Switch state to TAKEN_OVER -> connections will get their commands rejected
  2. Wait for all connections to finish dispatches -> send Checkpoint message on each conn and wait for it to be processed
  • New step here from this PR. Optionally force shutdown all open connections. This does not break data parity. The connection is stuck on send, will break with an error but the transaction data will get replicated anyway (we already wrote to the journal the change).
  1. Wait for the replica to catch up (reach the same LSN as master) -> data parity
  2. Reply with OK. At this point the TAKEOVER is considered complete, node 2 is master.

Optional steps that improve other flows:

  1. Optionally: if there are more nodes registered as replicas, wait for those to catch up. This does not affect the takeover but allows waiting for all of the replicas to fully sync. (Also needed for partial sync data integrity)
  2. Optionally: Take a snapshot
  3. Optionally: for cluster mode do not shutdown such that the node can redirect requests

Notes

  • Dragonfly is still accepting new connections during the takeover. However, once dragonfly switches to TAKEN_OVER state, connections won’t be able to execute commands and they will get “busy loading” error

  • Checkpoints are needed for data integrity. Once all the connections process the checkpoints it means that any other command in their dispatch queue will fail with busy loading. Hence, after step(2) there won’t be any change in the state (storage) and after step(3) the taking over node is on data parity.

This PR:

A step between 2 and 3 to fix non responding connections and force the takeover by forcefully shutting them down.

Follows up #6135. Add a flag experimental_force_takeover. If takeover times out and this flag is set dragonfly will force shutdown open connections such that it doesn't fail the takeover.

Resolves: #6114

kostasrim avatar Dec 01 '25 13:12 kostasrim

I will work and it will be more reliable than what we currently have, I just have this feeling that we overcomplicated because of not changing the message approach. Please see the small comments

However, this case will be fixed once we switch to asynchronous execution of commands

Maybe this will be a good moment to transition to a better approach

dranikpg avatar Dec 01 '25 17:12 dranikpg

It was easiest to base this mechanism just on a new message type, because we already had the queue set up with different types.

Are you suggesting this or was it informational ? I mean, we got Checkpoint which we don't process because we are stuck. Am I missing something ?

kostasrim avatar Dec 02 '25 13:12 kostasrim

@kostasrim can you please write in the PR description the existing algorithm behind takeover? specifically, Is it a stateful flow operation where we tell listener to stop accepting new connections, mark all connections as shutting down and trying to break the existing flows or we just try to break the existing connections while new ones may be added, during this time ? Maybe it's time to revisit the algorithm and to implement something sound?

romange avatar Dec 02 '25 15:12 romange

The rationale a long time ago was really simple: once we set TAKEN_OVER state, not all connections might have finished running transactions. So to keep data integrity/operation atomicity, we have to wait for all of them to finish. Nothing more.

I read it after I wrote my comment and I approve this message :)
What is the process now and what it should actually be - let's start with high level english.

romange avatar Dec 02 '25 15:12 romange

@kostasrim can you please write in the PR description the existing algorithm behind takeover? specifically, Is it a stateful flow operation where we tell listener to stop accepting new connections, mark all connections as shutting down and trying to break the existing flows or we just try to break the existing connections while new ones may be added, during this time ?

I updated the description with the algorithm + some small explanations.

Maybe it's time to revisit the algorithm and to implement something sound?

Yes but plz read what I wrote in "thinking out loud" https://github.com/dragonflydb/dragonfly/issues/6114#issuecomment-3574709795. I would like to revisit this and I am happy to write a small doc internally on what to do but I would first wait for the changes in the connection fiber/pipelining we are implementing. Once that's in place, we will adjust/redesign accordingly. When we become fully asynchronous for example, a Checkpoint message will be processed immediately since now the connection fiber is a multiplexer of different events. However, there might be multiple in-flight commands. Obviously the interactions there are slightly different and I don't want to rush any decisions without having the asynchronicity first set in stone.

My rationale for this PR is to for now have a temporary solution and allow dragonfly to complete the takeover in an intrusive/forceful way until we slowly transition to a new approach.

kostasrim avatar Dec 03 '25 11:12 kostasrim

I added additional logs which I hope will shed more light on why takeover fails BESIDES the stuck send explanation. Btw, stuck send is relatively rare phenomena for us and we have a solution for this which should decrease the impact on takeover drastically. All in all I suggest we take it offline as it's not super urgent to solve now.

romange avatar Dec 03 '25 12:12 romange

@dranikpg hope I didn't forget anything

kostasrim avatar Dec 03 '25 12:12 kostasrim