
Drop connected clients when instance is no longer master

Open georgeboot opened this issue 7 months ago • 2 comments

Previously reported in https://github.com/dragonflydb/dragonfly/issues/5160

Describe the bug

We encountered a situation where Dragonfly decided to switch masters (without any real reason as far as I can tell, but that's another topic). We are using the operator in Kubernetes. Our apps connect to Dragonfly via the service, so Kubernetes routes the traffic to the current master.

When the master changes, open connections don't magically re-route to the new master but instead stay connected to the now no-longer-master. Because this instance is now read-only, writing to it will fail:

READONLY You can't write against a read only replica.

Most of the time, a failover occurs because the old master is no longer alive, which already kills open connections. But in a "planned switchover" (or whatever Dragonfly calls this internally), the old master remains intact and connected.
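As a client-side stopgap, an app can treat a READONLY reply as a signal to drop its connection and reconnect through the service. The following is only a minimal sketch of that idea: `connect` is a hypothetical factory returning an object with `execute(*args)` and `close()`, and it assumes the client surfaces the error text in the exception message. It does not reflect the API of any particular client library.

```python
from typing import Callable, Protocol


class Conn(Protocol):
    """Hypothetical connection interface, used for illustration only."""

    def execute(self, *args: str) -> object: ...
    def close(self) -> None: ...


def is_readonly_reply(message: str) -> bool:
    # The error reported in this issue starts with the READONLY code,
    # optionally preceded by the RESP error marker "-".
    return message.lstrip("-").startswith("READONLY")


class ReconnectingClient:
    """Reconnects when the current connection turns out to be a replica."""

    def __init__(self, connect: Callable[[], Conn], max_retries: int = 1):
        self._connect = connect
        self._max_retries = max_retries
        self._conn = connect()

    def execute(self, *args: str) -> object:
        for attempt in range(self._max_retries + 1):
            try:
                return self._conn.execute(*args)
            except Exception as exc:
                # Re-raise unrelated errors, or READONLY once retries run out.
                if not is_readonly_reply(str(exc)) or attempt == self._max_retries:
                    raise
                # Drop the stale connection; a new one goes through the
                # service again and should land on the current master.
                self._conn.close()
                self._conn = self._connect()
```

This only hides the symptom in the app; the write that hit the old master is retried once on the new connection, so it suits idempotent commands best.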

To Reproduce

Steps to reproduce the behavior:

  1. Deploy the operator with 2 or more replicas
  2. Connect any app via the service and have it try to write every second.
  3. Trigger a switchover without killing the present master
  4. See error
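For step 2, a throwaway writer along these lines should be enough. It is a sketch, not the app we actually run: it speaks raw RESP over one long-lived TCP connection so that no client library reconnects behind the scenes. The function names, the `probe` key, and the `iterations` parameter are made up for illustration.

```python
import socket
import time


def encode_command(*args: str) -> bytes:
    """Encode a command as a RESP array of bulk strings."""
    parts = [f"*{len(args)}\r\n".encode()]
    for arg in args:
        data = arg.encode()
        parts.append(b"$" + str(len(data)).encode() + b"\r\n" + data + b"\r\n")
    return b"".join(parts)


def write_every_second(host: str, port: int = 6379, iterations: int = 0) -> None:
    """Send SET once per second over ONE connection; 0 iterations = forever."""
    with socket.create_connection((host, port)) as sock:
        count = 0
        while iterations == 0 or count < iterations:
            sock.sendall(encode_command("SET", "probe", str(count)))
            # A single recv is enough for these short replies in a sketch.
            reply = sock.recv(4096).decode(errors="replace").strip()
            # After the switchover, the old master is expected to answer
            # with the READONLY error instead of a success reply.
            print(time.strftime("%H:%M:%S"), reply)
            count += 1
            time.sleep(1)
```

Run it inside the cluster with the Dragonfly service name as `host`, then trigger the switchover and watch the replies change.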

Expected behavior

If Dragonfly does a role change, it should give connected clients some indication that they need to reconnect in order to talk to the new master. A simple disconnect would be very effective.

As suggested by @romange:

The operator can send "client kill" command
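One way this could look, sketched in Python rather than in the operator's own code: after demoting the old master, list its clients and kill each one by ID, skipping addresses that must stay connected (for example the replication peer). This assumes the Redis-style `CLIENT LIST` output of space-separated `key=value` fields and the `CLIENT KILL ID <id>` form; I have not verified which `CLIENT KILL` arguments Dragonfly 1.30.0 accepts. The function names and `keep_addrs` are hypothetical.

```python
def parse_client_list(reply: str) -> list[dict[str, str]]:
    """Parse CLIENT LIST output: one client per line, key=value fields."""
    clients = []
    for line in reply.splitlines():
        fields = dict(f.split("=", 1) for f in line.split() if "=" in f)
        if "id" in fields:
            clients.append(fields)
    return clients


def kill_commands(
    reply: str, keep_addrs: frozenset[str] = frozenset()
) -> list[tuple[str, ...]]:
    """Build one CLIENT KILL ID command per client not in keep_addrs."""
    return [
        ("CLIENT", "KILL", "ID", client["id"])
        for client in parse_client_list(reply)
        if client.get("addr") not in keep_addrs
    ]
```

The returned tuples would then be sent to the demoted instance over the operator's own connection, whose address should be included in `keep_addrs`.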

Environment (please complete the following information):

  • Containerized?: Kubernetes via Operator
  • Dragonfly Version: 1.30.0

Logs from master replica

I20250520 12:36:00.022531    11 save_stages_controller.cc:342] Saving "/dragonfly/snapshots/snapshot-summary.dfs" finished after 1 s
I20250520 12:37:00.020418    11 save_stages_controller.cc:342] Saving "/dragonfly/snapshots/snapshot-summary.dfs" finished after 1 s
I20250520 12:38:00.019855    11 save_stages_controller.cc:342] Saving "/dragonfly/snapshots/snapshot-summary.dfs" finished after 1 s
I20250520 12:39:00.018941    11 save_stages_controller.cc:342] Saving "/dragonfly/snapshots/snapshot-summary.dfs" finished after 1 s
I20250520 12:39:08.867802    11 dflycmd.cc:127] Disconnecting from replica 10.244.0.57:6379
W20250520 12:39:08.867852    11 common.cc:413] ReportError: Operation canceled: ExecutionState cancelled
I20250520 12:39:08.867894    11 dflycmd.cc:686] Replication error: Operation canceled: ExecutionState cancelled
I20250520 12:39:09.960482    11 server_family.cc:3009] Replicating 10.244.0.57:9999
I20250520 12:39:09.964658    11 replica.cc:580] Started full sync with 10.244.0.57:9999
I20250520 12:39:09.989066    11 replica.cc:600] full sync finished in 26 ms
I20250520 12:39:09.989102    11 replica.cc:690] Transitioned into stable sync
I20250520 12:40:00.019615    11 save_stages_controller.cc:342] Saving "/dragonfly/snapshots/snapshot-summary.dfs" finished after 1 s
I20250520 12:41:00.019866    11 save_stages_controller.cc:342] Saving "/dragonfly/snapshots/snapshot-summary.dfs" finished after 1 s
I20250520 12:42:00.017843    11 save_stages_controller.cc:342] Saving "/dragonfly/snapshots/snapshot-summary.dfs" finished after 1 s
I20250520 12:43:00.018100    11 save_stages_controller.cc:342] Saving "/dragonfly/snapshots/snapshot-summary.dfs" finished after 1 s
W20250520 12:43:28.733081    11 script_mgr.cc:328] Error running script (call to a08c9c69b7c07ae2485190873b90d128a23e502d): @user_script:2: -READONLY You can't write against a read only replica.
W20250520 12:43:28.733228    11 main_service.cc:1363]  EVAL return redis.call(\'exists\',KEYS[1])<1 and redis.call(\'setex\',KEYS[1],ARGV[2],ARGV[1]) 1 ";N;} 900 failed with reason: Error running script (call to a08c9c69b7c07ae2485190873b90d128a23e502d): @user_script:2: -READONLY You can't write against a read only replica.
W20250520 12:43:36.952080    11 script_mgr.cc:328] Error running script (call to a08c9c69b7c07ae2485190873b90d128a23e502d): @user_script:2: -READONLY You can't write against a read only replica.
W20250520 12:43:36.952189    11 main_service.cc:1363]  EVAL return redis.call(\'exists\',KEYS[1])<1 and redis.call(\'setex\',KEYS[1],ARGV[2],ARGV[1]) 1 ";N;} 900 failed with reason: Error running script (call to a08c9c69b7c07ae2485190873b90d128a23e502d): @user_script:2: -READONLY You can't write against a read only replica.
I20250520 12:44:00.018446    11 save_stages_controller.cc:342] Saving "/dragonfly/snapshots/snapshot-summary.dfs" finished after 1 s
I20250520 12:45:00.018730    11 save_stages_controller.cc:342] Saving "/dragonfly/snapshots/snapshot-summary.dfs" finished after 1 s
I20250520 12:46:00.019562    11 save_stages_controller.cc:342] Saving "/dragonfly/snapshots/snapshot-summary.dfs" finished after 1 s

Related: #238

georgeboot • May 30 '25 17:05

Any updates on this, and are there any workarounds? We can't use this in production without a solid solution (this issue just took down production 😛).

XLordalX • Jul 07 '25 15:07

The team is busy with other tasks. If someone debugs the issue on their side (operator logs, etc.) and submits a fix to the operator, we will review the PR.

romange • Jul 08 '25 02:07