Drop connected clients when instance is no longer master
Previously reported in https://github.com/dragonflydb/dragonfly/issues/5160
Describe the bug
We encountered a situation where Dragonfly decided to switch masters (for no apparent reason as far as I can tell, but that's a separate topic). We are using the operator in Kubernetes. Our apps connect to Dragonfly via the service, so Kubernetes routes the traffic to the current master.
When the master changes, open connections don't magically re-route to the new master but instead stay connected to the now no-longer-master. Because this instance is now read-only, writing to it will fail:
READONLY You can't write against a read only replica.
Most of the time, a failover occurs because the old master is no longer alive, which already kills open connections. But in a "planned switchover" (or whatever Dragonfly calls this internally), the old master stays up and its clients stay connected.
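Until stale connections are dropped on the server side, a client can treat the READONLY reply as a signal to reconnect through the service. Below is a minimal sketch of that idea, not a tested workaround. The `Connection` protocol, `ServerError`, and `ReconnectingClient` are hypothetical names made up for illustration; a real client library would supply its own connection and error types.

```python
import time
from typing import Callable, Protocol


class Connection(Protocol):
    """Hypothetical minimal connection interface (assumption for this sketch)."""

    def execute(self, *args: str) -> object: ...
    def close(self) -> None: ...


class ServerError(Exception):
    """Hypothetical error type carrying the server's error reply text."""


def is_readonly_error(exc: Exception) -> bool:
    # The error reply quoted above starts with the READONLY prefix;
    # some clients keep the leading '-' of the wire format.
    return str(exc).lstrip("-").startswith("READONLY")


class ReconnectingClient:
    """Reopens the connection when a write is rejected with READONLY."""

    def __init__(
        self,
        connect: Callable[[], Connection],
        max_retries: int = 3,
        backoff_s: float = 0.5,
    ) -> None:
        self._connect = connect
        self._max_retries = max_retries
        self._backoff_s = backoff_s
        self._conn = connect()

    def execute(self, *args: str) -> object:
        for attempt in range(self._max_retries + 1):
            try:
                return self._conn.execute(*args)
            except ServerError as exc:
                if not is_readonly_error(exc) or attempt == self._max_retries:
                    raise
                # Drop the stale connection. A fresh connection through the
                # service should be routed to whichever pod is master now.
                self._conn.close()
                time.sleep(self._backoff_s * (attempt + 1))
                self._conn = self._connect()
        raise RuntimeError("unreachable")
```

The `connect` callable is expected to open a new connection to the service address each time it is called, so that Kubernetes gets a chance to route it to the current master.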
To Reproduce
Steps to reproduce the behavior:
- Deploy the operator with 2 or more replicas
- Connect any app via the service and have it try to write every second
- Trigger a switchover without killing the current master
- See writes fail with the READONLY error
Expected behavior
When Dragonfly changes roles, it should give connected clients some indication that they need to reconnect in order to talk to the new master. A simple disconnect would be very effective.
As suggested by @romange:
The operator can send a "client kill" command.
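To make that suggestion concrete, here is a rough sketch of what sending such a command to the demoted instance could look like at the protocol level, using only the Python standard library. The function names are hypothetical, and the `TYPE normal` filter in the usage note is Redis syntax; whether Dragonfly 1.30.0 accepts that filter is an assumption that needs to be verified. This is not how the operator is implemented.

```python
import socket


def encode_command(*parts: str) -> bytes:
    """Encode a command as a RESP array of bulk strings."""
    out = [b"*%d\r\n" % len(parts)]
    for part in parts:
        data = part.encode("utf-8")
        out.append(b"$%d\r\n%s\r\n" % (len(data), data))
    return b"".join(out)


def kill_clients(
    host: str,
    *filter_args: str,
    port: int = 6379,
    timeout_s: float = 2.0,
) -> bytes:
    """Send CLIENT KILL with the given filter to one instance.

    Returns the raw reply bytes without parsing them. The filter arguments
    are passed through unchanged, so they must match what the server supports.
    """
    with socket.create_connection((host, port), timeout=timeout_s) as sock:
        sock.sendall(encode_command("CLIENT", "KILL", *filter_args))
        return sock.recv(4096)
```

For example, `kill_clients(old_master_ip, "TYPE", "normal")` would ask the old master to drop its regular client connections right after the role change, where `old_master_ip` stands for the pod IP of the demoted instance.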
Environment (please complete the following information):
- Containerized?: Kubernetes via Operator
- Dragonfly Version: 1.30.0
Logs from the old master (now a replica)
I20250520 12:36:00.022531 11 save_stages_controller.cc:342] Saving "/dragonfly/snapshots/snapshot-summary.dfs" finished after 1 s
I20250520 12:37:00.020418 11 save_stages_controller.cc:342] Saving "/dragonfly/snapshots/snapshot-summary.dfs" finished after 1 s
I20250520 12:38:00.019855 11 save_stages_controller.cc:342] Saving "/dragonfly/snapshots/snapshot-summary.dfs" finished after 1 s
I20250520 12:39:00.018941 11 save_stages_controller.cc:342] Saving "/dragonfly/snapshots/snapshot-summary.dfs" finished after 1 s
I20250520 12:39:08.867802 11 dflycmd.cc:127] Disconnecting from replica 10.244.0.57:6379
W20250520 12:39:08.867852 11 common.cc:413] ReportError: Operation canceled: ExecutionState cancelled
I20250520 12:39:08.867894 11 dflycmd.cc:686] Replication error: Operation canceled: ExecutionState cancelled
I20250520 12:39:09.960482 11 server_family.cc:3009] Replicating 10.244.0.57:9999
I20250520 12:39:09.964658 11 replica.cc:580] Started full sync with 10.244.0.57:9999
I20250520 12:39:09.989066 11 replica.cc:600] full sync finished in 26 ms
I20250520 12:39:09.989102 11 replica.cc:690] Transitioned into stable sync
I20250520 12:40:00.019615 11 save_stages_controller.cc:342] Saving "/dragonfly/snapshots/snapshot-summary.dfs" finished after 1 s
I20250520 12:41:00.019866 11 save_stages_controller.cc:342] Saving "/dragonfly/snapshots/snapshot-summary.dfs" finished after 1 s
I20250520 12:42:00.017843 11 save_stages_controller.cc:342] Saving "/dragonfly/snapshots/snapshot-summary.dfs" finished after 1 s
I20250520 12:43:00.018100 11 save_stages_controller.cc:342] Saving "/dragonfly/snapshots/snapshot-summary.dfs" finished after 1 s
W20250520 12:43:28.733081 11 script_mgr.cc:328] Error running script (call to a08c9c69b7c07ae2485190873b90d128a23e502d): @user_script:2: -READONLY You can't write against a read only replica.
W20250520 12:43:28.733228 11 main_service.cc:1363] EVAL return redis.call(\'exists\',KEYS[1])<1 and redis.call(\'setex\',KEYS[1],ARGV[2],ARGV[1]) 1 ";N;} 900 failed with reason: Error running script (call to a08c9c69b7c07ae2485190873b90d128a23e502d): @user_script:2: -READONLY You can't write against a read only replica.
W20250520 12:43:36.952080 11 script_mgr.cc:328] Error running script (call to a08c9c69b7c07ae2485190873b90d128a23e502d): @user_script:2: -READONLY You can't write against a read only replica.
W20250520 12:43:36.952189 11 main_service.cc:1363] EVAL return redis.call(\'exists\',KEYS[1])<1 and redis.call(\'setex\',KEYS[1],ARGV[2],ARGV[1]) 1 ";N;} 900 failed with reason: Error running script (call to a08c9c69b7c07ae2485190873b90d128a23e502d): @user_script:2: -READONLY You can't write against a read only replica.
I20250520 12:44:00.018446 11 save_stages_controller.cc:342] Saving "/dragonfly/snapshots/snapshot-summary.dfs" finished after 1 s
I20250520 12:45:00.018730 11 save_stages_controller.cc:342] Saving "/dragonfly/snapshots/snapshot-summary.dfs" finished after 1 s
I20250520 12:46:00.019562 11 save_stages_controller.cc:342] Saving "/dragonfly/snapshots/snapshot-summary.dfs" finished after 1 s
Related: #238
Any updates on this, and are there any workarounds? We can't use this in production without a solid solution (this issue just took down production 😛).
The team is busy with other tasks. If someone debugs the issue on their side (operator logs, etc.) and submits a fix to the operator, we would review the PR.