The Dragonfly master does not fail over when a node fails in Kubernetes.
Describe the bug
Dragonfly is running in Kubernetes (1 master, 2 replicas). This is the second time that, during an issue with a node, the master has not failed over to a replica.
To Reproduce
Steps to reproduce the behavior:
- Run dragonfly in k8s
- Check in which node dragonfly master is running
- Shutdown the node
Expected behavior
A new master should be elected if the current master is unavailable.
Environment:
- Workers OS: Ubuntu 22.04.4 LTS
- Workers Kernel: 5.15.0-105-generic
- Containerized: Kubernetes v1.29.2
- dragonfly v1.26.2-096fde172300de91850c42dab24aa09ffee254d0
- image: docker.dragonflydb.io/dragonflydb/operator:v1.1.9
Reproducible Code Snippet
k get po -l master    # check which pod is currently the master
shutdown now          # run on the node where the master pod is scheduled
# wait
k get po -l master    # no master is reported
Additional context
Master pod stuck in Pending state:
dragonfly-foo-0 0/1 Pending 0 54m
dragonfly-foo-1 1/1 Running 0 22d
dragonfly-foo-2 1/1 Running 0 22d
Info from slaves:
root@dragonfly-foo-1:/data# redis-cli
127.0.0.1:6379> info replication
# Replication
role:slave
master_host:172.20.31.160
master_port:9999
master_link_status:down
master_last_io_seconds_ago:2809
master_sync_in_progress:0
master_replid:779aexec8fa961f6e7a431b46bb37ff4d86c41bc
slave_priority:100
slave_read_only:1
root@dragonfly-foo-2:/data# redis-cli
127.0.0.1:6379> info replication
# Replication
role:slave
master_host:172.20.31.160
master_port:9999
master_link_status:down
master_last_io_seconds_ago:2829
master_sync_in_progress:0
master_replid:779aexec8fa961f6e7a431b46bb37ff4d86c41bc
slave_priority:100
slave_read_only:1
And the pod logs show only:
W20250311 14:38:34.344835 11 replica.cc:217] Error connecting to 172.20.31.160:9999 system:125
W20250311 14:38:35.845142 11 replica.cc:217] Error connecting to 172.20.31.160:9999 system:125
Same issue here...
It makes the "HA" setup kind of useless.
Dragonfly Operator doesn't support cluster mode. Can you try with cluster mode off and check whether you can still reproduce this? Also, please share the operator logs here.
@Abhra303 at least I'm running in cluster_mode=emulated.
Is this also not supported?
And if not, what would a "real" HA setup look like using the operator if cluster modes are not supported? I really can't find anything in the docs.
@Abhra303 we are not using the --cluster-mode flag at all (I removed cluster mode from the first post to avoid misunderstanding):
apiVersion: dragonflydb.io/v1alpha1
kind: Dragonfly
metadata:
  namespace: default
  labels:
    ...
  name: dragonfly-foo
spec:
  args:
    - "--dbfilename=dump"
  replicas: 3
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
              - key: app
                operator: In
                values:
                  - dragonfly-foo
          topologyKey: "kubernetes.io/hostname"
  tolerations:   # see the note after this CR about the unreachable taint
    - key: "node.kubernetes.io/not-ready"
      operator: "Exists"
      effect: "NoExecute"
      tolerationSeconds: 10
  snapshot:
    cron: "*/5 * * * *"
    persistentVolumeClaimSpec:
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          storage: 5Gi
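Side note on the tolerations above: they only bound the node.kubernetes.io/not-ready taint. Depending on how the node goes down, Kubernetes may instead apply the node.kubernetes.io/unreachable taint, which pods tolerate for a default 300 seconds, so the dead master can stay bound to the node for five minutes. A sketch of also bounding that toleration (assuming the CR's tolerations field is passed through to the Dragonfly pods unchanged):

  tolerations:
    - key: "node.kubernetes.io/not-ready"
      operator: "Exists"
      effect: "NoExecute"
      tolerationSeconds: 10
    - key: "node.kubernetes.io/unreachable"   # added: covers the taint set when the node stops heartbeating
      operator: "Exists"
      effect: "NoExecute"
      tolerationSeconds: 10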
@Abhra303 at least I'm running in cluster_mode=emulated. Is this also not supported?
That is supported.
@Abhra303 we are not using the --cluster-mode flag at all:
Configuration looks good. Could you share the operator logs?
This time I was able to reproduce it in Minikube, but it behaves a bit differently (it switches after a couple of minutes, and the master pod ends up in Terminating status instead of Pending as before).
I'll need some time before I can fully reproduce it on the bare-metal machines.
Logs from the operator (Minikube):
2025-03-24T09:32:18Z INFO getting all pods relevant to the instance {"controller": "DragonflyPodLifecycle", "controllerGroup": "", "controllerKind": "Pod", "Pod": {"name":"dragonfly-foo-0","namespace":"default"}, "namespace": "default", "name": "dragonfly-foo-0", "reconcileID": "1d8a054a-9348-485d-9824-f7ae333b368c"}
2025-03-24T09:32:18Z ERROR couldn't find healthy and mark active {"controller": "DragonflyPodLifecycle", "controllerGroup": "", "controllerKind": "Pod", "Pod": {"name":"dragonfly-foo-0","namespace":"default"}, "namespace": "default", "name": "dragonfly-foo-0", "reconcileID": "1d8a054a-9348-485d-9824-f7ae333b368c", "error": "Operation cannot be fulfilled on pods \"dragonfly-foo-0\": StorageError: invalid object, Code: 4, Key: /registry/pods/default/dragonfly-foo-0, ResourceVersion: 0, AdditionalErrorMsg: Precondition failed: UID in precondition: 23497fd7-ad07-4ddc-a700-b195782ee61d, UID in object meta: "}
github.com/dragonflydb/dragonfly-operator/internal/controller.(*DfPodLifeCycleReconciler).Reconcile
/workspace/internal/controller/dragonfly_pod_lifecycle_controller.go:165
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:114
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:311
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:261
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:222
2025-03-24T09:32:18Z INFO Warning: Reconciler returned both a non-zero result and a non-nil error. The result will always be ignored if the error is non-nil and the non-nil error causes reqeueuing with exponential backoff. For more details, see: https://pkg.go.dev/sigs.k8s.io/controller-runtime/pkg/reconcile#Reconciler {"controller": "DragonflyPodLifecycle", "controllerGroup": "", "controllerKind": "Pod", "Pod": {"name":"dragonfly-foo-0","namespace":"default"}, "namespace": "default", "name": "dragonfly-foo-0", "reconcileID": "1d8a054a-9348-485d-9824-f7ae333b368c"}
2025-03-24T09:32:18Z ERROR Reconciler error {"controller": "DragonflyPodLifecycle", "controllerGroup": "", "controllerKind": "Pod", "Pod": {"name":"dragonfly-foo-0","namespace":"default"}, "namespace": "default", "name": "dragonfly-foo-0", "reconcileID": "1d8a054a-9348-485d-9824-f7ae333b368c", "error": "Operation cannot be fulfilled on pods \"dragonfly-foo-0\": StorageError: invalid object, Code: 4, Key: /registry/pods/default/dragonfly-foo-0, ResourceVersion: 0, AdditionalErrorMsg: Precondition failed: UID in precondition: 23497fd7-ad07-4ddc-a700-b195782ee61d, UID in object meta: "}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:324
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:261
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:222
2025-03-24T09:32:18Z INFO Received {"controller": "DragonflyPodLifecycle", "controllerGroup": "", "controllerKind": "Pod", "Pod": {"name":"dragonfly-foo-0","namespace":"default"}, "namespace": "default", "name": "dragonfly-foo-0", "reconcileID": "322a428e-1846-4816-80e4-bf6f6d023ab7", "pod": {"name":"dragonfly-foo-0","namespace":"default"}}
2025-03-24T09:32:18Z INFO Received {"controller": "DragonflyPodLifecycle", "controllerGroup": "", "controllerKind": "Pod", "Pod": {"name":"dragonfly-foo-0","namespace":"default"}, "namespace": "default", "name": "dragonfly-foo-0", "reconcileID": "6bfd89cc-7782-4ac9-8b99-9e77a53998a2", "pod": {"name":"dragonfly-foo-0","namespace":"default"}}
2025-03-24T09:32:18Z INFO Received {"controller": "DragonflyPodLifecycle", "controllerGroup": "", "controllerKind": "Pod", "Pod": {"name":"dragonfly-foo-0","namespace":"default"}, "namespace": "default", "name": "dragonfly-foo-0", "reconcileID": "2d5a1f5a-2a50-4f02-a6fe-2a202baa5ca4", "pod": {"name":"dragonfly-foo-0","namespace":"default"}}
2025-03-24T09:32:18Z INFO Pod is not ready yet {"controller": "DragonflyPodLifecycle", "controllerGroup": "", "controllerKind": "Pod", "Pod": {"name":"dragonfly-foo-0","namespace":"default"}, "namespace": "default", "name": "dragonfly-foo-0", "reconcileID": "2d5a1f5a-2a50-4f02-a6fe-2a202baa5ca4", "pod": {"name":"dragonfly-foo-0","namespace":"default"}}
2025-03-24T09:32:18Z INFO Received {"controller": "DragonflyPodLifecycle", "controllerGroup": "", "controllerKind": "Pod", "Pod": {"name":"dragonfly-foo-0","namespace":"default"}, "namespace": "default", "name": "dragonfly-foo-0", "reconcileID": "8a0cd0b4-6c66-4dda-9134-101de3f12c5d", "pod": {"name":"dragonfly-foo-0","namespace":"default"}}
2025-03-24T09:32:18Z INFO Pod is not ready yet {"controller": "DragonflyPodLifecycle", "controllerGroup": "", "controllerKind": "Pod", "Pod": {"name":"dragonfly-foo-0","namespace":"default"}, "namespace": "default", "name": "dragonfly-foo-0", "reconcileID": "8a0cd0b4-6c66-4dda-9134-101de3f12c5d", "pod": {"name":"dragonfly-foo-0","namespace":"default"}}
2025-03-24T09:32:18Z INFO Reconciling Dragonfly object {"controller": "Dragonfly", "controllerGroup": "dragonflydb.io", "controllerKind": "Dragonfly", "Dragonfly": {"name":"dragonfly-foo","namespace":"default"}, "namespace": "default", "name": "dragonfly-foo", "reconcileID": "0bfb45c3-3fc1-47c2-a21c-c88000c1fb2f"}
2025-03-24T09:32:18Z INFO Creating resources for dragonfly-foo {"controller": "Dragonfly", "controllerGroup": "dragonflydb.io", "controllerKind": "Dragonfly", "Dragonfly": {"name":"dragonfly-foo","namespace":"default"}, "namespace": "default", "name": "dragonfly-foo", "reconcileID": "0bfb45c3-3fc1-47c2-a21c-c88000c1fb2f"}
2025-03-24T09:32:18Z INFO updating existing resources {"controller": "Dragonfly", "controllerGroup": "dragonflydb.io", "controllerKind": "Dragonfly", "Dragonfly": {"name":"dragonfly-foo","namespace":"default"}, "namespace": "default", "name": "dragonfly-foo", "reconcileID": "0bfb45c3-3fc1-47c2-a21c-c88000c1fb2f"}
2025-03-24T09:32:18Z INFO Creating resources for dragonfly-foo {"controller": "Dragonfly", "controllerGroup": "dragonflydb.io", "controllerKind": "Dragonfly", "Dragonfly": {"name":"dragonfly-foo","namespace":"default"}, "namespace": "default", "name": "dragonfly-foo", "reconcileID": "0bfb45c3-3fc1-47c2-a21c-c88000c1fb2f"}
2025-03-24T09:32:18Z INFO Updated resources for object {"controller": "Dragonfly", "controllerGroup": "dragonflydb.io", "controllerKind": "Dragonfly", "Dragonfly": {"name":"dragonfly-foo","namespace":"default"}, "namespace": "default", "name": "dragonfly-foo", "reconcileID": "0bfb45c3-3fc1-47c2-a21c-c88000c1fb2f"}
2025-03-24T09:32:18Z INFO Received {"controller": "DragonflyPodLifecycle", "controllerGroup": "", "controllerKind": "Pod", "Pod": {"name":"dragonfly-foo-0","namespace":"default"}, "namespace": "default", "name": "dragonfly-foo-0", "reconcileID": "46debbcd-f7ba-446c-bfd4-760570e0a7df", "pod": {"name":"dragonfly-foo-0","namespace":"default"}}
2025-03-24T09:32:18Z INFO Pod is not ready yet {"controller": "DragonflyPodLifecycle", "controllerGroup": "", "controllerKind": "Pod", "Pod": {"name":"dragonfly-foo-0","namespace":"default"}, "namespace": "default", "name": "dragonfly-foo-0", "reconcileID": "46debbcd-f7ba-446c-bfd4-760570e0a7df", "pod": {"name":"dragonfly-foo-0","namespace":"default"}}
2025-03-24T09:32:23Z INFO Received {"controller": "DragonflyPodLifecycle", "controllerGroup": "", "controllerKind": "Pod", "Pod": {"name":"dragonfly-foo-0","namespace":"default"}, "namespace": "default", "name": "dragonfly-foo-0", "reconcileID": "c55a88ca-d6f8-48fd-9bfc-ab4d0baa15ff", "pod": {"name":"dragonfly-foo-0","namespace":"default"}}
2025-03-24T09:32:23Z INFO Pod is not ready yet {"controller": "DragonflyPodLifecycle", "controllerGroup": "", "controllerKind": "Pod", "Pod": {"name":"dragonfly-foo-0","namespace":"default"}, "namespace": "default", "name": "dragonfly-foo-0", "reconcileID": "c55a88ca-d6f8-48fd-9bfc-ab4d0baa15ff", "pod": {"name":"dragonfly-foo-0","namespace":"default"}}
Seeing the same issue:
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
dragonfly-0 2/2 Running 0 23m 10.12.41.189 ip-10-10-5-209.eu-north-1.compute.internal <none> <none>
dragonfly-1 2/2 Running 0 22m 10.12.9.68 ip-10-10-3-32.eu-north-1.compute.internal <none> <none>
Where dragonfly-0 is the master. On node ip-10-10-5-209.eu-north-1.compute.internal I just ran ifconfig ens5 down, and failover does not happen.
dragonfly.txt
Kubernetes version 1.31, dragonfly operator v1.1.10. Everything works with v1.1.8.
We just released a new version, v1.1.11, which contains some failover monitoring improvements. Please give it a try. If you are still facing any issues, I will take a look and fix them.
Thanks @Abhra303, I will do that during the next month and come back to you.
I wasn't able to fully reproduce the situation where the pod gets stuck in Pending status, as in the first message. Now everything ran successfully; it just took quite a long time (8 minutes).
But it's possible this happened because the operator was running on the same node as the Dragonfly master, and when I shut down that node, both went down at once.
I20250421 08:21:25.601224 11 version_monitor.cc:174] Your current version '1.28.1' is not the latest version. A newer version '1.28.2' is now available. Please consider an update.
I20250421 08:21:37.808264 11 server_family.cc:2938] Replicating 172.16.20.70:9999
I20250421 08:21:37.815151 11 replica.cc:574] Started full sync with 172.16.20.70:9999
I20250421 08:21:37.815819 11 replica.cc:594] full sync finished in 3 ms
I20250421 08:21:37.815891 11 replica.cc:684] Transitioned into stable sync
W20250421 08:23:02.748415 11 common.cc:400] ReportError: Software caused connection abort
I20250421 08:23:02.749001 11 replica.cc:708] Exit stable sync
W20250421 08:23:02.749046 11 replica.cc:259] Error stable sync with 172.16.20.70:9999 system:103 Software caused connection abort
W20250421 08:23:03.249989 11 replica.cc:211] Error connecting to 172.16.20.70:9999 system:111
W20250421 08:23:03.751040 11 replica.cc:211] Error connecting to 172.16.20.70:9999 system:111
W20250421 08:23:04.252092 11 replica.cc:211] Error connecting to 172.16.20.70:9999 system:111
W20250421 08:23:04.753152 11 replica.cc:211] Error connecting to 172.16.20.70:9999 system:111
W20250421 08:23:06.253674 11 replica.cc:211] Error connecting to 172.16.20.70:9999 system:125
W20250421 08:23:07.754175 11 replica.cc:211] Error connecting to 172.16.20.70:9999 system:125
...
W20250421 08:24:01.771023 11 replica.cc:211] Error connecting to 172.16.20.70:9999 system:125
W20250421 08:24:03.271564 11 replica.cc:211] Error connecting to 172.16.20.70:9999 system:125
I20250421 08:24:04.480273 12 server_family.cc:2938] Replicating 172.16.22.84:9999
W20250421 08:24:04.480374 11 common.cc:400] ReportError: Operation canceled: ExecutionState cancelled
W20250421 08:24:04.772002 11 replica.cc:211] Error connecting to 172.16.20.70:9999 system:125
I20250421 08:24:04.781895 12 replica.cc:574] Started full sync with 172.16.22.84:9999
I20250421 08:24:04.782686 12 replica.cc:594] full sync finished in 6 ms
I20250421 08:24:04.782757 12 replica.cc:684] Transitioned into stable sync
I20250421 08:25:00.001801 12 save_stages_controller.cc:342] Saving "/dragonfly/snapshots/dump-summary.dfs" finished after 1 s
W20250421 08:25:03.157683 12 common.cc:400] ReportError: Software caused connection abort
I20250421 08:25:03.158156 12 replica.cc:708] Exit stable sync
W20250421 08:25:03.158183 12 replica.cc:259] Error stable sync with 172.16.22.84:9999 system:103 Software caused connection abort
W20250421 08:25:03.659197 12 replica.cc:211] Error connecting to 172.16.22.84:9999 system:111
W20250421 08:25:04.160629 12 replica.cc:211] Error connecting to 172.16.22.84:9999 system:111
W20250421 08:25:04.661645 12 replica.cc:211] Error connecting to 172.16.22.84:9999 system:111
W20250421 08:25:05.162596 12 replica.cc:211] Error connecting to 172.16.22.84:9999 system:111
W20250421 08:25:06.663034 12 replica.cc:211] Error connecting to 172.16.22.84:9999 system:125
W20250421 08:25:08.163501 12 replica.cc:211] Error connecting to 172.16.22.84:9999 system:125
W20250421 08:25:09.664011 12 replica.cc:211] Error connecting to 172.16.22.84:9999 system:125
W20250421 08:25:11.164708 12 replica.cc:211] Error connecting to 172.16.22.84:9999 system:125
...
W20250421 08:31:33.799101 12 replica.cc:211] Error connecting to 172.16.22.84:9999 system:125
I20250421 08:31:35.288286 12 server_family.cc:2938] Replicating NO ONE
W20250421 08:31:35.288419 12 common.cc:400] ReportError: Operation canceled: ExecutionState cancelled
W20250421 08:31:35.299602 12 replica.cc:211] Error connecting to 172.16.22.84:9999 system:125
I20250421 08:31:35.799849 11 dflycmd.cc:647] Registered replica 172.16.26.187:6379
I20250421 08:31:35.803916 11 dflycmd.cc:345] Started sync with replica 172.16.26.187:6379
Next week I’ll be able to try it on the second cluster.
BTW, an 8-minute new-master election is also kind of a critical problem.
Hi @Abhra303, so far I haven't been able to reproduce this issue with the new version. The only problem, as I mentioned earlier, is when the operator runs on the same node as the master, and I have a question about that. Is it acceptable to scale the operator to multiple replicas?
I was just about to open a ticket to mention the same thing, but I'll comment here instead for now since it was likely at play here. The fatal flaw with how HA works at the moment is that it depends on the operator being operational to change the labels for the service pod switch. If the current master and the operator share the same node and that node dies, the relabeling cannot happen until the operator has been rescheduled somewhere else and started, and from my brief attempt to scale the operator it does not currently support multiple replicas.
In an ideal world I would make sure they are not scheduled on the same node (e.g. via pod anti-affinity, see the sketch below), but in a simple 2-3 node cluster that can't be guaranteed without creating dedicated nodes/node pools for the operator. The same applies to regional outages: the operator could share the region but not the node with the master, and there will still be trouble.
I like the simplicity of this HA model compared to having to run Redis Cluster or Sentinel, particularly when I have to host a service whose authors didn't think to support Sentinel or Cluster. But a step in the right direction would be the ability to scale the operator so that it is also highly available, since it is the key piece in the failover process.
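One way to express that preference today is pod anti-affinity on the operator Deployment's pod template. A sketch only, assuming the Dragonfly pods carry the app: dragonfly-foo label used in the CR above and run in the default namespace:

  spec:
    template:
      spec:
        affinity:
          podAntiAffinity:
            preferredDuringSchedulingIgnoredDuringExecution:
              - weight: 100
                podAffinityTerm:
                  labelSelector:
                    matchExpressions:
                      - key: app
                        operator: In
                        values:
                          - dragonfly-foo
                  namespaces:
                    - default               # namespace where the Dragonfly pods run
                  topologyKey: kubernetes.io/hostname

This is only a preference, though; on a 2-3 node cluster the scheduler can still co-locate the operator and the master when there is no other node available, which is exactly the limitation described above.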
@micahnz you can run a couple of operator replicas with leader election; this way you will reduce the switchover time when one of them is unavailable.
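For illustration, a sketch of the relevant part of the operator Deployment with two replicas and leader election enabled. The name and namespace below are assumptions based on a typical kustomize-style install, and --leader-elect is assumed to be the standard controller-runtime flag the manager exposes, so adjust all of these to your install:

  apiVersion: apps/v1
  kind: Deployment
  metadata:
    name: dragonfly-operator-controller-manager   # assumed name; use whatever your install created
    namespace: dragonfly-operator-system          # assumed namespace
  spec:
    replicas: 2                                   # standby replica can take over if the leader's node dies
    template:
      spec:
        containers:
          - name: manager
            args:
              - --leader-elect                    # only the elected leader reconciles; the other waits on the lease

With leader election on, only one replica reconciles at a time and the other takes over the lease when the leader disappears, so losing the master's node no longer takes the only operator down with it.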
Thanks @iyuroch, I just realized I did something dumb at the end of the day yesterday, which is why I couldn't get two operators running.
Worth mentioning in the docs though, I think.
I just got the same today: my master dragonfly-0 got OOM-killed (which I need to diagnose on its own). The replica was up to date, but it seems that the old master got up and running again fast enough that the operator didn't have time to promote dragonfly-1 to master.
What makes me suspect this is the following section:
# old master gets killed
W20250925 10:22:17.559451 11 replica.cc:233] Error connecting to 10.0.2.39:9999 (phase: TCP_CONNECTING): system:111, reason: Connection refused
...
# starts sync with the new, empty dragonfly-0?!
I20250925 10:22:31.007472 11 replica.cc:623] Started full sync with 10.0.2.39:9999
# coming from the operator I suppose, but a bit late...
I20250925 10:22:31.008371 11 server_family.cc:3400] Replicating NO ONE
I might be completely off here, so any insight is welcome.