
The Dragonfly master does not switch when a node fails in Kubernetes.

Open Kirgod opened this issue 9 months ago • 17 comments

Describe the bug Dragonfly is running in Kubernetes (1 master, 2 replicas). For the second time, during an issue with a node, the master did not fail over to a replica.

To Reproduce Steps to reproduce the behavior:

  1. Run dragonfly in k8s
  2. Check in which node dragonfly master is running
  3. Shutdown the node

Expected behavior A new master should be elected if the current master is unavailable.

Environment:

  • Workers OS: Ubuntu 22.04.4 LTS
  • Workers Kernel: 5.15.0-105-generic
  • Containerized: Kubernetes v1.29.2
  • dragonfly v1.26.2-096fde172300de91850c42dab24aa09ffee254d0
  • image: docker.dragonflydb.io/dragonflydb/operator:v1.1.9

Reproducible Code Snippet

k get po -l master
shutdown now # node where pod with master is running
# wait
k get po -l master # (no master)
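
For reference, a more explicit version of the same check. This is only a sketch: it assumes the operator's default pod labels (app set to the instance name and role=master on the active master, which the Service selector uses); adjust the selectors to your setup.

# list the Dragonfly pods and the node each one runs on
kubectl get pods -l app=dragonfly-foo -o wide

# show which pod is currently labelled as the master
kubectl get pods -l role=master -o wide

# on the node hosting the master pod:
shutdown now

# wait, then check again; with this bug no surviving pod gets relabelled
kubectl get pods -l role=master -o wide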

Additional context The master pod is stuck in Pending state:

dragonfly-foo-0                             0/1     Pending   0                54m
dragonfly-foo-1                             1/1     Running   0                22d
dragonfly-foo-2                             1/1     Running   0                22d

Info from slaves:

root@dragonfly-foo-1:/data# redis-cli
127.0.0.1:6379> info replication
# Replication
role:slave
master_host:172.20.31.160
master_port:9999
master_link_status:down
master_last_io_seconds_ago:2809
master_sync_in_progress:0
master_replid:779aexec8fa961f6e7a431b46bb37ff4d86c41bc
slave_priority:100
slave_read_only:1

root@dragonfly-foo-2:/data# redis-cli
127.0.0.1:6379>  info replication
# Replication
role:slave
master_host:172.20.31.160
master_port:9999
master_link_status:down
master_last_io_seconds_ago:2829
master_sync_in_progress:0
master_replid:779aexec8fa961f6e7a431b46bb37ff4d86c41bc
slave_priority:100
slave_read_only:1

And in the pod logs there is just:

W20250311 14:38:34.344835    11 replica.cc:217] Error connecting to 172.20.31.160:9999 system:125
W20250311 14:38:35.845142    11 replica.cc:217] Error connecting to 172.20.31.160:9999 system:125

Kirgod avatar Mar 12 '25 10:03 Kirgod

same issue here..

makes the "HA" setup kind of useless

eloo avatar Mar 22 '25 10:03 eloo

The Dragonfly Operator doesn't support cluster mode. Can you try with cluster mode off and check if you can reproduce? Also, please share the operator logs here.
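
(For reference, a sketch for pulling those operator logs; it assumes the default install namespace and Deployment name, dragonfly-operator-system and dragonfly-operator-controller-manager, so adjust to your setup.)

# dump recent operator logs to a file for attaching to the issue
kubectl -n dragonfly-operator-system logs deployment/dragonfly-operator-controller-manager --tail=500 > operator.log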

Abhra303 avatar Mar 24 '25 08:03 Abhra303

@Abhra303 at least I'm running with cluster_mode=emulated; is this also not supported?

And if not, what would a "real" HA setup look like using the operator if no cluster modes are supported? I really can't find anything in the docs.

eloo avatar Mar 24 '25 08:03 eloo

@Abhra303 we are not using the --cluster-mode flag at all (I removed cluster mode from the first post to avoid misunderstanding):

apiVersion: dragonflydb.io/v1alpha1
kind: Dragonfly
metadata:
  namespace: default
  labels:
    ...
  name: dragonfly-foo
spec:
  args:
    - "--dbfilename=dump"
  replicas: 3
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - dragonfly-foo
        topologyKey: "kubernetes.io/hostname"
  tolerations:
    - key: "node.kubernetes.io/not-ready"
      operator: "Exists"
      effect: "NoExecute"
      tolerationSeconds: 10
  snapshot:
    cron: "*/5 * * * *"
    persistentVolumeClaimSpec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 5Gi

Kirgod avatar Mar 24 '25 09:03 Kirgod

@Abhra303 at least I'm running with cluster_mode=emulated; is this also not supported?

That is supported.

Abhra303 avatar Mar 24 '25 09:03 Abhra303

@Abhra303 we are not using the --cluster-mode flag at all:

Configuration looks good. Could you share the operator logs?

Abhra303 avatar Mar 24 '25 09:03 Abhra303

This time I was able to reproduce it in Minikube, but it behaves a bit differently (it switches after a couple of minutes, and the master pod ends up in Terminating status instead of Pending as before). To fully reproduce it on bare-metal machines, I'll need some time.

Logs from operator (minikube):

2025-03-24T09:32:18Z    INFO    getting all pods relevant to the instance       {"controller": "DragonflyPodLifecycle", "controllerGroup": "", "controllerKind": "Pod", "Pod": {"name":"dragonfly-foo-0","namespace":"default"}, "namespace": "default", "name": "dragonfly-foo-0", "reconcileID": "1d8a054a-9348-485d-9824-f7ae333b368c"}
2025-03-24T09:32:18Z    ERROR   couldn't find healthy and mark active   {"controller": "DragonflyPodLifecycle", "controllerGroup": "", "controllerKind": "Pod", "Pod": {"name":"dragonfly-foo-0","namespace":"default"}, "namespace": "default", "name": "dragonfly-foo-0", "reconcileID": "1d8a054a-9348-485d-9824-f7ae333b368c", "error": "Operation cannot be fulfilled on pods \"dragonfly-foo-0\": StorageError: invalid object, Code: 4, Key: /registry/pods/default/dragonfly-foo-0, ResourceVersion: 0, AdditionalErrorMsg: Precondition failed: UID in precondition: 23497fd7-ad07-4ddc-a700-b195782ee61d, UID in object meta: "}
github.com/dragonflydb/dragonfly-operator/internal/controller.(*DfPodLifeCycleReconciler).Reconcile
        /workspace/internal/controller/dragonfly_pod_lifecycle_controller.go:165
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:114
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:311
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:261
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:222
2025-03-24T09:32:18Z    INFO    Warning: Reconciler returned both a non-zero result and a non-nil error. The result will always be ignored if the error is non-nil and the non-nil error causes reqeueuing with exponential backoff. For more details, see: https://pkg.go.dev/sigs.k8s.io/controller-runtime/pkg/reconcile#Reconciler      {"controller": "DragonflyPodLifecycle", "controllerGroup": "", "controllerKind": "Pod", "Pod": {"name":"dragonfly-foo-0","namespace":"default"}, "namespace": "default", "name": "dragonfly-foo-0", "reconcileID": "1d8a054a-9348-485d-9824-f7ae333b368c"}
2025-03-24T09:32:18Z    ERROR   Reconciler error        {"controller": "DragonflyPodLifecycle", "controllerGroup": "", "controllerKind": "Pod", "Pod": {"name":"dragonfly-foo-0","namespace":"default"}, "namespace": "default", "name": "dragonfly-foo-0", "reconcileID": "1d8a054a-9348-485d-9824-f7ae333b368c", "error": "Operation cannot be fulfilled on pods \"dragonfly-foo-0\": StorageError: invalid object, Code: 4, Key: /registry/pods/default/dragonfly-foo-0, ResourceVersion: 0, AdditionalErrorMsg: Precondition failed: UID in precondition: 23497fd7-ad07-4ddc-a700-b195782ee61d, UID in object meta: "}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:324
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:261
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:222
2025-03-24T09:32:18Z    INFO    Received        {"controller": "DragonflyPodLifecycle", "controllerGroup": "", "controllerKind": "Pod", "Pod": {"name":"dragonfly-foo-0","namespace":"default"}, "namespace": "default", "name": "dragonfly-foo-0", "reconcileID": "322a428e-1846-4816-80e4-bf6f6d023ab7", "pod": {"name":"dragonfly-foo-0","namespace":"default"}}
2025-03-24T09:32:18Z    INFO    Received        {"controller": "DragonflyPodLifecycle", "controllerGroup": "", "controllerKind": "Pod", "Pod": {"name":"dragonfly-foo-0","namespace":"default"}, "namespace": "default", "name": "dragonfly-foo-0", "reconcileID": "6bfd89cc-7782-4ac9-8b99-9e77a53998a2", "pod": {"name":"dragonfly-foo-0","namespace":"default"}}
2025-03-24T09:32:18Z    INFO    Received        {"controller": "DragonflyPodLifecycle", "controllerGroup": "", "controllerKind": "Pod", "Pod": {"name":"dragonfly-foo-0","namespace":"default"}, "namespace": "default", "name": "dragonfly-foo-0", "reconcileID": "2d5a1f5a-2a50-4f02-a6fe-2a202baa5ca4", "pod": {"name":"dragonfly-foo-0","namespace":"default"}}
2025-03-24T09:32:18Z    INFO    Pod is not ready yet    {"controller": "DragonflyPodLifecycle", "controllerGroup": "", "controllerKind": "Pod", "Pod": {"name":"dragonfly-foo-0","namespace":"default"}, "namespace": "default", "name": "dragonfly-foo-0", "reconcileID": "2d5a1f5a-2a50-4f02-a6fe-2a202baa5ca4", "pod": {"name":"dragonfly-foo-0","namespace":"default"}}
2025-03-24T09:32:18Z    INFO    Received        {"controller": "DragonflyPodLifecycle", "controllerGroup": "", "controllerKind": "Pod", "Pod": {"name":"dragonfly-foo-0","namespace":"default"}, "namespace": "default", "name": "dragonfly-foo-0", "reconcileID": "8a0cd0b4-6c66-4dda-9134-101de3f12c5d", "pod": {"name":"dragonfly-foo-0","namespace":"default"}}
2025-03-24T09:32:18Z    INFO    Pod is not ready yet    {"controller": "DragonflyPodLifecycle", "controllerGroup": "", "controllerKind": "Pod", "Pod": {"name":"dragonfly-foo-0","namespace":"default"}, "namespace": "default", "name": "dragonfly-foo-0", "reconcileID": "8a0cd0b4-6c66-4dda-9134-101de3f12c5d", "pod": {"name":"dragonfly-foo-0","namespace":"default"}}
2025-03-24T09:32:18Z    INFO    Reconciling Dragonfly object    {"controller": "Dragonfly", "controllerGroup": "dragonflydb.io", "controllerKind": "Dragonfly", "Dragonfly": {"name":"dragonfly-foo","namespace":"default"}, "namespace": "default", "name": "dragonfly-foo", "reconcileID": "0bfb45c3-3fc1-47c2-a21c-c88000c1fb2f"}
2025-03-24T09:32:18Z    INFO    Creating resources for dragonfly-foo    {"controller": "Dragonfly", "controllerGroup": "dragonflydb.io", "controllerKind": "Dragonfly", "Dragonfly": {"name":"dragonfly-foo","namespace":"default"}, "namespace": "default", "name": "dragonfly-foo", "reconcileID": "0bfb45c3-3fc1-47c2-a21c-c88000c1fb2f"}
2025-03-24T09:32:18Z    INFO    updating existing resources     {"controller": "Dragonfly", "controllerGroup": "dragonflydb.io", "controllerKind": "Dragonfly", "Dragonfly": {"name":"dragonfly-foo","namespace":"default"}, "namespace": "default", "name": "dragonfly-foo", "reconcileID": "0bfb45c3-3fc1-47c2-a21c-c88000c1fb2f"}
2025-03-24T09:32:18Z    INFO    Creating resources for dragonfly-foo    {"controller": "Dragonfly", "controllerGroup": "dragonflydb.io", "controllerKind": "Dragonfly", "Dragonfly": {"name":"dragonfly-foo","namespace":"default"}, "namespace": "default", "name": "dragonfly-foo", "reconcileID": "0bfb45c3-3fc1-47c2-a21c-c88000c1fb2f"}
2025-03-24T09:32:18Z    INFO    Updated resources for object    {"controller": "Dragonfly", "controllerGroup": "dragonflydb.io", "controllerKind": "Dragonfly", "Dragonfly": {"name":"dragonfly-foo","namespace":"default"}, "namespace": "default", "name": "dragonfly-foo", "reconcileID": "0bfb45c3-3fc1-47c2-a21c-c88000c1fb2f"}
2025-03-24T09:32:18Z    INFO    Received        {"controller": "DragonflyPodLifecycle", "controllerGroup": "", "controllerKind": "Pod", "Pod": {"name":"dragonfly-foo-0","namespace":"default"}, "namespace": "default", "name": "dragonfly-foo-0", "reconcileID": "46debbcd-f7ba-446c-bfd4-760570e0a7df", "pod": {"name":"dragonfly-foo-0","namespace":"default"}}
2025-03-24T09:32:18Z    INFO    Pod is not ready yet    {"controller": "DragonflyPodLifecycle", "controllerGroup": "", "controllerKind": "Pod", "Pod": {"name":"dragonfly-foo-0","namespace":"default"}, "namespace": "default", "name": "dragonfly-foo-0", "reconcileID": "46debbcd-f7ba-446c-bfd4-760570e0a7df", "pod": {"name":"dragonfly-foo-0","namespace":"default"}}
2025-03-24T09:32:23Z    INFO    Received        {"controller": "DragonflyPodLifecycle", "controllerGroup": "", "controllerKind": "Pod", "Pod": {"name":"dragonfly-foo-0","namespace":"default"}, "namespace": "default", "name": "dragonfly-foo-0", "reconcileID": "c55a88ca-d6f8-48fd-9bfc-ab4d0baa15ff", "pod": {"name":"dragonfly-foo-0","namespace":"default"}}
2025-03-24T09:32:23Z    INFO    Pod is not ready yet    {"controller": "DragonflyPodLifecycle", "controllerGroup": "", "controllerKind": "Pod", "Pod": {"name":"dragonfly-foo-0","namespace":"default"}, "namespace": "default", "name": "dragonfly-foo-0", "reconcileID": "c55a88ca-d6f8-48fd-9bfc-ab4d0baa15ff", "pod": {"name":"dragonfly-foo-0","namespace":"default"}}

Kirgod avatar Mar 24 '25 09:03 Kirgod

Here are the logs when you just shut down the node where the master is running:

dragonfly.txt

eloo avatar Mar 24 '25 18:03 eloo

Seeing the same issue:

NAME          READY   STATUS    RESTARTS   AGE   IP             NODE                                         NOMINATED NODE   READINESS GATES
dragonfly-0   2/2     Running   0          23m   10.12.41.189   ip-10-10-5-209.eu-north-1.compute.internal   <none>           <none>
dragonfly-1   2/2     Running   0          22m   10.12.9.68     ip-10-10-3-32.eu-north-1.compute.internal    <none>           <none>

Here dragonfly-0 is the master. On node ip-10-10-5-209.eu-north-1.compute.internal I just ran ifconfig ens5 down, and failover did not happen. dragonfly.txt

Kubernetes version 1.31, dragonfly operator v1.1.10. Everything works with v1.1.8.

sergeij avatar Mar 27 '25 08:03 sergeij

We just released a new version, v1.1.11, which contains some failover monitoring improvements. Please try it. If you are still facing any issues, I will take a look and fix them.

Abhra303 avatar Apr 09 '25 07:04 Abhra303

Thanks @Abhra303, I will do that during the next month and come back to you.

Kirgod avatar Apr 16 '25 07:04 Kirgod

I wasn't able to fully reproduce the situation where the pod gets stuck in Pending status, as in the first message. This time everything ran successfully, it just took quite a long time (8 minutes). But it's possible that happened because the operator was running on the same node as the Dragonfly master, so when I shut down that node, both went down at once.

I20250421 08:21:25.601224    11 version_monitor.cc:174] Your current version '1.28.1' is not the latest version. A newer version '1.28.2' is now available. Please consider an update.
I20250421 08:21:37.808264    11 server_family.cc:2938] Replicating 172.16.20.70:9999
I20250421 08:21:37.815151    11 replica.cc:574] Started full sync with 172.16.20.70:9999
I20250421 08:21:37.815819    11 replica.cc:594] full sync finished in 3 ms
I20250421 08:21:37.815891    11 replica.cc:684] Transitioned into stable sync
W20250421 08:23:02.748415    11 common.cc:400] ReportError: Software caused connection abort
I20250421 08:23:02.749001    11 replica.cc:708] Exit stable sync
W20250421 08:23:02.749046    11 replica.cc:259] Error stable sync with 172.16.20.70:9999 system:103 Software caused connection abort
W20250421 08:23:03.249989    11 replica.cc:211] Error connecting to 172.16.20.70:9999 system:111
W20250421 08:23:03.751040    11 replica.cc:211] Error connecting to 172.16.20.70:9999 system:111
W20250421 08:23:04.252092    11 replica.cc:211] Error connecting to 172.16.20.70:9999 system:111
W20250421 08:23:04.753152    11 replica.cc:211] Error connecting to 172.16.20.70:9999 system:111
W20250421 08:23:06.253674    11 replica.cc:211] Error connecting to 172.16.20.70:9999 system:125
W20250421 08:23:07.754175    11 replica.cc:211] Error connecting to 172.16.20.70:9999 system:125
...
W20250421 08:24:01.771023    11 replica.cc:211] Error connecting to 172.16.20.70:9999 system:125
W20250421 08:24:03.271564    11 replica.cc:211] Error connecting to 172.16.20.70:9999 system:125
I20250421 08:24:04.480273    12 server_family.cc:2938] Replicating 172.16.22.84:9999
W20250421 08:24:04.480374    11 common.cc:400] ReportError: Operation canceled: ExecutionState cancelled
W20250421 08:24:04.772002    11 replica.cc:211] Error connecting to 172.16.20.70:9999 system:125
I20250421 08:24:04.781895    12 replica.cc:574] Started full sync with 172.16.22.84:9999
I20250421 08:24:04.782686    12 replica.cc:594] full sync finished in 6 ms
I20250421 08:24:04.782757    12 replica.cc:684] Transitioned into stable sync
I20250421 08:25:00.001801    12 save_stages_controller.cc:342] Saving "/dragonfly/snapshots/dump-summary.dfs" finished after 1 s
W20250421 08:25:03.157683    12 common.cc:400] ReportError: Software caused connection abort
I20250421 08:25:03.158156    12 replica.cc:708] Exit stable sync
W20250421 08:25:03.158183    12 replica.cc:259] Error stable sync with 172.16.22.84:9999 system:103 Software caused connection abort
W20250421 08:25:03.659197    12 replica.cc:211] Error connecting to 172.16.22.84:9999 system:111
W20250421 08:25:04.160629    12 replica.cc:211] Error connecting to 172.16.22.84:9999 system:111
W20250421 08:25:04.661645    12 replica.cc:211] Error connecting to 172.16.22.84:9999 system:111
W20250421 08:25:05.162596    12 replica.cc:211] Error connecting to 172.16.22.84:9999 system:111
W20250421 08:25:06.663034    12 replica.cc:211] Error connecting to 172.16.22.84:9999 system:125
W20250421 08:25:08.163501    12 replica.cc:211] Error connecting to 172.16.22.84:9999 system:125
W20250421 08:25:09.664011    12 replica.cc:211] Error connecting to 172.16.22.84:9999 system:125
W20250421 08:25:11.164708    12 replica.cc:211] Error connecting to 172.16.22.84:9999 system:125
...
W20250421 08:31:33.799101    12 replica.cc:211] Error connecting to 172.16.22.84:9999 system:125
I20250421 08:31:35.288286    12 server_family.cc:2938] Replicating NO ONE
W20250421 08:31:35.288419    12 common.cc:400] ReportError: Operation canceled: ExecutionState cancelled
W20250421 08:31:35.299602    12 replica.cc:211] Error connecting to 172.16.22.84:9999 system:125
I20250421 08:31:35.799849    11 dflycmd.cc:647] Registered replica 172.16.26.187:6379
I20250421 08:31:35.803916    11 dflycmd.cc:345] Started sync with replica 172.16.26.187:6379

Next week I’ll be able to try it on the second cluster.

BTW, an 8-minute new-master election is also kind of a critical problem.

Kirgod avatar Apr 21 '25 08:04 Kirgod

Hi @Abhra303, so far I haven't been able to reproduce this issue with the new version. The only problem, as I mentioned earlier, is when the operator runs on the same node as the master, and I have a question about that: is it acceptable to scale the operator to multiple replicas?

Kirgod avatar May 06 '25 11:05 Kirgod

I was just coming to open a ticket to mention the same thing, but I'll comment here instead for now since it was likely at play here. The fatal flaw with how HA works at the moment is that it depends on the operator being operational to change the labels for the service/pod switch. If the current master and the operator share the same node and that node dies, the relabeling cannot happen until the operator has been rescheduled somewhere else and started, and from my brief attempt to scale the operator, it does not currently support multiple replicas.

In an ideal world I would make sure they are not scheduled on the same node, but in a simple 2-3 node cluster that can't be guaranteed without creating dedicated nodes/node pools for the operator. The same applies to regional outages: the operator could be on a different node but in the same region as the master, and then there will be trouble too.

I like the simplicity of this HA model compared to having to run Redis Cluster or Sentinels, particularly when I have to host a service that didn't think to support Sentinels or clusters. But a step in the right direction would be the ability to scale the operator so that it is also highly available, since it is key to the failover process.
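
For what it's worth, a minimal sketch of keeping the operator away from the Dragonfly nodes via podAntiAffinity on the operator Deployment's pod template. It assumes the Dragonfly pods carry the app=dragonfly-foo label from the CR above; the operator Deployment name and namespace depend on your install.

# fragment of the operator Deployment's pod template spec
spec:
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: app
              operator: In
              values:
              - dragonfly-foo
          topologyKey: "kubernetes.io/hostname"

Using preferred rather than required scheduling keeps the operator schedulable even on the small 2-3 node clusters mentioned above, at the cost of not guaranteeing separation.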

micahnz avatar Jul 30 '25 11:07 micahnz

@micahnz you can run a couple of operator replicas with leader election; this way you will reduce the switchover time when one of them is unavailable.
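
A sketch of what that could look like. It assumes the operator manager exposes the standard controller-runtime --leader-elect flag and the default Deployment name and namespace; verify both against your install.

# run two operator replicas; only the elected leader reconciles,
# and the standby takes over if the leader's node dies
kubectl -n dragonfly-operator-system scale deployment/dragonfly-operator-controller-manager --replicas=2

# confirm leader election is enabled on the manager container
kubectl -n dragonfly-operator-system get deployment dragonfly-operator-controller-manager \
  -o jsonpath='{.spec.template.spec.containers[*].args}'
# expect --leader-elect (or --leader-elect=true) among the args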

iyuroch avatar Jul 30 '25 21:07 iyuroch

Thanks @iyuroch, I just realized I did something dumb at the end of the day yesterday, which is why I couldn't get two operators running.

Worth mentioning in the docs though, I think.

micahnz avatar Jul 31 '25 06:07 micahnz

I just hit the same issue today: my master dragonfly-0 got OOM-killed (which I need to diagnose on its own). The replica was up to date, but it seems that the old master got up and running again fast enough that the operator didn't have time to set dragonfly-1 as master.

operator-logs.txt

dragonfly-1-logs.txt

What makes me suspect this is the following section:

# old master gets killed
W20250925 10:22:17.559451    11 replica.cc:233] Error connecting to 10.0.2.39:9999 (phase: TCP_CONNECTING): system:111, reason: Connection refused

...

# starts sync with the new, empty dragonfly-0?!
I20250925 10:22:31.007472    11 replica.cc:623] Started full sync with 10.0.2.39:9999

# coming from the operator I suppose, but a bit late...
I20250925 10:22:31.008371    11 server_family.cc:3400] Replicating NO ONE

I might be completely off here so any insight is welcome.

abustany avatar Sep 25 '25 11:09 abustany