Consider load time in healthcheck
Describe the bug
When we perform a helm upgrade on our Dragonfly instances, the rollout replaces one pod at a time, performing only a rudimentary health check (a ping) on the new pod before killing the previous one. However, the new pod still needs to load an entire snapshot into memory, which can take up to 4 minutes.
This results in 4 minutes of downtime for us. While the snapshot is loading, Dragonfly answers every request with an error ("Dragonfly is loading the dataset in memory"). It seems reasonable for the readiness check to take this into account.
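For illustration, this is roughly what a client sees during the load window (the pod address is hypothetical, and the exact redis-cli error formatting may differ):

```sh
$ redis-cli -h 10.0.0.42 -p 6379 ping
(error) LOADING Dragonfly is loading the dataset in memory
```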
As discussed on Discord: https://discord.com/channels/981533931486724126/1421202349451382896
Thanks!
To Reproduce
Steps to reproduce the behavior: we are running 3 Dragonfly pods (250Gi of memory each) with the Dragonfly operator. The PersistentVolume is also 250Gi, of which we are currently using around 140Gi.
Expected behavior
The pod should become ready only once the snapshot load has finished.
Environment (please complete the following information):
- OS: Ubuntu 22.04.5 LTS
- Kernel: Linux mypod 5.15.0-1086-azure #95-Ubuntu SMP Thu Mar 27 17:39:52 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
- Containerized?: Kubernetes
- Dragonfly Version: v1.30.1
- Dragonfly Operator Helm chart: v1.1.11
Duplicate of #5881 and https://github.com/dragonflydb/dragonfly-operator/issues/397
In my opinion, the correct approach is to implement this at the operator/helm chart level, and configure a custom readiness probe for k8s. We'd gladly review a contribution that adds such a probe.
Something along these lines, in the pod spec (note that the volume name must match between volumeMounts and volumes):

```yaml
spec:
  containers:
    - name: dragonfly
      volumeMounts:
        - name: health-script-volume
          mountPath: /scripts
      # Gate readiness on the script: kubelet keeps the pod out of rotation
      # until the probe succeeds.
      startupProbe:
        exec:
          command:
            - /scripts/readiness-check.sh
        periodSeconds: 5
        failureThreshold: 60   # tolerate up to ~5 minutes of snapshot loading
  volumes:
    - name: health-script-volume
      configMap:
        name: health-script-cm
        defaultMode: 0755      # the probe needs the script to be executable
```
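And a minimal sketch of what /scripts/readiness-check.sh (shipped via the health-script-cm ConfigMap) could look like, assuming redis-cli is available in the image; the DRAGONFLY_PORT variable is hypothetical:

```sh
#!/bin/sh
# Succeed only once Dragonfly answers PING with PONG. While a snapshot is
# still being loaded, the command instead returns the error
# "LOADING Dragonfly is loading the dataset in memory".
set -u

# Capture both streams; redis-cli may exit non-zero before the server is up.
response=$(redis-cli -p "${DRAGONFLY_PORT:-6379}" ping 2>&1 || true)

if [ "$response" = "PONG" ]; then
    exit 0  # snapshot fully loaded: the pod can take traffic
fi

echo "not ready yet: $response" >&2
exit 1
```

A startupProbe (rather than only a readinessProbe) also suspends the regular liveness and readiness probes until it first succeeds, which fits multi-minute snapshot loads.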