Consider load time in healthcheck
Describe the bug
When we perform a helm upgrade on our Dragonfly instances, the rollout replaces one pod at a time, performing only a rudimentary health check (a ping) on the new pod before killing the previous one. However, the new pod still needs to load an entire snapshot into memory, which can take up to 4 minutes.
This results in 4 minutes of downtime for us. While the snapshot is loading, Dragonfly answers every request with an error ("Dragonfly is loading the dataset in memory"). It seems reasonable for the readiness check to take this into account.
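For illustration, this is roughly what a client sees during the load window (the pod address is hypothetical, and the exact redis-cli error formatting may differ):

```sh
$ redis-cli -h 10.0.0.42 -p 6379 ping
(error) LOADING Dragonfly is loading the dataset in memory
```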
As discussed on Discord: https://discord.com/channels/981533931486724126/1421202349451382896
Thanks!
To Reproduce
Steps to reproduce the behavior: we are running 3 Dragonfly pods (250Gi of memory each) with the Dragonfly operator. The PersistentVolume is also 250Gi, of which we are currently using around 140Gi.
Expected behavior
The pod should become ready only once the snapshot load has finished.
Environment (please complete the following information):
- OS: Ubuntu 22.04.5 LTS
- Kernel: Linux mypod 5.15.0-1086-azure #95-Ubuntu SMP Thu Mar 27 17:39:52 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
- Containerized?: Kubernetes
- Dragonfly Version: v1.30.1
- Dragonfly Operator Helm chart: v1.1.11
Duplicate of #5881 and https://github.com/dragonflydb/dragonfly-operator/issues/397
In my opinion, the correct approach is to implement this at the operator/helm chart level, and configure a custom readiness probe for k8s. We'd gladly review a contribution that adds such a probe.
Something along these lines, in the pod spec (note that the volume name must match between volumeMounts and volumes):

```yaml
spec:
  containers:
    - name: dragonfly
      volumeMounts:
        - name: health-script-volume
          mountPath: /scripts
      # Gate readiness on the script: kubelet keeps the pod out of rotation
      # until the probe succeeds.
      startupProbe:
        exec:
          command:
            - /scripts/readiness-check.sh
        periodSeconds: 5
        failureThreshold: 60   # tolerate up to ~5 minutes of snapshot loading
  volumes:
    - name: health-script-volume
      configMap:
        name: health-script-cm
        defaultMode: 0755      # the probe needs the script to be executable
```
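And a minimal sketch of what /scripts/readiness-check.sh (shipped via the health-script-cm ConfigMap) could look like, assuming redis-cli is available in the image; the DRAGONFLY_PORT variable is hypothetical:

```sh
#!/bin/sh
# Succeed only once Dragonfly answers PING with PONG. While a snapshot is
# still being loaded, the command instead returns the error
# "LOADING Dragonfly is loading the dataset in memory".
set -u

# Capture both streams; redis-cli may exit non-zero before the server is up.
response=$(redis-cli -p "${DRAGONFLY_PORT:-6379}" ping 2>&1 || true)

if [ "$response" = "PONG" ]; then
    exit 0  # snapshot fully loaded: the pod can take traffic
fi

echo "not ready yet: $response" >&2
exit 1
```

A startupProbe (rather than only a readinessProbe) also suspends the regular liveness and readiness probes until it first succeeds, which fits multi-minute snapshot loads.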