opensearch-k8s-operator
updateStrategy OnDelete on the nodes statefulset causes revision mismatch
We are using Prometheus and kube-state-metrics to monitor our cluster. One of our alert rules compares the number of current replicas with the number of ready replicas of each StatefulSet.
PromQL is as follows:
kube_statefulset_status_replicas_ready / kube_statefulset_status_replicas_current != 1
Seeing as spec.updateStrategy.type gets set to OnDelete and not RollingUpdate, we see the following in the status object:
status:
  observedGeneration: 9
  replicas: 2
  readyReplicas: 2
  updatedReplicas: 2
  currentRevision: opensearch-masters-5fdb995b4
  updateRevision: opensearch-masters-7fd87c8d8b
  collisionCount: 0
  availableReplicas: 2
currentRevision and updateRevision are not equal, and as such kube_statefulset_status_replicas_current reports 0 while it should be 2.
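To make the false positive concrete, here is a minimal Python sketch of how the alert expression evaluates against the status above. It assumes that kube_statefulset_status_replicas_current mirrors status.currentReplicas and that a missing field counts as 0; the helper names are hypothetical and only mimic PromQL's float division.

```python
import math


def replicas_current(status):
    # Assumption: the metric mirrors status.currentReplicas, and the
    # field is simply absent from the status object when it is 0.
    return status.get("currentReplicas", 0)


def alert_ratio(status):
    """Mimic `replicas_ready / replicas_current` with float semantics."""
    ready = status.get("readyReplicas", 0)
    current = replicas_current(status)
    if current == 0:
        # Float division by zero yields +Inf, or NaN for 0/0.
        return math.nan if ready == 0 else math.inf
    return ready / current


def alert_fires(status):
    # `!= 1` keeps every sample whose value differs from 1,
    # which includes +Inf and NaN.
    return alert_ratio(status) != 1


# Status as posted above: currentReplicas is missing.
reported = {"replicas": 2, "readyReplicas": 2, "updatedReplicas": 2}
# What the status would look like once the revisions agree.
converged = dict(reported, currentReplicas=2)

print(alert_fires(reported))   # the rule fires although both pods are ready
print(alert_fires(converged))  # the rule stays quiet
```

With the reported status the ratio is 2 / 0, so the rule fires even though both pods are ready.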
I have tried deleting the pods manually, but this does not seem to change anything. As a result, that alert rule produces a false positive. More information here: https://github.com/kubernetes/kube-state-metrics/issues/1324
Is there a specific reason spec.updateStrategy.type is set to OnDelete? By my understanding, setting that to RollingUpdate should fix the issue.
Hi @gk-mevers. updateStrategy is set to OnDelete to allow the operator to execute rolling restarts and upgrades in a controlled fashion. This lets the operator drain a node before restarting or upgrading it, and wait for cluster health in between. It basically moves control over when to replace a pod from the Kubernetes control plane to the operator. Using RollingUpdate would not give us that level of control.
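As a rough illustration of that control flow (not the operator's actual code, which is written in Go), the sequence could be sketched in Python as follows. The function and its three callbacks are hypothetical placeholders for whatever the operator really does to drain a node, delete a pod, and check cluster health.

```python
from typing import Callable, Iterable, List


def controlled_restart(
    pods: Iterable[str],
    drain: Callable[[str], None],
    delete_pod: Callable[[str], None],
    cluster_healthy: Callable[[], bool],
) -> List[str]:
    """Restart pods one at a time, only while the cluster is healthy.

    With OnDelete, a pod is only recreated from the new revision after
    something deletes it, so the caller decides the timing.
    """
    restarted = []
    for pod in pods:
        if not cluster_healthy():
            # Stop here; the remaining pods are handled on a later pass.
            break
        drain(pod)       # move data away from the node first
        delete_pod(pod)  # the StatefulSet controller recreates the pod
        restarted.append(pod)
    return restarted


# Example run with stub callbacks that only record what happened.
events = []
health_checks = iter([True, False])
done = controlled_restart(
    ["opensearch-masters-0", "opensearch-masters-1"],
    drain=lambda pod: events.append(("drain", pod)),
    delete_pod=lambda pod: events.append(("delete", pod)),
    cluster_healthy=lambda: next(health_checks),
)
print(done)
```

In this run the second health check fails, so only the first pod is drained and deleted; RollingUpdate offers no hook for that kind of decision.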
I was able to reproduce your observations. Based on the issue you linked, I believe that Kubernetes does not automatically set currentRevision to updateRevision under OnDelete, even when all pods are up to date. To address this, I think we need to extend the operator so that it updates the revision of a StatefulSet after it has completed its work.
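One way such an extension might decide when to update the revision is sketched below in Python, using plain dictionaries instead of a Kubernetes client. This is an assumption about the shape of the fix, not the operator's implementation: the function names are hypothetical, and the only Kubernetes detail relied on is the controller-revision-hash label that the StatefulSet controller puts on its pods.

```python
REVISION_LABEL = "controller-revision-hash"


def all_pods_on_update_revision(pods, update_revision):
    """True if there is at least one pod and every pod runs the update revision."""
    return bool(pods) and all(
        pod.get("labels", {}).get(REVISION_LABEL) == update_revision
        for pod in pods
    )


def reconciled_status(status, pods):
    """Return a copy of the status, converged if all pods are up to date."""
    new_status = dict(status)
    update_revision = status["updateRevision"]
    if all_pods_on_update_revision(pods, update_revision):
        # Hypothetical fix: promote the update revision to current.
        new_status["currentRevision"] = update_revision
        new_status["currentReplicas"] = status.get("updatedReplicas", 0)
    return new_status


status = {
    "replicas": 2,
    "readyReplicas": 2,
    "updatedReplicas": 2,
    "currentRevision": "opensearch-masters-5fdb995b4",
    "updateRevision": "opensearch-masters-7fd87c8d8b",
}
pods = [
    {"name": "opensearch-masters-0",
     "labels": {REVISION_LABEL: "opensearch-masters-7fd87c8d8b"}},
    {"name": "opensearch-masters-1",
     "labels": {REVISION_LABEL: "opensearch-masters-7fd87c8d8b"}},
]

fixed = reconciled_status(status, pods)
print(fixed["currentRevision"], fixed["currentReplicas"])
```

If any pod still carries the old hash, the status is returned unchanged, so the revision is only promoted once the operator's rollout has actually finished.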
I'll mark this ticket as an enhancement. Should you have the time and inclination to have a go at it, PRs are always welcome.
This is fixed by #614, right?