argo-cd
Argo fails to detect pod ready state for some operators
Checklist:
- [x] I've searched in the docs and FAQ for my answer: https://bit.ly/argocd-faq.
- [x] I've included steps to reproduce the bug.
- [x] I've pasted the output of argocd version.
Describe the bug
When using the ArangoDB operator with ArgoCD, any pod created (and attached to the custom resource) gets stuck in the "Progressing" state forever, even though kube describe pod <pod-id> shows a ready state:
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Discussion in Slack: https://cloud-native.slack.com/archives/C01TSERG0KZ/p1631432657161800
To Reproduce
- Install the ArangoDB operator in your cluster
export URLPREFIX=https://github.com/arangodb/kube-arangodb/releases/download/1.2.2
helm install $URLPREFIX/kube-arangodb-crd-1.2.2.tgz
helm install $URLPREFIX/kube-arangodb-1.2.2.tgz
- Apply the following demo project in ArgoCD:
It will load all the examples from https://github.com/arangodb/kube-arangodb into the default namespace.
project: default
source:
repoURL: 'https://github.com/arangodb/kube-arangodb'
path: examples
targetRevision: HEAD
destination:
server: 'https://kubernetes.default.svc'
namespace: default
- Observe the lifecycle of the pods/svcs/endpoints generated by the operator within ArgoCD.
Expected behavior
The pod is expected to show as Healthy in the ArgoCD UI once it is running with the Ready state set to true.
Version
v2.1.2+7af9dfb
Hmm, I think this is the same issue as https://github.com/argoproj/argo-cd/issues/7182. This operator sets the restart policy to "Never", and ArgoCD keeps such pods in a Progressing state.
Inside getCorev1PodHealth:
	case corev1.PodRunning:
		switch pod.Spec.RestartPolicy {
		case corev1.RestartPolicyAlways:
			// if pod is ready, it is automatically healthy
			if podutils.IsPodReady(pod) {
				return &HealthStatus{
					Status:  HealthStatusHealthy,
					Message: pod.Status.Message,
				}, nil
			}
			// if it's not ready, check to see if any container terminated, if so, it's degraded
			for _, ctrStatus := range pod.Status.ContainerStatuses {
				if ctrStatus.LastTerminationState.Terminated != nil {
					return &HealthStatus{
						Status:  HealthStatusDegraded,
						Message: pod.Status.Message,
					}, nil
				}
			}
			// otherwise we are progressing towards a ready state
			return &HealthStatus{
				Status:  HealthStatusProgressing,
				Message: pod.Status.Message,
			}, nil
		case corev1.RestartPolicyOnFailure, corev1.RestartPolicyNever:
			// pods set with a restart policy of OnFailure or Never, have a finite life.
			// These pods are typically resource hooks. Thus, we consider these as Progressing
			// instead of healthy.
			return &HealthStatus{
				Status:  HealthStatusProgressing,
				Message: pod.Status.Message,
			}, nil
		}
	}
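To make the effect of that branch concrete, here is a minimal, self-contained sketch that mirrors the excerpt's decision structure for a pod in the Running phase. The Pod struct, its boolean fields, the string statuses, and the function name runningPodHealth are simplified stand-ins invented for illustration; they are not the real Kubernetes or ArgoCD types. The "Unknown" fallback is also an assumption, since the excerpt does not show what follows the switch.

```go
package main

import "fmt"

// RestartPolicy is a simplified stand-in for corev1.RestartPolicy (assumption).
type RestartPolicy string

const (
	RestartPolicyAlways    RestartPolicy = "Always"
	RestartPolicyOnFailure RestartPolicy = "OnFailure"
	RestartPolicyNever     RestartPolicy = "Never"
)

// Pod is a hypothetical, flattened view of a Running pod.
type Pod struct {
	RestartPolicy       RestartPolicy
	Ready               bool // stands in for podutils.IsPodReady(pod)
	ContainerTerminated bool // stands in for any LastTerminationState.Terminated != nil
}

// runningPodHealth mirrors the branch structure of the excerpt above.
func runningPodHealth(p Pod) string {
	switch p.RestartPolicy {
	case RestartPolicyAlways:
		if p.Ready {
			return "Healthy"
		}
		if p.ContainerTerminated {
			return "Degraded"
		}
		return "Progressing"
	case RestartPolicyOnFailure, RestartPolicyNever:
		// Readiness is never consulted on this branch.
		return "Progressing"
	}
	return "Unknown" // assumed fallback, not shown in the excerpt
}

func main() {
	fmt.Println(runningPodHealth(Pod{RestartPolicy: RestartPolicyAlways, Ready: true}))
	fmt.Println(runningPodHealth(Pod{RestartPolicy: RestartPolicyNever, Ready: true}))
}
```

With these stand-ins, a ready pod with restart policy Always evaluates to Healthy, while an equally ready pod with restart policy Never evaluates to Progressing, which matches what the bug report describes.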
Root cause inside the ArangoDB operator: https://github.com/arangodb/kube-arangodb/blob/13f3e2a09b4c6c08f050efffc364d498b1293dcf/pkg/util/k8sutil/pods.go#L433
Is there a way to have ArgoCD still consider such pods healthy, for example through a custom setting?
What is ArangoDB's rationale for using Never? I believe that question has to be explored.
From the linked ticket:
We do not want to allow Pod restarts, full lifecycle is managed by Operator (Operator recreate pod, takes care about shards).
We need to have a quick discussion about this. I will bring it up in tomorrow's maintainer meeting.
The same issue: https://github.com/argoproj/argo-cd/issues/7182
@wanghong230, any updates from the maintainer meeting?
I have this same issue when using the Spark Operator. The driver and executors have a restart policy of never and continue to show progressing when the pod state is running.
There are many operators that behave this way, including Koperator and NiFiKop. This behavior should at least be configurable through an Application/ApplicationSet.
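As a rough illustration of what "configurable" could mean here, the sketch below adds an opt-in flag to a simplified health check. Everything in it is hypothetical: the HealthOptions type, the ReadyMeansHealthy field, and the runningPodHealthWithOptions function do not exist in ArgoCD, and the Degraded branch is left out for brevity. It only shows the shape of the idea and is not a proposed implementation.

```go
package main

import "fmt"

// Pod is a hypothetical, flattened view of a Running pod.
type Pod struct {
	RestartPolicy string // "Always", "OnFailure", or "Never"
	Ready         bool   // stands in for podutils.IsPodReady(pod)
}

// HealthOptions is a hypothetical per-application setting; ArgoCD has no such type.
type HealthOptions struct {
	// ReadyMeansHealthy opts in to treating ready pods as Healthy
	// even when their restart policy is OnFailure or Never.
	ReadyMeansHealthy bool
}

// runningPodHealthWithOptions keeps the default behavior unless the flag is set.
func runningPodHealthWithOptions(p Pod, opts HealthOptions) string {
	if (p.RestartPolicy == "Always" || opts.ReadyMeansHealthy) && p.Ready {
		return "Healthy"
	}
	return "Progressing"
}

func main() {
	operatorPod := Pod{RestartPolicy: "Never", Ready: true}
	fmt.Println(runningPodHealthWithOptions(operatorPod, HealthOptions{}))
	fmt.Println(runningPodHealthWithOptions(operatorPod, HealthOptions{ReadyMeansHealthy: true}))
}
```

Keeping the flag off by default would preserve the current treatment of resource hooks, while operator-managed pods could opt in.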
I have the same issue with the Spark operator, too. Many operators set their pods' restart policy to Never.
... and also with the Task Manager container, that is managed by the Flink Operator.
https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/resource-providers/native_kubernetes/#pod-template
Related/duplicate issue: https://github.com/argoproj/argo-cd/issues/7182.