argo-cd Argo fails to detect pod ready state for some operators

Checklist:

[x] I've searched in the docs and FAQ for my answer: https://bit.ly/argocd-faq.
[x] I've included steps to reproduce the bug.
[x] I've pasted the output of argocd version.

Describe the bug

When using the ArangoDB operator with ArgoCD, any pod the operator creates (and attaches to the custom resource) gets stuck in the "Progressing" state forever, even though kubectl describe pod <pod-id> shows the pod as ready:

Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True

Discussion in Slack: https://cloud-native.slack.com/archives/C01TSERG0KZ/p1631432657161800

To Reproduce

Install the ArangoDB operator in your cluster:

export URLPREFIX=https://github.com/arangodb/kube-arangodb/releases/download/1.2.2
helm install $URLPREFIX/kube-arangodb-crd-1.2.2.tgz
helm install $URLPREFIX/kube-arangodb-1.2.2.tgz

Apply the following demo project in ArgoCD. It loads all the examples from https://github.com/arangodb/kube-arangodb into the default namespace:

project: default
source:
  repoURL: 'https://github.com/arangodb/kube-arangodb'
  path: examples
  targetRevision: HEAD
destination:
  server: 'https://kubernetes.default.svc'
  namespace: default

Observe the lifecycle of the pods/svcs/endpoints generated by the operator within ArgoCD.

Expected behavior

The pod is expected to show as Healthy in the ArgoCD UI once it is running and its Ready condition is True.

Version

v2.1.2+7af9dfb

Sep 19 '21 22:09 sarahhenkens

Hmm, I think this is the same issue as https://github.com/argoproj/argo-cd/issues/7182. The operator sets the pod restart policy to Never, and ArgoCD keeps such pods in the Progressing state.

Inside getCorev1PodHealth:

	case corev1.PodRunning:
		switch pod.Spec.RestartPolicy {
		case corev1.RestartPolicyAlways:
			// if pod is ready, it is automatically healthy
			if podutils.IsPodReady(pod) {
				return &HealthStatus{
					Status:  HealthStatusHealthy,
					Message: pod.Status.Message,
				}, nil
			}
			// if it's not ready, check to see if any container terminated, if so, it's degraded
			for _, ctrStatus := range pod.Status.ContainerStatuses {
				if ctrStatus.LastTerminationState.Terminated != nil {
					return &HealthStatus{
						Status:  HealthStatusDegraded,
						Message: pod.Status.Message,
					}, nil
				}
			}
			// otherwise we are progressing towards a ready state
			return &HealthStatus{
				Status:  HealthStatusProgressing,
				Message: pod.Status.Message,
			}, nil
		case corev1.RestartPolicyOnFailure, corev1.RestartPolicyNever:
			// pods set with a restart policy of OnFailure or Never, have a finite life.
			// These pods are typically resource hooks. Thus, we consider these as Progressing
			// instead of healthy.
			return &HealthStatus{
				Status:  HealthStatusProgressing,
				Message: pod.Status.Message,
			}, nil
		}
	}
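To make the effect of that switch easier to see, here is a minimal, dependency-free sketch of the same decision order. The types (Pod, RestartPolicy, HealthStatus) and the function name runningPodHealth are simplified stand-ins I made up for illustration, not the real Kubernetes or ArgoCD types, and the Unknown fallback is an assumption because the excerpt does not show what happens after the switch.

```go
package main

import "fmt"

// Simplified stand-ins for the types used in the excerpt above (assumptions,
// not the real corev1 / ArgoCD definitions).
type RestartPolicy string
type HealthStatus string

const (
	RestartPolicyAlways    RestartPolicy = "Always"
	RestartPolicyOnFailure RestartPolicy = "OnFailure"
	RestartPolicyNever     RestartPolicy = "Never"

	HealthStatusHealthy     HealthStatus = "Healthy"
	HealthStatusDegraded    HealthStatus = "Degraded"
	HealthStatusProgressing HealthStatus = "Progressing"
	HealthStatusUnknown     HealthStatus = "Unknown"
)

// Pod models only the fields the running-pod branch looks at.
type Pod struct {
	RestartPolicy          RestartPolicy
	Ready                  bool // stands in for podutils.IsPodReady(pod)
	AnyContainerTerminated bool // stands in for the LastTerminationState loop
}

// runningPodHealth mirrors the decision order of the excerpt for a pod whose
// phase is Running.
func runningPodHealth(p Pod) HealthStatus {
	switch p.RestartPolicy {
	case RestartPolicyAlways:
		if p.Ready {
			return HealthStatusHealthy
		}
		if p.AnyContainerTerminated {
			return HealthStatusDegraded
		}
		return HealthStatusProgressing
	case RestartPolicyOnFailure, RestartPolicyNever:
		// Readiness is never consulted on this branch, so a Ready pod
		// still reports Progressing.
		return HealthStatusProgressing
	}
	// Assumption: the excerpt does not show the fallback for other values.
	return HealthStatusUnknown
}

func main() {
	for _, policy := range []RestartPolicy{RestartPolicyAlways, RestartPolicyOnFailure, RestartPolicyNever} {
		pod := Pod{RestartPolicy: policy, Ready: true}
		fmt.Printf("restartPolicy=%s ready=true -> %s\n", policy, runningPodHealth(pod))
	}
}
```

The only input that differs between the three pods in main is the restart policy, which is why the operator-managed pods never leave Progressing even though their Ready condition is True.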

Sep 20 '21 02:09 sarahhenkens

Root cause inside the ArangoDB operator: https://github.com/arangodb/kube-arangodb/blob/13f3e2a09b4c6c08f050efffc364d498b1293dcf/pkg/util/k8sutil/pods.go#L433

Is there a way, such as a custom setting, to have ArgoCD still consider these pods healthy?

Sep 20 '21 02:09 sarahhenkens

What is ArangoDB's rationale for setting the restart policy to Never? I believe that question has to be explored.

Sep 20 '21 17:09 wanghong230

From the linked ticket:

We do not want to allow Pod restarts, full lifecycle is managed by Operator (Operator recreate pod, takes care about shards).

Sep 21 '21 02:09 sarahhenkens

We need to have a quick discussion about this. I will bring it up in tomorrow's maintainer meeting.

Sep 23 '21 03:09 wanghong230

The same issue: https://github.com/argoproj/argo-cd/issues/7182

Sep 23 '21 16:09 wanghong230

@wanghong230, any updates from the maintainer meeting?

Oct 10 '21 04:10 sarahhenkens

I have this same issue when using the Spark Operator. The driver and executors have a restart policy of Never and continue to show Progressing while the pod phase is Running.

May 11 '22 16:05 michael-barker

There are many operators that behave this way including Koperator and NiFiKop. This behavior should at least be configurable through an Application/ApplicationSet.
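As a rough illustration of what "configurable" could mean here, the sketch below adds an opt-in flag to a simplified running-pod check. Everything in it is hypothetical: HealthOptions, ReadyFinitePodsAreHealthy, and runningPodHealth are names invented for this example and are not existing ArgoCD settings or APIs. The Degraded case is left out to keep the sketch short.

```go
package main

import "fmt"

// Simplified stand-ins, not the real corev1 / ArgoCD types.
type RestartPolicy string
type HealthStatus string

const (
	RestartPolicyAlways    RestartPolicy = "Always"
	RestartPolicyOnFailure RestartPolicy = "OnFailure"
	RestartPolicyNever     RestartPolicy = "Never"

	HealthStatusHealthy     HealthStatus = "Healthy"
	HealthStatusProgressing HealthStatus = "Progressing"
)

// HealthOptions is hypothetical; ArgoCD does not expose this struct.
type HealthOptions struct {
	// ReadyFinitePodsAreHealthy opts in to reporting Ready pods as Healthy
	// even when their restart policy is OnFailure or Never.
	ReadyFinitePodsAreHealthy bool
}

// runningPodHealth evaluates a pod in phase Running. With the zero-value
// options it keeps today's behavior: only Always pods can become Healthy.
func runningPodHealth(policy RestartPolicy, ready bool, opts HealthOptions) HealthStatus {
	if ready && (policy == RestartPolicyAlways || opts.ReadyFinitePodsAreHealthy) {
		return HealthStatusHealthy
	}
	return HealthStatusProgressing
}

func main() {
	defaults := HealthOptions{}
	optIn := HealthOptions{ReadyFinitePodsAreHealthy: true}

	fmt.Println("default:", runningPodHealth(RestartPolicyNever, true, defaults))
	fmt.Println("opt-in: ", runningPodHealth(RestartPolicyNever, true, optIn))
}
```

Keeping the flag off by default would preserve the current treatment of resource hooks, while operators that manage the pod lifecycle themselves could opt in per Application.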

Aug 23 '22 13:08 mh013370

I have the same issue with the Spark Operator, too. Many operators set their pods' restart policy to Never.

Nov 08 '22 08:11 trasyia

... and also with the Task Manager container, which is managed by the Flink Operator.

https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/resource-providers/native_kubernetes/#pod-template

Dec 25 '23 16:12 lintong

Related/duplicate issue: https://github.com/argoproj/argo-cd/issues/7182.

Apr 25 '24 08:04 mikejoh
