Argo fails to detect pod ready state for some operators

Open sarahhenkens opened this issue 3 years ago • 12 comments

Checklist:

  • [x] I've searched in the docs and FAQ for my answer: https://bit.ly/argocd-faq.
  • [x] I've included steps to reproduce the bug.
  • [x] I've pasted the output of argocd version.

Describe the bug

When using the ArangoDB operator with ArgoCD, any pod it creates (and attaches to the custom resource) gets stuck in the "Progressing" state indefinitely, even though kubectl describe pod <pod-id> shows the pod as ready:

Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 

Discussion in Slack: https://cloud-native.slack.com/archives/C01TSERG0KZ/p1631432657161800

To Reproduce

  • Install the ArangoDB operator in your cluster:
export URLPREFIX=https://github.com/arangodb/kube-arangodb/releases/download/1.2.2
helm install $URLPREFIX/kube-arangodb-crd-1.2.2.tgz
helm install $URLPREFIX/kube-arangodb-1.2.2.tgz
  • Apply the following demo project in ArgoCD:

It will load all the examples from https://github.com/arangodb/kube-arangodb into the default namespace.

project: default
source:
  repoURL: 'https://github.com/arangodb/kube-arangodb'
  path: examples
  targetRevision: HEAD
destination:
  server: 'https://kubernetes.default.svc'
  namespace: default
  • Observe the lifecycle of the pods/svcs/endpoints generated by the operator within ArgoCD.

Expected behavior

The pod is expected to show as Healthy in the ArgoCD UI once it is running and its Ready condition is True.

Version

v2.1.2+7af9dfb

sarahhenkens avatar Sep 19 '21 22:09 sarahhenkens

Hmm, I think this is the same issue as https://github.com/argoproj/argo-cd/issues/7182. This operator sets the restart policy to "Never", and ArgoCD keeps such pods in a Progressing state.

Inside getCorev1PodHealth:

	case corev1.PodRunning:
		switch pod.Spec.RestartPolicy {
		case corev1.RestartPolicyAlways:
			// if pod is ready, it is automatically healthy
			if podutils.IsPodReady(pod) {
				return &HealthStatus{
					Status:  HealthStatusHealthy,
					Message: pod.Status.Message,
				}, nil
			}
			// if it's not ready, check to see if any container terminated, if so, it's degraded
			for _, ctrStatus := range pod.Status.ContainerStatuses {
				if ctrStatus.LastTerminationState.Terminated != nil {
					return &HealthStatus{
						Status:  HealthStatusDegraded,
						Message: pod.Status.Message,
					}, nil
				}
			}
			// otherwise we are progressing towards a ready state
			return &HealthStatus{
				Status:  HealthStatusProgressing,
				Message: pod.Status.Message,
			}, nil
		case corev1.RestartPolicyOnFailure, corev1.RestartPolicyNever:
			// pods set with a restart policy of OnFailure or Never, have a finite life.
			// These pods are typically resource hooks. Thus, we consider these as Progressing
			// instead of healthy.
			return &HealthStatus{
				Status:  HealthStatusProgressing,
				Message: pod.Status.Message,
			}, nil
		}
	}

sarahhenkens avatar Sep 20 '21 02:09 sarahhenkens

Root cause inside the ArangoDB operator: https://github.com/arangodb/kube-arangodb/blob/13f3e2a09b4c6c08f050efffc364d498b1293dcf/pkg/util/k8sutil/pods.go#L433

Is there a way, such as a custom setting, to have ArgoCD still consider these pods healthy?
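To make the behaviour concrete, here is a minimal, self-contained sketch of the branch quoted above, next to one possible opt-in variant. It is not Argo CD code. The types are simplified stand-ins for the corev1 and health types, and the `OperatorManaged` field and `proposedHealth` function are hypothetical names used only for illustration.

```go
package main

import "fmt"

// Simplified stand-ins for the Kubernetes / Argo CD types in the quoted
// snippet. They are assumptions and model only the fields needed here.
type RestartPolicy string

const (
	RestartPolicyAlways    RestartPolicy = "Always"
	RestartPolicyOnFailure RestartPolicy = "OnFailure"
	RestartPolicyNever     RestartPolicy = "Never"
)

type HealthStatus string

const (
	HealthStatusHealthy     HealthStatus = "Healthy"
	HealthStatusProgressing HealthStatus = "Progressing"
	HealthStatusDegraded    HealthStatus = "Degraded"
)

// Pod models a pod that is already in the Running phase.
type Pod struct {
	RestartPolicy       RestartPolicy
	Ready               bool // value of the pod's Ready condition
	ContainerTerminated bool // some container has a last terminated state
	OperatorManaged     bool // hypothetical opt-in flag, not an Argo CD setting
}

// currentHealth mirrors the PodRunning branch quoted above.
func currentHealth(p Pod) HealthStatus {
	switch p.RestartPolicy {
	case RestartPolicyAlways:
		if p.Ready {
			return HealthStatusHealthy
		}
		if p.ContainerTerminated {
			return HealthStatusDegraded
		}
		return HealthStatusProgressing
	case RestartPolicyOnFailure, RestartPolicyNever:
		// Treated as finite-lifetime pods, so never reported as Healthy.
		return HealthStatusProgressing
	}
	// Not covered by the quoted snippet; Progressing is an assumption here.
	return HealthStatusProgressing
}

// proposedHealth is a hypothetical variant: a ready pod that has opted in is
// Healthy regardless of restart policy; everything else is unchanged.
func proposedHealth(p Pod) HealthStatus {
	if p.OperatorManaged && p.Ready {
		return HealthStatusHealthy
	}
	return currentHealth(p)
}

func main() {
	// A pod like the ones the ArangoDB operator creates: Never + Ready.
	p := Pod{RestartPolicy: RestartPolicyNever, Ready: true, OperatorManaged: true}
	fmt.Println(currentHealth(p))  // prints: Progressing
	fmt.Println(proposedHealth(p)) // prints: Healthy
}
```

The sketch keeps the existing behaviour for resource hooks, which do not opt in, while letting operator-managed pods report Healthy. How such an opt-in would actually be expressed (annotation, Application setting, or something else) is left open.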

sarahhenkens avatar Sep 20 '21 02:09 sarahhenkens

What is ArangoDB's rationale for setting the restart policy to Never? I believe that question has to be explored.

wanghong230 avatar Sep 20 '21 17:09 wanghong230

From the linked ticket:

We do not want to allow Pod restarts, full lifecycle is managed by Operator (Operator recreate pod, takes care about shards).

sarahhenkens avatar Sep 21 '21 02:09 sarahhenkens

We need to have a quick discussion about this. I will bring it up in tomorrow's maintainer meeting.

wanghong230 avatar Sep 23 '21 03:09 wanghong230

The same issue: https://github.com/argoproj/argo-cd/issues/7182

wanghong230 avatar Sep 23 '21 16:09 wanghong230

@wanghong230, any updates from the maintainer meeting?

sarahhenkens avatar Oct 10 '21 04:10 sarahhenkens

I have this same issue when using the Spark Operator. The driver and executors have a restart policy of Never and continue to show Progressing while the pod state is Running.

michael-barker avatar May 11 '22 16:05 michael-barker

There are many operators that behave this way including Koperator and NiFiKop. This behavior should at least be configurable through an Application/ApplicationSet.

mh013370 avatar Aug 23 '22 13:08 mh013370

I have the same issue with the Spark Operator, too. Many operators set their pods' restart policy to Never.

trasyia avatar Nov 08 '22 08:11 trasyia

... and also with the Task Manager container that is managed by the Flink Operator.

https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/resource-providers/native_kubernetes/#pod-template

lintong avatar Dec 25 '23 16:12 lintong

Related/duplicate issue: https://github.com/argoproj/argo-cd/issues/7182.

mikejoh avatar Apr 25 '24 08:04 mikejoh