App health stuck at "Progressing" forever when pod has `restartPolicy: Never`

Open lindhe opened this issue 1 year ago • 11 comments

Checklist:

  • [x] I've searched in the docs and FAQ for my answer: https://bit.ly/argocd-faq.
  • [x] I've included steps to reproduce the bug.
  • [x] I've pasted the output of argocd version.

Describe the bug

When an Application deploys a pod that has restartPolicy: Never, the Application gets stuck with health "Progressing" seemingly forever.

To Reproduce

Apply this manifest:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  finalizers:
    - resources-finalizer.argocd.argoproj.io
  name: debug
  namespace: argocd
spec:
  destination:
    server: https://kubernetes.default.svc
    namespace: default
  source:
    path: manifests
    repoURL: https://github.com/lindhe/debug-argocd-15317.git

That app installs just this seemingly innocuous pod:

apiVersion: v1
kind: Pod
metadata:
  name: foo
spec:
  containers:
    - image: busybox
      command: ['sh', '-c', 'sleep 3600']
      name: bar
  restartPolicy: Never

Expected behavior

The application health should soon become "Healthy".

Screenshots

Screenshot 2023-09-01 083537

Version

argocd: v2.7.9+0ee33e5
  BuildDate: 2023-07-24T18:26:12Z
  GitCommit: 0ee33e52dd1f1bb944488584fc6f854b929f1180
  GitTreeState: clean
  GoVersion: go1.19.11
  Compiler: gc
  Platform: linux/amd64
WARN[0000] Failed to invoke grpc call. Use flag --grpc-web in grpc calls. To avoid this warning message, use flag --grpc-web.
argocd-server: v2.8.0+804d4b8

Logs

No relevant logs that I've found.

Other information

Similar to #5620 and some other issues found by searching for "health progressing".

I'm running an RKE2 cluster with version v1.26.6+rke2r1 and Cilium 1.13.2 installed.

lindhe avatar Sep 01 '23 07:09 lindhe

This seems to be by design, or rather there is nothing currently better than what there is (see the relevant snippet in gitops engine).
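For readers without the linked snippet at hand, the behaviour discussed in this thread can be paraphrased roughly as below. This is a simplified Go sketch, not the actual gitops-engine code: the `RestartPolicy` and `Health` types and the `runningPodHealth` function are hypothetical names made up for illustration, and the sketch only covers a pod whose phase is already Running.

```go
package main

import "fmt"

// Illustrative stand-ins, not the real Kubernetes or gitops-engine types.
type RestartPolicy string

const (
	RestartAlways    RestartPolicy = "Always"
	RestartOnFailure RestartPolicy = "OnFailure"
	RestartNever     RestartPolicy = "Never"
)

type Health string

const (
	Healthy     Health = "Healthy"
	Progressing Health = "Progressing"
)

// runningPodHealth sketches how a Running pod is assessed according to the
// discussion in this thread (hypothetical helper).
func runningPodHealth(policy RestartPolicy, allContainersReady bool) Health {
	switch policy {
	case RestartOnFailure, RestartNever:
		// Pods with a finite life are assumed to be hooks or Job pods,
		// so they are never reported as Healthy while they are running.
		return Progressing
	default:
		if allContainersReady {
			return Healthy
		}
		return Progressing
	}
}

func main() {
	fmt.Println(runningPodHealth(RestartAlways, true))
	fmt.Println(runningPodHealth(RestartNever, true))
}
```

Under these assumptions, the `foo` pod from the report never leaves "Progressing" while its `sleep` is running, even though all of its containers are ready.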

blakepettersson avatar Sep 01 '23 08:09 blakepettersson

Ah, that's interesting. Thanks for finding it!

It sounds to me like ignoring all pods that set restartPolicy: Never or restartPolicy: OnFailure is a very blunt tool to achieve those ends. Perhaps we can find a better way of doing things?

After reading the documentation, it is my understanding that all resource hooks have the argocd.argoproj.io/hook annotation. Perhaps we could therefore instead look for pods that fulfill these requirements:

  • has restartPolicy: Never or restartPolicy: OnFailure.
  • has the argocd.argoproj.io/hook annotation or is owned by an object with that annotation.

Do you think it sounds reasonable to look for an alternative solution like this, in order to avoid these unexpected behaviors?
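To make the proposal concrete, here is a minimal Go sketch of the check I have in mind. It is only an illustration under assumptions: the `Pod` struct and `isHookPod` function are hypothetical, the owner's annotations are assumed to be already resolved, and none of this is existing Argo CD code. Only the annotation key comes from the documentation.

```go
package main

import "fmt"

// Annotation key used by Argo CD resource hooks.
const hookAnnotation = "argocd.argoproj.io/hook"

// Pod is a hypothetical, trimmed-down stand-in for a Kubernetes Pod.
type Pod struct {
	RestartPolicy    string
	Annotations      map[string]string
	OwnerAnnotations map[string]string // annotations of the owning object, if any
}

// isHookPod reports whether the pod meets both requirements of the proposal:
// a finite restart policy and a hook annotation on the pod or on its owner.
func isHookPod(p Pod) bool {
	finite := p.RestartPolicy == "Never" || p.RestartPolicy == "OnFailure"
	_, onPod := p.Annotations[hookAnnotation]
	_, onOwner := p.OwnerAnnotations[hookAnnotation]
	return finite && (onPod || onOwner)
}

func main() {
	plain := Pod{RestartPolicy: "Never"}
	hook := Pod{
		RestartPolicy: "Never",
		Annotations:   map[string]string{hookAnnotation: "PreSync"},
	}
	fmt.Println(isHookPod(plain), isHookPod(hook))
}
```

With a check like this, only pods for which `isHookPod` returns true would stay "Progressing" until they terminate; a plain pod such as `foo` would be assessed like any other pod.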

lindhe avatar Sep 01 '23 13:09 lindhe

Perhaps we could therefore instead look for pods that fulfill these requirements:

From my perspective that does sound reasonable. I suspect that there might be a caveat or two to this which would still make this a no-go. @jessesuen do you have any input regarding this?

blakepettersson avatar Sep 02 '23 13:09 blakepettersson

@lindhe perhaps the way to move this forward is to come to one of the Argo CD Contributors meetings and make a proposal?

blakepettersson avatar Sep 25 '23 12:09 blakepettersson

Sure, I'd love to. Where do I find the schedule?

lindhe avatar Sep 25 '23 13:09 lindhe

You can see the agenda and Zoom link here. TL;DR: every Thursday at 17:15 CET.

blakepettersson avatar Sep 25 '23 13:09 blakepettersson

Great. I don't have time this week, but expect me soon! :)

lindhe avatar Sep 26 '23 12:09 lindhe

Hey there, any updates on this one? @lindhe

nweisenauer-sap avatar Jan 03 '24 07:01 nweisenauer-sap

Sorry, I dropped the ball on this one. 😕 Q4 last year was hectic for me. I still plan to push for it at the contributors meeting, but if someone has time to join and discuss it before I do, go for it! 👍

lindhe avatar Jan 03 '24 08:01 lindhe

I also ran into this problem with the Flink TaskManager; it may affect Flink autoscaling.

flink operator release-1.7.0

jpuyy avatar Mar 13 '24 02:03 jpuyy

It seems KubeVirt virt-launcher pods are also affected. https://github.com/kubevirt/kubevirt/blob/ea53cc9d444227a033c55d521979e6ccc688456f/pkg/virt-controller/services/template.go#L583

However, the application state is healthy; only the pod is shown as progressing.

jkleinlercher avatar Apr 26 '24 15:04 jkleinlercher

Any updates on this issue?

ErwinJapie avatar Jun 27 '24 09:06 ErwinJapie

I'm waiting on an update on this as well.

pyang55 avatar Jul 26 '24 18:07 pyang55

No updates. Unless anyone has stated otherwise, you can safely assume that this is still up for grabs and that there's no ETA for resolving this issue. Help is always welcome! 🙏

blakepettersson avatar Jul 26 '24 18:07 blakepettersson

@blakepettersson I am just wondering why no one from Red Hat is trying to solve this, because OpenShift Virtualization is affected by this problem and they push OpenShift Virtualization very much. See https://github.com/kubevirt/kubevirt/issues/11813. @terrytangyuan

jkleinlercher avatar Jul 26 '24 18:07 jkleinlercher

cc @jannfis to take a look and help connect with the right folks

terrytangyuan avatar Jul 26 '24 18:07 terrytangyuan

Are there any workarounds for this? Does anyone have an example custom health check we can implement?

Liammarwood avatar Jul 31 '24 04:07 Liammarwood

Are there any workarounds for this?

Yes. The workaround is to not set restartPolicy: Never unless it's in a Pod spawned by a Job (which is supposed to be marked as "Progressing" until the pod has terminated).

lindhe avatar Jul 31 '24 07:07 lindhe

The problem is that KubeVirt uses it, so in that case it should get fixed in https://github.com/kubevirt/kubevirt/issues/11813

jkleinlercher avatar Jul 31 '24 07:07 jkleinlercher

Cool, I hadn't heard about a real-world application with this use-case before! That is interesting. My original example was literally just created as a dummy example to see if Argo CD worked, and I didn't imagine there being legitimate use-cases for restartPolicy: Never outside of jobs.

Can you, please, summarize how kubevirt makes use of it? I can read in the issue that it's the "virt-launcher", but I have no idea what that is. It's a one-time Pod but not a Job??

lindhe avatar Jul 31 '24 07:07 lindhe

It is a permanently running pod. Why it has restartPolicy: Never, I don't know. I would hope that some Red Hat folks would have a look at it, since they are heavy contributors to Argo CD AND KubeVirt and use both products in their OpenShift Virtualization offering.

jkleinlercher avatar Jul 31 '24 07:07 jkleinlercher

I wouldn't say that not setting the restart policy to Never is a workaround; a lot of operators do this, and that is outside of our control. My example of seeing this is with NiFiKop, but from other forums I know there are a few more of them.

Liammarwood avatar Jul 31 '24 07:07 Liammarwood

I wouldn't say that not setting the restart policy to Never is a workaround; a lot of operators do this, and that is outside of our control. My example of seeing this is with NiFiKop, but from other forums I know there are a few more of them.

Yes, same with Apache Flink deployments managed by the Flink Operator.

EuGras avatar Jul 31 '24 14:07 EuGras

Maybe we can implement an annotation to stop this behaviour? Then we wouldn't have to break anything current?

Liammarwood avatar Jul 31 '24 20:07 Liammarwood