App health stuck at "Progressing" forever when pod has `restartPolicy: Never`
Checklist:
- [x] I've searched in the docs and FAQ for my answer: https://bit.ly/argocd-faq.
- [x] I've included steps to reproduce the bug.
- [x] I've pasted the output of `argocd version`.
Describe the bug
When an Application deploys a pod that has `restartPolicy: Never`, the Application gets stuck with health "Progressing" seemingly forever.
To Reproduce
Apply this manifest:
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  finalizers:
    - resources-finalizer.argocd.argoproj.io
  name: debug
  namespace: argocd
spec:
  destination:
    server: https://kubernetes.default.svc
    namespace: default
  source:
    path: manifests
    repoURL: https://github.com/lindhe/debug-argocd-15317.git
```
That app installs just this seemingly innocuous pod:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: foo
spec:
  containers:
    - image: busybox
      command: ['sh', '-c', 'sleep 3600']
      name: bar
  restartPolicy: Never
```
Expected behavior
The application health should soon become "Healthy".
Screenshots
Version
```
argocd: v2.7.9+0ee33e5
  BuildDate: 2023-07-24T18:26:12Z
  GitCommit: 0ee33e52dd1f1bb944488584fc6f854b929f1180
  GitTreeState: clean
  GoVersion: go1.19.11
  Compiler: gc
  Platform: linux/amd64
WARN[0000] Failed to invoke grpc call. Use flag --grpc-web in grpc calls. To avoid this warning message, use flag --grpc-web.
argocd-server: v2.8.0+804d4b8
```
Logs
No relevant logs that I've found.
Other information
Similar to #5620 and some other issues found by searching for "health progressing".
I'm running an RKE2 cluster with version `v1.26.6+rke2r1` and Cilium `1.13.2` installed.
This seems to be by design, or rather there is currently nothing better than the existing behavior (see the relevant snippet in gitops-engine).
Ah, that's interesting. Thanks for finding it!
It sounds to me like ignoring all pods that set `restartPolicy: Never` or `restartPolicy: OnFailure` is a very blunt tool to achieve those ends. Perhaps we can find a better way of doing things?
After reading the documentation, it is my understanding that all resource hooks have the `argocd.argoproj.io/hook` annotation. Perhaps we could therefore instead look for pods that fulfill both of these requirements:
- has `restartPolicy: Never` or `restartPolicy: OnFailure`
- has the `argocd.argoproj.io/hook` annotation or is owned by an object with that annotation
Do you think it sounds reasonable to look for an alternative solution like this, in order to avoid these unexpected behaviors?
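For illustration, here is a hypothetical example of the proposed rule (not current behavior, and the Pod name is made up): a resource hook Pod like the one below would keep today's "Progressing until terminated" semantics, while the plain Pod from the reproduction above, which has no hook annotation, would report Healthy once it is Running.

```yaml
apiVersion: v1
kind: Pod
metadata:
  # Hypothetical hook Pod, invented for illustration only.
  name: post-sync-task
  annotations:
    # The resource hook annotation the proposal keys on: only Pods with
    # this annotation (or owned by an object that has it) would be held
    # at "Progressing" until they terminate.
    argocd.argoproj.io/hook: PostSync
spec:
  restartPolicy: Never
  containers:
    - name: task
      image: busybox
      command: ['sh', '-c', 'echo done']
```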
> Perhaps we could therefore instead look for pods that fulfill these requirements:
From my perspective that does sound reasonable. I suspect that there might be a caveat or two to this which would still make this a no-go. @jessesuen do you have any input regarding this?
@lindhe perhaps the way to move this forward is to come in one of the Argo CD Contributors meetings and make a proposal?
Sure, I'd love to. Where do I find the schedule?
You can see the agenda and Zoom link here. TL;DR: every Thursday at 17:15 CET.
Great. I don't have time this week, but expect me soon! :)
Hey there, any updates on this one? @lindhe
Sorry, I dropped the ball on this one. Q4 last year was hectic for me. I still plan to push for it at the contributors meeting, but if someone has time to join and discuss it before I get to it, go for it!
I also hit this problem with the Flink task manager; it may affect Flink autoscaling. Flink operator release-1.7.0.
It seems kubevirt virt-launcher pods are also affected: https://github.com/kubevirt/kubevirt/blob/ea53cc9d444227a033c55d521979e6ccc688456f/pkg/virt-controller/services/template.go#L583
However, the application state is Healthy; it's just the pod that shows as Progressing.
Any updates on this issue?
I'm waiting on an update on this as well.
No updates. You can safely assume that, unless anyone has stated otherwise, this is still up for grabs and that there's no ETA for the resolution of this issue. Help is always welcome!
@blakepettersson I am just wondering why no one from Red Hat tries to solve this, because OpenShift Virtualization is affected by this problem and they push OpenShift Virtualization very much. See https://github.com/kubevirt/kubevirt/issues/11813. @terrytangyuan
cc @jannfis to take a look and help connect with the right folks
Are there any workarounds for this? Does anyone have an example custom health check we can implement?
> Are there any workarounds for this?
Yes. The workaround is to not set `restartPolicy: Never` unless it's in a Pod spawned by a Job (which is supposed to be marked as "Progressing" until the pod has terminated).
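For anyone looking for the custom health check route asked about above, something along these lines might work as a workaround. This is an untested sketch: it assumes a custom health check in `argocd-cm` takes precedence over the built-in Pod assessment, and that the key for the core `Pod` kind is simply `Pod`; check the resource customization docs for the exact key format before relying on it.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
  labels:
    app.kubernetes.io/part-of: argocd
data:
  # Untested sketch: report a Running pod as Healthy even when it has
  # restartPolicy: Never.
  resource.customizations.health.Pod: |
    hs = {}
    hs.status = "Progressing"
    hs.message = "Waiting for pod"
    if obj.status ~= nil and obj.status.phase ~= nil then
      if obj.status.phase == "Running" or obj.status.phase == "Succeeded" then
        hs.status = "Healthy"
        hs.message = obj.status.phase
      elseif obj.status.phase == "Failed" then
        hs.status = "Degraded"
        hs.message = obj.status.phase
      end
    end
    return hs
```

Note that an override like this would apply to every Pod Argo CD tracks, not only those with `restartPolicy: Never`, so it is a fairly blunt instrument itself.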
The problem is that kubevirt uses it, so it should get fixed in https://github.com/kubevirt/kubevirt/issues/11813.
Cool, I hadn't heard about a real-world application with this use case before! That is interesting. My original example was literally just created as a dummy example to see if Argo CD worked, and I didn't imagine there being legitimate use cases for `restartPolicy: Never` outside of Jobs.
Can you please summarize how kubevirt makes use of it? I can read in the issue that it's the "virt-launcher", but I have no idea what that is. It's a one-time Pod but not a Job??
It is a permanently running pod. Why it has `restartPolicy: Never`, I don't know. I would hope that some Red Hat folks would have a look at it, since they are heavy contributors to both Argo CD and kubevirt and use both products in their OpenShift Virtualization offering.
I wouldn't say that not setting `restartPolicy: Never` is a workaround; a lot of operators do this outside of our control. My example of seeing this is using Nifikop, but from other forums I know there are a few more of them.
> I wouldn't say that not setting `restartPolicy: Never` is a workaround; a lot of operators do this outside of our control.
Yes, same with Apache Flink deployments managed by the Flink operator.
Maybe we could implement an annotation to opt out of this behaviour? Then we wouldn't have to break anything that currently works.
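Sketching what that could look like on the reproduction Pod (the annotation name below is purely hypothetical and not implemented anywhere; it only illustrates the idea of an opt-out):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: foo
  annotations:
    # Hypothetical, not an existing Argo CD annotation: opt this Pod out
    # of the "Progressing until terminated" treatment applied to
    # restartPolicy Never/OnFailure.
    argocd.argoproj.io/ignore-restart-policy: "true"
spec:
  restartPolicy: Never
  containers:
    - name: bar
      image: busybox
      command: ['sh', '-c', 'sleep 3600']
```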