If any conversion webhook on any CRD isn't available, all apps on the cluster go to an "unknown" state.
Checklist:
- [x] I've searched in the docs and FAQ for my answer: https://bit.ly/argocd-faq.
- [x] I've included steps to reproduce the bug.
- [x] I've pasted the output of `argocd version`.
Describe the bug
Argocd version: v2.12.4+27d1e64
If you install a CRD with a conversion webhook on a cluster and the conversion webhook is down, then all applications on that cluster go to an Unknown or error state:
Failed to load target state: failed to get cluster version for cluster "
If I have SSA (server-side apply) on, the UI just gets stuck in "Refreshing" and there's a nil pointer exception in the logs:
time="2024-11-18T14:19:18Z" level=error msg="Recovered from panic: runtime error: invalid memory address or nil pointer dereference
goroutine 294 [running]: runtime/debug.Stack() /usr/local/go/src/runtime/debug/stack.go:24 +0x5e
github.com/argoproj/argo-cd/v2/controller.(*ApplicationController).processAppRefreshQueueItem.func1() /go/src/github.com/argoproj/argo-cd/controller/appcontroller.go:1480 +0x54
panic({0x382cd20?, 0x7756330?}) /usr/local/go/src/runtime/panic.go:770 +0x132
github.com/argoproj/argo-cd/v2/controller.(*appStateManager).CompareAppState(0xc00055cd20, 0xc0dae6a408, 0xc0a7114488, {0xc0a792d6c0, 0x1, 0x1}, {0xc0a7920700, 0x1, 0x1}, 0x0, ...) /go/src/github.com/argoproj/argo-cd/controller/state.go:864 +0x5ff9
github.com/argoproj/argo-cd/v2/controller.(*ApplicationController).processAppRefreshQueueItem(0xc0004dec40) /go/src/github.com/argoproj/argo-cd/controller/appcontroller.go:1590 +0x1188
github.com/argoproj/argo-cd/v2/controller.(*ApplicationController).Run.func3() /go/src/github.com/argoproj/argo-cd/controller/appcontroller.go:830 +0x25
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x30?) /go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/backoff.go:226 +0x33
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc000636b00, {0x5555d00, 0xc001cec2a0}, 0x1, 0xc000081f80) /go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/backoff.go:227 +0xaf
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc000636b00, 0x3b9aca00, 0x0, 0x1, 0xc000081f80) /go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/backoff.go:204 +0x7f
k8s.io/apimachinery/pkg/util/wait.Until(...) /go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/backoff.go:161
created by github.com/argoproj/argo-cd/v2/controller.(*ApplicationController).Run in goroutine 112 /go/src/github.com/argoproj/argo-cd/controller/appcontroller.go:829 +0x865
To Reproduce
Install a CRD with a conversion webhook that goes to an unavailable endpoint.
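As a concrete sketch (the group, kind, and Service names here are hypothetical; the repro repo linked at the end of this thread has working manifests), a CRD along these lines reproduces the failure: two served versions and a conversion webhook whose clientConfig points at a Service that doesn't exist.

```yaml
# Hypothetical minimal CRD: the conversion webhook points at a Service that
# does not exist, so any conversion attempt fails.
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: widgets.example.com
spec:
  group: example.com
  names:
    kind: Widget
    plural: widgets
  scope: Namespaced
  versions:
    - name: v1
      served: true
      storage: true   # v1 is still the storage version
      schema:
        openAPIV3Schema:
          type: object
          x-kubernetes-preserve-unknown-fields: true
    - name: v2
      served: true
      storage: false
      schema:
        openAPIV3Schema:
          type: object
          x-kubernetes-preserve-unknown-fields: true
  conversion:
    strategy: Webhook
    webhook:
      conversionReviewVersions: ["v1"]
      clientConfig:
        service:
          name: missing-conversion-webhook   # intentionally points nowhere
          namespace: default
          path: /convert
          port: 443
```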
Expected behavior
I'm not sure what the expected behavior should be, but at the very least there shouldn't be an NPE when this happens with SSA.
It would be nice to be able to exclude those resources on an app-by-app basis, or to skip any resource kinds that aren't part of the application. As it stands, if fixing the webhook requires a new sync, I can't actually run one.
Version
Argocd version: v2.12.4+27d1e64
Are you sure the controller is on 2.12.4? Not sure how this line can throw a nil pointer exception:
https://github.com/argoproj/argo-cd/blob/27d1e641b6ea99d9f4bf788c032aeaeefd782910/controller/state.go#L864
I thought the same thing, but I just confirmed and that's what version I'm on.
Just to clarify, `argocd version` outputs versions for both the Argo CD CLI and the Argo CD server. The first one is the CLI, the second one is server-side. We need the second one. Sorry if you already checked that and that's also 2.12.4.
Though maybe we have some memory corruption.
Can you try with v2.13.1, please?
Hello, any update on this issue? We are also facing it when trying to upgrade karpenter-crds with the conversion webhook enabled; the webhook has not yet been installed on the cluster, and that causes all the apps to go into an Unknown state.
We see this issue even with v2.13.2
Any update on this? We're facing this exact same issue; it leaves our clusters in a broken state when it happens.
Hi @andrii-korotkov-verkada @crenshaw-dev, is there any reason why a CRD version that is served but not stored is preferred over the version that is stored? We're running into the same issue (with a different error message) on v2.13.4.
E.g., in our case it is BucketLifecycleConfiguration.s3.aws.upbound.io, which has storage version v1beta1 (see source), but ArgoCD is requesting v1beta2, which is visible in the error message as well as in the "Live manifest" tab in the web UI.
EDIT: it seems like ArgoCD relies on API auto-discovery, which uses the preferredVersion field:
> kubectl get --raw /apis | yq -P
...
- name: s3.aws.upbound.io
  versions:
    - groupVersion: s3.aws.upbound.io/v1beta2
      version: v1beta2
    - groupVersion: s3.aws.upbound.io/v1beta1
      version: v1beta1
  preferredVersion:
    groupVersion: s3.aws.upbound.io/v1beta2
    version: v1beta2
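For comparison, the CRD itself records which version is actually persisted. Here is a sketch (not a verbatim dump of the Upbound CRD) of the spec.versions layout implied by the details above:

```yaml
# Sketch of the spec.versions layout described above: v1beta1 is the stored
# version, while /apis discovery reports v1beta2 as the preferredVersion that
# ArgoCD ends up requesting.
spec:
  versions:
    - name: v1beta1
      served: true
      storage: true    # what is actually persisted in etcd
    - name: v1beta2
      served: true
      storage: false   # what discovery advertises as preferred
```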
Hi team - any update on this? We are hitting this with some Crossplane providers. When there is an issue with a conversion webhook, it breaks the whole state of the target cluster.
I submitted a first go at fixing this. Feedback much appreciated.
There's a general class of issues where "the gitops-engine cluster cache couldn't be populated." A common case is when lists take a long time, and the pagination token expires. When those failures happen, the cluster cache is marked as failed.
I think it would be valuable for the cluster cache to have a "tainted" state where we know we couldn't process all resource kinds, but we're still going to operate on a best-effort basis. We'd want to communicate that state all the way up to the Applications that deploy to the tainted cluster, so that users know things may go wrong (for example, some resources may be missing from the resource tree).
I think it would also be useful to store information about which resource kinds are tainted. So if an Application deploys to a tainted cluster, it shows a warning, but if an Application directly manages a tainted resource kind via GitOps, we take more drastic measures such as marking the Application as being in a broken state and refusing to do sync operations.
This approach would require thinking carefully about what could go wrong when operating on a partial cluster cache. The things that come to mind immediately are:
- Resources may be missing from the resource tree
- Diffs may be incorrect
The graceful degradation approach is obviously only viable if we don't introduce any security issues. I can't think of any at the moment.
This occurs under the conditions where somebody is evolving their CRDs:
- v1 is the storage version
- v1 and v2 are both served
- a conversion webhook is added
If the conversion webhook goes down, argo can't populate its cluster cache due to the failure of the webhook.
This is distinct from the case where v2 is the storage version, in which case the failure of the webhook only breaks apps with instances of the CR.
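To illustrate that distinction, here is a sketch of the contrasting spec.versions layout once v2 has become the storage version (same hypothetical Widget CRD as in the repro step above):

```yaml
# Contrasting case: v2 is now the storage version. Per the comment above, a
# broken conversion webhook then only breaks apps with instances of the CR,
# not the whole cluster cache. (Hypothetical names for illustration.)
spec:
  versions:
    - name: v1
      served: true
      storage: false
    - name: v2
      served: true
      storage: true
```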
Here is a full reproduction case with manifests, instructions, and a webhook image.
https://github.com/jcogilvie/conversion-webhook-repro