If any conversion webhook on any CRD isn't available, all apps on the cluster go to an "unknown" state.

Open johnthompson-ybor opened this issue 1 year ago • 10 comments

Checklist:

  • [x] I've searched in the docs and FAQ for my answer: https://bit.ly/argocd-faq.
  • [x] I've included steps to reproduce the bug.
  • [x] I've pasted the output of argocd version.

Describe the bug

Argocd version: v2.12.4+27d1e64

If you install any CRD with a conversion webhook on a cluster, and the conversion webhook is down, then all applications on that cluster go into an Unknown or error state:

Failed to load target state: failed to get cluster version for cluster "": failed to get cluster info for """: error synchronizing cache state : failed to sync cluster ": failed to load initial state of resource BucketServerSideEncryptionConfiguration.s3.aws.upbound.io: conversion webhook for s3.aws.upbound.io/v1beta1, Kind=BucketServerSideEncryptionConfiguration failed: Post "https://provider-aws-s3.crossplane-system.svc:9443/convert?timeout=30s": no endpoints available for service "provider-aws-s3"
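
A quick way to confirm the "no endpoints available" part is to check the conversion webhook's Service directly (names taken from the error above; adjust for your setup):

  kubectl -n crossplane-system get endpoints provider-aws-s3
  # An empty endpoints list here means the webhook has no backing pods to route to.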

If I have SSA (server-side apply) on, the UI just gets stuck in "refreshing" and there's a nil pointer exception in the logs.

time="2024-11-18T14:19:18Z" level=error msg="Recovered from panic: runtime error: invalid memory address or nil pointer dereference

goroutine 294 [running]: runtime/debug.Stack() /usr/local/go/src/runtime/debug/stack.go:24 +0x5e

github.com/argoproj/argo-cd/v2/controller.(*ApplicationController).processAppRefreshQueueItem.func1() /go/src/github.com/argoproj/argo-cd/controller/appcontroller.go:1480 +0x54

panic({0x382cd20?, 0x7756330?}) /usr/local/go/src/runtime/panic.go:770 +0x132

github.com/argoproj/argo-cd/v2/controller.(*appStateManager).CompareAppState(0xc00055cd20, 0xc0dae6a408, 0xc0a7114488, {0xc0a792d6c0, 0x1, 0x1}, {0xc0a7920700, 0x1, 0x1}, 0x0, ...) /go/src/github.com/argoproj/argo-cd/controller/state.go:864 +0x5ff9

github.com/argoproj/argo-cd/v2/controller.(*ApplicationController).processAppRefreshQueueItem(0xc0004dec40) /go/src/github.com/argoproj/argo-cd/controller/appcontroller.go:1590 +0x1188

github.com/argoproj/argo-cd/v2/controller.(*ApplicationController).Run.func3() /go/src/github.com/argoproj/argo-cd/controller/appcontroller.go:830 +0x25

k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x30?) /go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/backoff.go:226 +0x33

k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc000636b00, {0x5555d00, 0xc001cec2a0}, 0x1, 0xc000081f80) /go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/backoff.go:227 +0xaf

k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc000636b00, 0x3b9aca00, 0x0, 0x1, 0xc000081f80) /go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/backoff.go:204 +0x7f

k8s.io/apimachinery/pkg/util/wait.Until(...) /go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/backoff.go:161

created by github.com/argoproj/argo-cd/v2/controller.(*ApplicationController).Run in goroutine 112 /go/src/github.com/argoproj/argo-cd/controller/appcontroller.go:829 +0x865

To Reproduce

Install a CRD with a conversion webhook that goes to an unavailable endpoint.
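
A sketch of the relevant conversion stanza on such a CRD (hypothetical Service name; the point is that the referenced Service has no ready endpoints), assuming at least one custom resource already exists so that listing actually triggers a conversion:

  # Illustrative only -- partial CRD spec with a conversion webhook that can never be reached.
  spec:
    conversion:
      strategy: Webhook
      webhook:
        conversionReviewVersions: ["v1"]
        clientConfig:
          service:
            name: widget-conversion    # hypothetical Service with no endpoints
            namespace: default
            path: /convert
            port: 9443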

Expected behavior

I'm not sure what the expected behavior should be, but at the very least there shouldn't be an NPE when this happens with SSA.

It would be nice to be able to exclude those resources on an app-by-app basis, or to skip any resource kinds that aren't included in the application. As it stands, if I need to run a new sync to fix the webhook, I can't really do it.
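
For reference, the closest existing knob appears to be Argo CD's instance-wide resource.exclusions setting in argocd-cm: it stops the controller from watching a kind at all, so it isn't per-app, but it can unblock syncs while the webhook is being repaired. A minimal sketch using the kind from the error above:

  apiVersion: v1
  kind: ConfigMap
  metadata:
    name: argocd-cm
    namespace: argocd
  data:
    # Exclude the kind whose conversion webhook is down so the cluster cache
    # no longer tries to list it.
    resource.exclusions: |
      - apiGroups:
          - s3.aws.upbound.io
        kinds:
          - BucketServerSideEncryptionConfiguration
        clusters:
          - "*"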

Screenshots

Version


Argocd version: v2.12.4+27d1e64

Logs


johnthompson-ybor avatar Nov 18 '24 15:11 johnthompson-ybor

Are you sure the controller is on 2.12.4? Not sure how this line can throw a nil pointer exception:

https://github.com/argoproj/argo-cd/blob/27d1e641b6ea99d9f4bf788c032aeaeefd782910/controller/state.go#L864

crenshaw-dev avatar Nov 18 '24 16:11 crenshaw-dev

I thought the same thing, but I just confirmed and that's what version I'm on.

johnthompson-ybor avatar Nov 20 '24 14:11 johnthompson-ybor

Just to clarify, argocd version outputs versions for both the argocd CLI and the argocd server. The first one is the CLI, the second one is server-side; we need the second one. Sorry if you already checked that and that's also 2.12.4.
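
Roughly, the output is shaped like this (values elided, annotations added); the argocd-server line is the one that matters here:

  $ argocd version
  argocd: v2.12.4+27d1e64      <- CLI (client) version
    ...
  argocd-server: v2.12.4+...   <- server-side version, the one we need
    ...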

andrii-korotkov-verkada avatar Nov 21 '24 04:11 andrii-korotkov-verkada

Though maybe we have some memory corruption.

andrii-korotkov-verkada avatar Nov 21 '24 05:11 andrii-korotkov-verkada

Can you try with v2.13.1, please?

andrii-korotkov-verkada avatar Nov 22 '24 12:11 andrii-korotkov-verkada

Hello, any update on this issue? We are also facing it: when we try to upgrade karpenter-crds with the webhook enabled but the webhook has not yet been installed on the cluster, all the apps go into an unknown state.

tmoreadobe avatar Jan 13 '25 19:01 tmoreadobe

We see this issue even with v2.13.2

tmoreadobe avatar Jan 13 '25 19:01 tmoreadobe

Any update on this? We're facing this exact same issue, and it makes our clusters go potato in such cases.

mycodeself avatar Feb 03 '25 13:02 mycodeself

Hi @andrii-korotkov-verkada @crenshaw-dev, is there any reason why a CRD version that is served but not stored is preferred over the version that is stored? We're running into the same issue (with a different error message) on v2.13.4.

E.g. in our case it is BucketLifecycleConfiguration.s3.aws.upbound.io, which has storage version v1beta1 (see source), but ArgoCD is requesting v1beta2, which is visible in the error message as well as in the "Live manifest" tab in the web UI.

EDIT: it seems like ArgoCD relies on API auto-discovery, which uses the preferredVersion field:

> kubectl get --raw /apis | yq -P
  ...
  - name: s3.aws.upbound.io
    versions:
      - groupVersion: s3.aws.upbound.io/v1beta2
        version: v1beta2
      - groupVersion: s3.aws.upbound.io/v1beta1
        version: v1beta1
    preferredVersion:
      groupVersion: s3.aws.upbound.io/v1beta2
      version: v1beta2
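
One way to see the served/storage flags on the CRD itself (assuming the usual plural CRD name for this kind):

  kubectl get crd bucketlifecycleconfigurations.s3.aws.upbound.io \
    -o jsonpath='{range .spec.versions[*]}{.name} served={.served} storage={.storage}{"\n"}{end}'
  # e.g. v1beta1 served=true storage=true
  #      v1beta2 served=true storage=false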

m1so avatar Feb 06 '25 08:02 m1so

Hi team - any update on this? We are hitting this with some Crossplane providers. When there is an issue with a conversion webhook, it breaks the whole state of the target cluster.

sbyrne-teck avatar Jun 02 '25 20:06 sbyrne-teck

I submitted a first go at fixing this. Feedback much appreciated.

jcogilvie avatar Jun 16 '25 17:06 jcogilvie

There's a general class of issues where "the gitops-engine cluster cache couldn't be populated." A common case is when lists take a long time, and the pagination token expires. When those failures happen, the cluster cache is marked as failed.

I think it would be valuable for the cluster cache to have a "tainted" state where we know we couldn't process all resource kinds, but we're still going to operate on a best-effort basis. We'd want to communicate that state all the way up to the Applications that deploy to the tainted cluster, so that users know things may go wrong (for example, some resources may be missing from the resource tree).

I think it would also be useful to store information about which resource kinds are tainted. So if an Application deploys to a tainted cluster, it shows a warning, but if an Application directly manages a tainted resource kind via GitOps, we take more drastic measures such as marking the Application as in a broken state and refusing to do sync operations.

This approach would require thinking carefully about what could go wrong when operating on a partial cluster cache. The things that come to mind immediately are:

  1. Resources may be missing from the resource tree
  2. Diffs may be incorrect

The graceful degradation approach is obviously only viable if we don't introduce any security issues. I can't think of any at the moment.
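
Purely as an illustration of the idea (none of these fields or condition types exist today), the tainted state might surface on an Application as something like:

  # Hypothetical sketch only -- not an existing Argo CD API.
  status:
    conditions:
      - type: ClusterCacheTainted      # hypothetical condition type
        message: >-
          Cluster cache is partial: kind BucketLifecycleConfiguration.s3.aws.upbound.io
          could not be listed (conversion webhook unavailable). The resource tree and
          diffs may be incomplete.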

crenshaw-dev avatar Jul 31 '25 15:07 crenshaw-dev

This occurs under the conditions where somebody is evolving their CRDs:

  • v1 is the storage version
  • v1 and v2 are both served
  • a conversion webhook is added

If the conversion webhook goes down, Argo can't populate its cluster cache.

This is distinct from the case where v2 is the storage version, in which case the failure of the webhook only breaks apps with instances of the CR.
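
In CRD terms, the shape described above is roughly this (annotated sketch, not a complete manifest):

  # Sketch of the versions/conversion state described above.
  spec:
    conversion:
      strategy: Webhook            # webhook added while...
    versions:
      - name: v1
        served: true
        storage: true              # ...v1 is still the storage version
      - name: v2
        served: true
        storage: false             # v2 is served (and typically preferred), so listing it
                                   # requires converting every stored v1 object via the webhook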

Here is a full reproduction case with manifests, instructions, and a webhook image.

https://github.com/jcogilvie/conversion-webhook-repro

jcogilvie avatar Aug 07 '25 17:08 jcogilvie

After another pass, my fix is much more robust now in accordance with the above comment. PTAL if you have the time.

jcogilvie avatar Sep 02 '25 16:09 jcogilvie