Sync loop for Helm Applications that are using post-delete hooks

Open ZF-fredericvanlinthoudt opened this issue 1 year ago • 12 comments

Checklist:

  • [x] I've searched in the docs and FAQ for my answer: https://bit.ly/argocd-faq.
  • [x] I've included steps to reproduce the bug.
  • [x] I've pasted the output of argocd version.

Describe the bug

Since updating to ArgoCD v2.10.0, we are facing a constant refresh/sync loop with Applications that use a Helm chart as source and define "post-delete" hooks in Helm. This is probably related to the new feature that added support for post-delete hooks. The application diff (see screenshot below) shows that it wants to remove two post-delete-finalizer.argocd.argoproj.io finalizers from the Application. This change gets synced, but almost instantaneously the Application goes out-of-sync again with the same diff, and the process repeats over and over. On our production ArgoCD instance, with more than 1200 applications, this causes ArgoCD to freeze and stop syncing any other applications (their syncs are stuck in "waiting to start").

To Reproduce

https://REDACTED.git is a placeholder for a GIT repository that contains directories with Applications

  • Have an ApplicationSet that generates Applications from underlying directories in a GIT repository with auto-sync enabled.
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  labels:
    argocd.argoproj.io/instance: appset-prd
    zf.argocd.ground-services: 'true'
    zf.argocd.group: optional-gloo-release
    zf.argocd.kind: application-set
    zf.argocd.stage: prd
  name: prd-optional-gloo-release
  namespace: argocd
spec:
  generators:
    - matrix:
        generators:
          - clusters:
              selector:
                matchLabels:
                  argocd.argoproj.io/secret-type: cluster
                  zf-gloo: 'true'
                  zf-kind: global
                  zf-stage: prd
          - git:
              directories:
                - path: charts/optional-releases/gloo-release/*
              repoURL: >-
                https://REDACTED.git
              revision: release/prd
  template:
    metadata:
      annotations:
        argocd.argoproj.io/manifest-generate-paths: .
      labels:
        zf.argocd.ground-services: 'true'
        zf.argocd.group: optional-releases
        zf.argocd.kind: app-of-application-set
        zf.argocd.stage: prd
      name: 'app-{{ name }}-{{ path.basename }}'
    spec:
      destination:
        namespace: argocd
        server: 'https://kubernetes.default.svc'
      project: 'project-ground-services-apps-{{ name }}'
      source:
        helm:
          parameters:
            - name: clusterName
              value: '{{ name }}'
            - name: destinationServer
              value: '{{ server }}'
            - name: branch
              value: release/prd
            - name: stage
              value: prd
        path: '{{ path }}'
        repoURL: >-
          https://REDACTED.git
        targetRevision: release/prd
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
  • Have an underlying directory in https://REDACTED.git containing a Helm application that uses post-delete hooks. We are using Gloo Edge Enterprise Helm chart v1.15.10 (https://storage.googleapis.com/gloo-ee-helm => gloo-ee).

Expected behavior

Applications that use post-delete Helm hooks should be synced successfully in one go and should not constantly be synced over and over again when auto-sync is enabled.

Screenshots

image

Version

argocd: v2.10.0+2175939.dirty
BuildDate: 2024-02-06T15:31:31Z
GitCommit: 2175939ed6156ddd743e60f427f7f48118c971bf
GitTreeState: dirty
GoVersion: go1.21.6
Compiler: gc
Platform: linux/amd64
argocd-server: v2.10.0+2175939

Logs

No relevant logs found.

ZF-fredericvanlinthoudt avatar Feb 07 '24 13:02 ZF-fredericvanlinthoudt

We also experienced this, and since we have Argo CD installed via Helm we had fun trying to roll back 😅

pohldk avatar Feb 07 '24 14:02 pohldk

On our production ArgoCD, with 1000+ applications, the sync and refresh buttons completely froze the UI after updating to v2.10.0. We noticed that the application controller used twice as much memory and CPU, but we also didn't find any relevant logs. We had to roll back to v2.9.5.

tcpecheanu avatar Feb 08 '24 07:02 tcpecheanu

Sharding is not working in 2.10.0 the way it did in previous versions. If you remove the env variable ARGOCD_CONTROLLER_REPLICAS and restart the controller, you will see that sync and refresh start working again.

AnubhavSabharwa avatar Feb 10 '24 15:02 AnubhavSabharwa

We experience the same sync loop issue with version 2.10.5. image

Has anyone found a solution for this? Is it an option to add the 2 finalizers to the Application in Git? Or would that break an initial deploy?

Skoucail avatar Apr 04 '24 12:04 Skoucail

Fixed by #18003 ?

joebowbeer avatar May 07 '24 22:05 joebowbeer

Hello, we also started seeing several Applications in ArgoCD constantly out of sync, with those 2 finalizers as the diff. This started after upgrading from v2.9.6 to v2.11.0. After reverting to v2.9.6 everything went back to normal. After the upgrade to v2.11.0, every metric went up (memory usage, CPU usage, and also the queue times, which had been zero). The upgrade occurred around 9 AM today, May 20th. image image image

After installing v2.9.6 everything went back to normal again. Please ignore the gap between ~17:35 and ~18:00; we had an issue with the metrics collection. image image image image

It can be clearly seen that there is a spike in every metric of the application controller (CPU, RAM, Kubernetes executions) and a drop after reverting to v2.9.6. We could also see an immediate increase in the queue time, which stays at zero after reverting the version.

At the moment we only have these metrics for v2.9.6 and v2.11.0. For some reason our metrics agent is not able to gather any information with other versions. We will check what can be done and test with other versions to see if this issue with the finalizers persists.

Thanks!

UPDATE: Hello, just to add more information regarding the issue. It seems that v2.9.15 works like v2.9.6, while trying out v2.10.10 caused the issues mentioned above, so it must be something introduced in v2.10.x. As soon as this version is installed, we start seeing the queue increasing and the apps entering a sync loop. Thanks for the support.

ricardojdsilva87 avatar May 20 '24 20:05 ricardojdsilva87

I'm on version v2.11.2+25f7504 and experience the same problem. I'm stuck in infinite sync loops if selfHeal is on. image

mmalyska avatar Jun 04 '24 12:06 mmalyska

I've installed the version below and I am facing the same issue:

{
  "Version": "v2.11.3+3f344d5",
  "BuildDate": "2024-06-06T08:42:00Z",
  "GitCommit": "3f344d54a4e0bbbb4313e1c19cfe1e544b162598",
  "GitTreeState": "clean",
  "GoVersion": "go1.21.9",
  "Compiler": "gc",
  "Platform": "linux/amd64",
  "KustomizeVersion": "v5.2.1 2023-10-19T20:13:51Z",
  "HelmVersion": "v3.14.4+g81c902a",
  "KubectlVersion": "v0.26.11",
  "JsonnetVersion": "v0.20.0"
}

argocd_issue

antonio-tolentino avatar Jun 26 '24 13:06 antonio-tolentino

Got the same with the NVIDIA GPU operator, and disabling self-heal doesn't change anything.

didlawowo avatar Aug 25 '24 00:08 didlawowo

The same is still happening in the latest version v2.12.4: image

ricardojdsilva87 avatar Oct 10 '24 10:10 ricardojdsilva87

We are also experiencing this. Is there a workaround?

gadiener avatar Oct 15 '24 14:10 gadiener

We're experiencing the same issue with the Falcon sensor, as mentioned in the previous comment. Could you please advise?

igorivan avatar Oct 15 '24 15:10 igorivan

Got the same issue as well. Any tips on how to circumvent it?

wikka avatar Oct 16 '24 08:10 wikka

Hey, I found a possible mitigation in Issue-17433, which is probably a duplicate of this ticket. TL;DR: just add the following to the argocd-cm ConfigMap to ignore these differences in Argo CD Applications:

resource.customizations.ignoreDifferences.argoproj.io_Application: |
  jqPathExpressions:
    - .metadata.finalizers[]? | select(. == "post-delete-finalizer.argocd.argoproj.io" or . == "post-delete-finalizer.argocd.argoproj.io/cleanup")
    - if (.metadata.finalizers | length) == 0 then .metadata.finalizers else empty end
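
For context, here is a minimal sketch of where that key would sit in a complete ConfigMap manifest. The `argocd` namespace and the two labels are assumptions based on a default installation, so adjust them to your setup. The jq expressions are copied unchanged from the snippet above.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd                       # assumption: default install namespace
  labels:
    app.kubernetes.io/name: argocd-cm     # assumption: labels of a default install
    app.kubernetes.io/part-of: argocd
data:
  # Ignore the two post-delete finalizers when diffing Application resources
  resource.customizations.ignoreDifferences.argoproj.io_Application: |
    jqPathExpressions:
      - .metadata.finalizers[]? | select(. == "post-delete-finalizer.argocd.argoproj.io" or . == "post-delete-finalizer.argocd.argoproj.io/cleanup")
      - if (.metadata.finalizers | length) == 0 then .metadata.finalizers else empty end
```

If the ConfigMap is managed by a Helm chart or another tool, the key would need to be set there instead, otherwise the next deployment is likely to overwrite a manual edit.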

lorenzboguhn avatar Oct 16 '24 09:10 lorenzboguhn

Hello, indeed the mentioned snippet stops the post-delete hook finalizers from being considered a diff. After enabling this setting, the resource usage of the controller is not as high as mentioned before. image

But the queue still increases: image

We are using the ArgoCD Datadog integration, so these metrics are reported directly by the ArgoCD pods. Some metrics that increased a lot and might be related are these: image

They seem to be related to the repository server. Could this also be related to the queue increasing? This might be a separate issue, not related to the post-delete hook, that just happens after upgrading to a 2.10.x or later release. The server-side diff feature was added in this release, but as far as I know it is disabled by default and has to be enabled in the ConfigMap with controller.diff.server.side (see the documentation).
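
For reference, a sketch of what enabling server-side diff at the controller level would look like, in case someone wants to rule it in or out. It assumes the `argocd-cmd-params-cm` ConfigMap of a default installation in the `argocd` namespace. The application controller needs a restart before the setting takes effect.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cmd-params-cm
  namespace: argocd                       # assumption: default install namespace
  labels:
    app.kubernetes.io/name: argocd-cmd-params-cm   # assumption: labels of a default install
    app.kubernetes.io/part-of: argocd
data:
  # Absent or "false" keeps the default (server-side diff disabled)
  controller.diff.server.side: "true"
```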

I'll post here if I can find anything else new

ricardojdsilva87 avatar Oct 16 '24 16:10 ricardojdsilva87