ApplicationSet resources experience data corruption

Open dsiebel opened this issue 3 months ago • 15 comments

Checklist:

  • [x] I've searched in the docs and FAQ for my answer: https://bit.ly/argocd-faq.
  • [x] I've included steps to reproduce the bug.
  • [ ] I've pasted the output of argocd version.

Describe the bug

ApplicationSet resources experience data corruption where:

  • syncPolicy becomes an empty object: syncPolicy: {}
  • generator contains misplaced template objects with metadata and spec fields (that look like they should be at the root level)

Example (excerpt):

spec:
  syncPolicy: {}  # Should contain preserveResourcesOnDeletion: false
  generators:
  - pullRequest:
      # Normal pullRequest config
      template:        # This should NOT be here - belongs at root level
        metadata: {...}
        spec: {...}

This seems to happen randomly. ApplicationSets look correct after initial deployment, then get corrupted later without apparent cause. It does not happen on all (349) ApplicationSets, but only a subset (~60). We were not able to identify a pattern. These ~60 affected ApplicationSets are some of our "preview" environments, of which there are 74, so some remain unaffected.

Affected ApplicationSets Pattern

60+ ApplicationSets are affected, all are using:

  • pullRequest generators (both standalone and in matrix combinations)
  • Various generator combinations:
    • Direct pullRequest generators
    • matrix with pullRequest + git
    • matrix with list + pullRequest
  • All use goTemplate: true
  • All have preserveResourcesOnDeletion: false

But: there are other ApplicationSets that use these combinations and are not affected.

Inspecting the affected ApplicationSet resources in cluster, specifically the managedFields section, we could see that at least the generators field is managed by the application-set-controller:

  - apiVersion: argoproj.io/v1alpha1
    fieldsType: FieldsV1
    fieldsV1:
      f:spec:
        f:generators: {}
        f:template:
          f:spec:
            f:source:
              f:directory:
                f:jsonnet: {}
    manager: argocd-applicationset-controller
    operation: Update
    time: "2025-09-03T15:54:28Z"
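
For anyone who wants to run the same check across many ApplicationSets, here is a minimal sketch that lists which managers claim a given field path in a managedFields array. It assumes the array has been exported as JSON (for example via kubectl get -o json). The function names and the sample data are hypothetical and not part of Argo CD.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// managedFieldsEntry mirrors only the parts of a managedFields entry used
// here; fieldsV1 is kept as a generic nested map.
type managedFieldsEntry struct {
	Manager   string                 `json:"manager"`
	Operation string                 `json:"operation"`
	FieldsV1  map[string]interface{} `json:"fieldsV1"`
}

// ownsPath reports whether the fieldsV1 tree contains the given path,
// e.g. ["f:spec", "f:generators"].
func ownsPath(tree map[string]interface{}, path []string) bool {
	node := tree
	for _, key := range path {
		child, ok := node[key].(map[string]interface{})
		if !ok {
			return false
		}
		node = child
	}
	return true
}

// managersOwning returns the managers whose entry claims the given path.
func managersOwning(raw []byte, path ...string) ([]string, error) {
	var entries []managedFieldsEntry
	if err := json.Unmarshal(raw, &entries); err != nil {
		return nil, err
	}
	var owners []string
	for _, e := range entries {
		if ownsPath(e.FieldsV1, path) {
			owners = append(owners, e.Manager)
		}
	}
	return owners, nil
}

func main() {
	// Hypothetical sample shaped like the excerpt above.
	raw := []byte(`[
	  {"manager": "argocd-applicationset-controller", "operation": "Update",
	   "fieldsV1": {"f:spec": {"f:generators": {}, "f:template": {}}}},
	  {"manager": "hypothetical-other-manager", "operation": "Update",
	   "fieldsV1": {"f:metadata": {"f:annotations": {}}}}
	]`)
	owners, err := managersOwning(raw, "f:spec", "f:generators")
	if err != nil {
		panic(err)
	}
	fmt.Println(owners)
}
```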

We have been debugging this issue for several days now, including extensive vibe-coding sessions to identify patterns, but still have no idea what might be causing this.

To Reproduce

  • ApplicationSets deploy successfully initially
  • Corruption occurs spontaneously (during controller reconciliation cycles?)
  • No user action triggers the corruption, it seems
  • Pattern affects only some ApplicationSets, not all

Version

We're not using argocd CLI since we don't usually have direct access to ArgoCD. Version running is v3.0.11+240a183, deployed to Kubernetes using the community helm Chart.

Logs

No relevant logs from the ApplicationSet controller.

dsiebel avatar Sep 03 '25 20:09 dsiebel

@dsiebel Thank you for reporting this.

What you're experiencing looks rather strange. A few questions regarding your configuration:

Do you mean that your initial manifests do not have spec.generators[0].pullRequest.template set, and then you see it having been filled with spec and metadata from the root level of the same ApplicationSet? Or some other data?

It would be great if you could post some complete examples of the manifests before and after corruption. It would help getting an idea what is happening. Please don't forget to edit out sensitive information.

Some other questions:

  • Did you start experiencing the issue after an upgrade of ArgoCD version?
  • Are you managing your ApplicationSets with ArgoCD, helm, anything?
  • Have you considered using audit logs to understand what is changing the ApplicationSet manifests?
  • Does the corruption happen when ApplicationSet controller is disabled?

dudinea avatar Sep 04 '25 01:09 dudinea

@dudinea Thanks for getting back to me!

Here are some example ApplicationSets (shortened for readability) that are affected by this:

Raw YAML
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: <redacted>
  namespace: argocd
spec:
  goTemplate: true
  syncPolicy:
    preserveResourcesOnDeletion: false
  generators:
  - pullRequest: 
      github:
        # GitHub PR config with appSecretName and labels
  template:
    metadata:
      name: <redacted>-{{.number}}
    spec:
      project: <redacted>
      source:
        directory:
          include: '{*.yml,*.yaml}'
        repoURL: <redacted>
        targetRevision: prod # using branch tracking
        path: manifests/<redacted>/{{.number}}
      destination:
        name: <redacted>
      syncPolicy:
        automated:
          prune: true
          selfHeal: false
          allowEmpty: true
In-cluster manifest (after corruption)
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: <redacted>
  namespace: argocd
spec:
  generators:
  - pullRequest:
      github:
        # GitHub PR config with appSecretName and labels
      template: # <-- this shouldn't be here!
        metadata: {}
        spec:
          destination: {}
          project: ""
  goTemplate: true
  syncPolicy: {} # <-- this is now empty
  template:
    metadata:
      name: <redacted>-{{.number}}
    spec:
      destination:
        name: <redacted>
      project: <redacted>
      source:
        directory:
          include: '{*.yml,*.yaml}'
          jsonnet: {}
        path: manifests/<redacted>/{{.number}}
        repoURL: https://github.com/<redacted>
        targetRevision: prod # using branch tracking
      syncPolicy:
        automated:
          allowEmpty: true
          prune: true
status:
  conditions:
  - lastTransitionTime: "2025-09-03T09:12:13Z"
    message: Successfully generated parameters for all Applications
    reason: ApplicationSetUpToDate
    status: "False"
    type: ErrorOccurred
  - lastTransitionTime: "2025-09-03T09:12:13Z"
    message: Successfully generated parameters for all Applications
    reason: ParametersGenerated
    status: "True"
    type: ParametersGenerated
  - lastTransitionTime: "2025-09-03T09:12:13Z"
    message: ApplicationSet up to date
    reason: ApplicationSetUpToDate
    status: "True"
    type: ResourcesUpToDate
  resources:
  - group: argoproj.io
    health:
      lastTransitionTime: "2025-09-03T07:12:50Z"
      status: Healthy
    kind: Application
    name: <redacted>-13034
    namespace: argocd
    status: Synced
    version: v1alpha1

dsiebel avatar Sep 04 '25 08:09 dsiebel

Regarding your questions / remarks:

Did you start experiencing the issue after an upgrade of ArgoCD version?

We are not quite sure. We started to notice after our upgrade to 3.0.11. But we also tried to switch to server-side-apply right around the same time and reverted it, since these exact fields were causing conflicts.

Are you managing your ApplicationSets with ArgoCD, helm, anything?

We are using Terraform to apply the raw manifests, using the alekc/kubectl provider. The main reason for doing so is that Terraform gives us easy access to dependencies such as secrets, cluster credentials, etc.

Have you considered using audit logs to understand what is changing the ApplicationSet manifests?

Yes, but I couldn't get them to work. We watched the Kubernetes events and the managedFields section as an alternative.

Does the corruption happen when ApplicationSet controller is disabled?

That is an excellent point, I haven't thought of that yet. The problem here might be that the corruption only happens after "some time" or "some event", so we'd have to take it down for an unknown period of time and would block the entire company. We have not yet reproduced this issue in a lab / staging environment.

dsiebel avatar Sep 04 '25 08:09 dsiebel

In the meantime I found one issue that sounds very similar, at least for the template part being in the wrong place: https://github.com/argoproj/argo-cd/issues/18535

Maybe there's a correlation.

dsiebel avatar Sep 04 '25 08:09 dsiebel

Does the corruption happen when ApplicationSet controller is disabled?

By now the application-set-controller has been disabled for 36h and there are no corrupted ApplicationSets so far. The deployment was scaled down Friday 22:00 CEST, so outside of office hours to not impact the daily business. But I think it's a strong indication that the corruption is caused by the ApplicationSet controller. We will keep it disabled for another 24h.

dsiebel avatar Sep 07 '25 08:09 dsiebel

We narrowed the cause of the issue down to the Webhook API of the application-set-controller. We left the application-set-controller disabled for an entire weekend (72h+) and nothing happened. Before scaling it up again, we disabled the ApplicationSet webhooks for the PR generator (we use these to cut down the start-up time for preview environments). The application-set-controller then ran for another 4h without any data corruption. I then manually sent a single pull_request event to the ApplicationSet webhook API (/api/webhook) and the data corruption happened a few seconds later.
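
A rough sketch of how such a manual event could be constructed, for anyone trying to reproduce this. The base URL, owner and repository names are placeholders. The payload contains only a handful of fields from GitHub's pull_request event, and whether the controller needs more than these is an assumption. If a webhook secret is configured, a signature header would presumably be required as well. The request is only built here, not sent.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// newPullRequestEvent builds (but does not send) a GitHub-style pull_request
// webhook request against <baseURL>/api/webhook.
// The payload is a reduced, hypothetical subset of a real GitHub event.
func newPullRequestEvent(baseURL, owner, repo string, number int) (*http.Request, error) {
	payload := map[string]interface{}{
		"action": "opened",
		"number": number,
		"repository": map[string]interface{}{
			"name":      repo,
			"full_name": owner + "/" + repo,
			"owner":     map[string]interface{}{"login": owner},
		},
	}
	body, err := json.Marshal(payload)
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequest(http.MethodPost, baseURL+"/api/webhook", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	// GitHub identifies the event type through this header.
	req.Header.Set("X-GitHub-Event", "pull_request")
	return req, nil
}

func main() {
	// Placeholder endpoint and repository; replace with real values.
	req, err := newPullRequestEvent("https://appset-webhook.example.com", "example-org", "example-repo", 1)
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL.String(), req.Header.Get("X-GitHub-Event"))
	// To actually send it: resp, err := http.DefaultClient.Do(req)
}
```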

I already went through the code a bit, but I couldn't find a specific place that might be responsible for this.

dsiebel avatar Sep 08 '25 11:09 dsiebel

FYI: we just finished upgrading to the latest ArgoCD v3.1.5, and the issue still exists in that one.

dsiebel avatar Sep 15 '25 15:09 dsiebel

We could confirm that it has to do with the way ApplicationSets are updated by the Webhook handler: https://github.com/argoproj/argo-cd/blob/master/applicationset/webhook/webhook.go#L610-L620 The handler sends the entire ApplicationSet struct to the kube API, including all default-valued fields, so an empty SyncPolicy and an empty generators.*.Template struct really are sent to the API in this "broken" form.
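
To illustrate the mechanism, here is a minimal sketch of how a decode/encode round trip through Go structs can produce exactly these shapes. The types below are simplified stand-ins invented for this example, not the actual Argo CD types. The sketch assumes the real fields are declared in a similar way: a boolean with omitempty, and a struct-valued template field.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Simplified, hypothetical stand-ins; NOT the real Argo CD API types.
type syncPolicy struct {
	// A false bool with omitempty is dropped on encode.
	PreserveResourcesOnDeletion bool `json:"preserveResourcesOnDeletion,omitempty"`
}

type appSpec struct {
	Project string `json:"project"`
}

type template struct {
	Spec appSpec `json:"spec"`
}

type pullRequest struct {
	Repo string `json:"repo,omitempty"`
	// encoding/json never treats a struct value as empty, so omitempty
	// has no effect here and the zero template is still emitted.
	Template template `json:"template,omitempty"`
}

type spec struct {
	SyncPolicy  *syncPolicy `json:"syncPolicy,omitempty"`
	PullRequest pullRequest `json:"pullRequest"`
}

// roundTrip decodes the input into the typed struct and encodes it again,
// which is roughly what a full-object update does.
func roundTrip(in string) (string, error) {
	var s spec
	if err := json.Unmarshal([]byte(in), &s); err != nil {
		return "", err
	}
	out, err := json.Marshal(s)
	return string(out), err
}

func main() {
	in := `{"syncPolicy":{"preserveResourcesOnDeletion":false},"pullRequest":{"repo":"example"}}`
	out, err := roundTrip(in)
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
	// prints {"syncPolicy":{},"pullRequest":{"repo":"example","template":{"spec":{"project":""}}}}
}
```

The explicit false is lost, leaving syncPolicy: {}, and a template that was never in the input appears with zero-valued fields.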

As far as we can tell, the only thing that is being patched in is the argocd.argoproj.io/application-set-refresh annotation. This could also be done using a partial Metadata Patch like so:

// import (
//     metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
//     "sigs.k8s.io/controller-runtime/pkg/client"
// )

// Patch only the refresh annotation instead of updating the whole object.
c.Patch(context.Background(), &metav1.PartialObjectMetadata{
	TypeMeta: metav1.TypeMeta{
		Kind:       "ApplicationSet",
		APIVersion: "argoproj.io/v1alpha1",
	},
	ObjectMeta: metav1.ObjectMeta{
		Name:      appSet.Name,
		Namespace: appSet.Namespace,
		Annotations: map[string]string{
			common.AnnotationApplicationSetRefresh: "true",
		},
	},
}, client.Merge)

This potentially makes the retryOnConflict and the Get obsolete as well.
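
For reference, a small sketch of what an annotation-only JSON merge patch body (RFC 7386) looks like on the wire. It is built by hand here with the standard library, so it only approximates what the client call above would send. The function name is hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// refreshAnnotation is the annotation key named in this thread.
const refreshAnnotation = "argocd.argoproj.io/application-set-refresh"

// refreshMergePatch returns a JSON merge patch that sets only the refresh
// annotation and mentions no other field of the ApplicationSet.
func refreshMergePatch() ([]byte, error) {
	patch := map[string]interface{}{
		"metadata": map[string]interface{}{
			"annotations": map[string]string{refreshAnnotation: "true"},
		},
	}
	return json.Marshal(patch)
}

func main() {
	body, err := refreshMergePatch()
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body))
	// prints {"metadata":{"annotations":{"argocd.argoproj.io/application-set-refresh":"true"}}}
}
```

Fields that a merge patch does not mention are left untouched, so spec.syncPolicy and the generators never get rewritten with defaults.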


Just to clarify: Semantically, there is no real issue here. What we initially perceived as corruption is just the default values being rendered and applied to the cluster. They do show up as a recurring diff in kubectl diff, though, and potentially also in ArgoCD when, for example, an App-of-AppSets is used to manage ApplicationSets.

dsiebel avatar Sep 16 '25 14:09 dsiebel

@dudinea I created a small draft PR to discuss the proposed fix: https://github.com/argoproj/argo-cd/pull/24586

dsiebel avatar Sep 16 '25 20:09 dsiebel

@dudinea, @crenshaw-dev Any chance you could have another look here and in the Draft PR? Anything missing to move this forward? Feedback is much appreciated!

dsiebel avatar Oct 02 '25 07:10 dsiebel

@dudinea, @crenshaw-dev (apologies for the repeated direct ping) It's been almost two months and this is still very much an issue for us. What can I do to move this forward?

dsiebel avatar Oct 29 '25 15:10 dsiebel

@dsiebel thank you for your PR and sorry for the delay, I somehow missed your first ping. I'll try to take a look at it tomorrow.

dudinea avatar Oct 29 '25 16:10 dudinea

Hi @dsiebel! Please see my comment in the PR. Sorry again for the delays.

dudinea avatar Nov 01 '25 09:11 dudinea

Hi @dudinea! Thanks for getting back to me! And no worries, I only managed to check in on this every few weeks myself. I replied in the PR.

dsiebel avatar Nov 04 '25 16:11 dsiebel

We added this to our helm values as a workaround for now:

    # ? https://github.com/argoproj/argo-cd/issues/24378 - ignoring for all generators and sub-generators
    resource.customizations.ignoreDifferences.argoproj.io_ApplicationSet: |
      jqPathExpressions:
        - .spec.generators[]?.[]?.template
        - .spec.generators[]?.[]?.generators[]?.[]?.template
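
To spell out which fields those two expressions select, here is a rough sketch that removes the same paths from a decoded ApplicationSet: the template key under every generator type, plus one level of nested generators (matrix/merge). It only illustrates the path semantics and is not how Argo CD implements ignoreDifferences. The function names are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// dropTemplates deletes the "template" key under every generator type in the
// list and descends into nested "generators" lists up to depth more levels.
func dropTemplates(generators []interface{}, depth int) {
	for _, g := range generators {
		gen, ok := g.(map[string]interface{})
		if !ok {
			continue
		}
		for _, v := range gen {
			cfg, ok := v.(map[string]interface{})
			if !ok {
				continue
			}
			delete(cfg, "template")
			if nested, ok := cfg["generators"].([]interface{}); ok && depth > 0 {
				dropTemplates(nested, depth-1)
			}
		}
	}
}

// stripIgnoredPaths mirrors the two jq paths above on a JSON document.
func stripIgnoredPaths(doc []byte) (string, error) {
	var appSet map[string]interface{}
	if err := json.Unmarshal(doc, &appSet); err != nil {
		return "", err
	}
	if spec, ok := appSet["spec"].(map[string]interface{}); ok {
		if gens, ok := spec["generators"].([]interface{}); ok {
			dropTemplates(gens, 1) // top level plus one nested level
		}
	}
	out, err := json.Marshal(appSet)
	return string(out), err
}

func main() {
	// Hypothetical input with a direct and a matrix-nested pullRequest generator.
	in := `{"spec":{"generators":[
	  {"pullRequest":{"github":{},"template":{"metadata":{}}}},
	  {"matrix":{"generators":[{"pullRequest":{"template":{}}}]}}
	]}}`
	out, err := stripIgnoredPaths([]byte(in))
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```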

philstevenson avatar Dec 01 '25 12:12 philstevenson