argo-cd icon indicating copy to clipboard operation
argo-cd copied to clipboard

Empty objects vs. null objects cause repeated syncs

Open pgier opened this issue 9 months ago • 7 comments

Summary

When a config option is not set in the a kube manifest (e.g. statefulset) ArgoCD continually tries to re-sync the object because it sees a diff between an empty and undefined object.

Motivation

I have a statefulset that does not define the resources section. ArgoCD continually tries to sync the object because Kubernetes automatically injects an empty object. Screenshot 2024-05-14 at 7 50 11 AM

Proposal

The diff calculator should ignore the difference between an empty and null object in a kubernetes resource field.

Related

https://github.com/argoproj/argo-cd/issues/15554

pgier avatar May 14 '24 12:05 pgier

Is this with Server Side Diff on or off? I would imagine looking at https://argo-cd.readthedocs.io/en/stable/user-guide/diff-strategies/#server-side-diff may help with scenarios like this. I don't believe there's plans to make the default Server Side Diff until a 3.x release but this may be something you want to take a look at...

blakeromano avatar May 15 '24 03:05 blakeromano

Is this with Server Side Diff on or off? I would imagine looking at https://argo-cd.readthedocs.io/en/stable/user-guide/diff-strategies/#server-side-diff may help with scenarios like this. I don't believe there's plans to make the default Server Side Diff until a 3.x release but this may be something you want to take a look at...

i just tried enabling that and it didn't seem to make a difference on my environment

evanrich avatar May 15 '24 20:05 evanrich

FWIW, this works for me for the "resources: {}" diff

      ignoreDifferences:
      - group: apps
        kind: Deployment
        jqPathExpressions:
        - '.spec.template.spec.containers[].resources'
      - group: apps
        kind: StatefulSet
        jqPathExpressions:
        - '.spec.template.spec.containers[].resources'

evanrich avatar May 15 '24 20:05 evanrich

Is this with Server Side Diff on or off?

I haven't set any options for server side diff in the application or the main configmap, so it should be off. On the application I only have these sync options set.

        syncOptions:
        - CreateNamespace=true
        - ServerSideApply=true

@evanrich Thanks, I was hoping I wouldn't have to ignore resources for the cases where there are real differences, but I can probably use that for now.

pgier avatar May 22 '24 15:05 pgier

This is very strange, I have multiple clusters with basically the same configuration. AWS EKS (same region) Kube version: 1.28.8 ArgoCD version: v2.10.7+b060053

Same application (external-secrets) deployed to both clusters. One of the clusters shows a diff on the empty vs. null resources and continually tries to resync. The other cluster shows the same difference if you look at the live manifest vs. desired manifest, but does not display anything on the diff tab and does not try to re-sync.

I verified that on the cluster which is not showing a diff and therefore not syncing has resources: {} in the live manifest and nothing defined for resources in the desired manifest. So for some reason this cluster is successfully ignoring that difference even though I didn't configure any ignoreDifferences.

So same application running in the same version of Argo and Kubernetes seems to sometimes ignore this difference and sometimes not ignore it.

pgier avatar May 22 '24 17:05 pgier

Noticed this problem with version 2.11 quite a lot

Adding this to ignoreDifferences doesn't seem like a good idea to me

batazor avatar May 24 '24 02:05 batazor

This has suddenly emerged in my k3s cluster as well.

Kubernetes version: v1.30.0+k3s1 Argo: v2.11.2

javydekoning avatar May 25 '24 18:05 javydekoning

Similar issue: https://github.com/argoproj/argo-cd/issues/13004

pgier avatar Jun 07 '24 14:06 pgier

Just wanted to note that I also tried enabling server side diff, but it didn't have any effect on this issue.

pgier avatar Jun 07 '24 16:06 pgier

I haven't seen any updates in the release notes for recent releases that indicate this has been changed, so I'm staying on helm chart version 6.7.18 (argo v2.10.9) hopefully this gets addressed/commented on soon because it's not just empty objects but the a ton of other resources as well are suddenly throwing sync errors in latest versions.

evanrich avatar Jun 07 '24 18:06 evanrich

Same problem here. Kubernetes: k3s 1.30.0+k3s1 ArgoCD: v2.11.2+25f7504

Like others, I have used ignoreDifferences as a workaround for now, but the fix will be very useful :) If someone doesn't want to list all kinds, they can use *:

ignoreDifferences:
  - group: apps
    kind: "*"
    jqPathExpressions:
      - ".spec.template.spec.containers[].resources"
  - group: apps
    kind: "*"
    jqPathExpressions:
      - ".spec.template.spec.initContainers[].resources"

vertisan avatar Jun 07 '24 18:06 vertisan

Comparing a working vs. non-working cluster, I found that the gvkParser is nil on the non-working cluster (the one where the empty resources is detected as a difference). Looks like the gvkParser is not created due to some invalid openapi schema in the cluster. Unfortunately the error message was not printing correctly so it took a while to track this down. https://github.com/argoproj/gitops-engine/pull/585

With some changes to the code I see this error message in the logs on the cluster that's not working:

"error creating gvk parser: duplicate entry for /v1, Kind=APIResourceList" server="https://kubernetes.default.svc"

See also: https://github.com/argoproj/gitops-engine/issues/425

This seems to be a known issue in the GVKParser: https://github.com/kubernetes/kubernetes/issues/103597

pgier avatar Jun 12 '24 18:06 pgier

The issue in my case turned out to be caused by an older version of the kubernetes metrics server installed in the EKS cluster. First I updated updated the application controller using a locally built container (see https://github.com/argoproj/gitops-engine/pull/585) to display the parser error.

duplicate entry for /v1, Kind=APIResourceList"

This duplication comes from the kube api server having multiple versions of the APIResourceList definition, which can be seen in the raw openapi schema.

kubectl get --raw /openapi/v2 | jq -r '.definitions | keys | .[]' | sort | grep io.k8s.apimachinery.pkg.apis.meta.v1

Then I searched for where APIResourceList_v2 is referenced.

kubectl get --raw /openapi/v2 | jq | grep "io.k8s.apimachinery.pkg.apis.meta.v1.APIResourceList_v2" -B 25

The only place it was referenced was the metrics server api endpoint /apis/metrics.k8s.io/v1beta1/. After applying a recent metrics server config, the issue seems to be resolved.

kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/download/v0.7.1/components.yaml

After the metrics update removed APIResourceList_v2 the issue went away and Argo gives a clean diff. Even though the cluster has some other definitions with a _v2, those don't cause a problem for the parser, probably because they are pruned (https://github.com/kubernetes/kubernetes/pull/110179).
I don't know if this is the same cause for other people that are having this issue, but hopefully these steps can help to debug this.

Edit: also wanted to note that after applying the newer metrics server it takes a few minutes before the openapi schema is updated, and then you'll also need to restart the argocd-application-controller pod.

pgier avatar Jun 13 '24 14:06 pgier

Interesting. I've seen similar duplications in k8s 1.30.0, so I'll be diving into the problem more. Glad you were able to sort it in your instance!

crenshaw-dev avatar Jun 13 '24 14:06 crenshaw-dev

Interesting. I've seen similar duplications in k8s 1.30.0, so I'll be diving into the problem more. Glad you were able to sort it in your instance!

Same here, fresh k3s deployment ships with docker.io/rancher/mirrored-metrics-server:v0.7.0 and still faces this issue.

javydekoning avatar Jun 13 '24 14:06 javydekoning

I'm seeing the same since (I think) I upgraded from k8s 1.29.* to 1.30.1. Most of my helm charts render resources: {} when not specifying any, so it only affects two charts for me: kube-prometheus-stack and mariadb-operator. I can also confirm: server side diff doesn't change anything.

pschichtel avatar Jun 14 '24 00:06 pschichtel

This disappeared for me with the upgrade to 1.30.2

pschichtel avatar Jun 25 '24 17:06 pschichtel

@pschichtel interesting... was there a change in version of metrics-server? (Or of any other aggregated API?)

crenshaw-dev avatar Jun 25 '24 17:06 crenshaw-dev

@crenshaw-dev metrics-server was updated as part of the 1.29.* -> 1.30.1 upgrade, but it hasn't changed when upgrading to 1.30.2

pschichtel avatar Jun 25 '24 19:06 pschichtel

That's strange... are you able to port-forward your metrics-server service and dump the /openapi/v2 output from that service?

crenshaw-dev avatar Jun 25 '24 19:06 crenshaw-dev

not sure that's helpful

curl -k https://localhost:10250/openapi/v2
{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {},
  "status": "Failure",
  "message": "forbidden: User \"system:anonymous\" cannot get path \"/openapi/v2\"",
  "reason": "Forbidden",
  "details": {},
  "code": 403
}

pschichtel avatar Jun 25 '24 19:06 pschichtel

I just checked argo again and the issue actually came back, sorry for the confusion!

pschichtel avatar Jun 25 '24 20:06 pschichtel

Okay that makes more sense, I'm not seeing any relevant changes in k8s or metrics-server.

I spent a bunch of time today trying to de-dupe OpenAPI models before feeding them to GVKParser. It avoids the NewGVKParser error, but apparently some models (like DeploymentSpec are disappearing from the GVKParser's schema), so I'm getting other failures during diff construction.

Working on it though.

crenshaw-dev avatar Jun 25 '24 21:06 crenshaw-dev

I'm working on the fix and have written up a description of the issue here: https://github.com/argoproj/gitops-engine/pull/587/files#diff-bbeadede3a86f27922253617b66c49b7672bb52896a10cf5626b398fa0fa9280

I'm trying to decide between just logging the dropped duplicate GVKs or also surfacing to the user that their diff might be wrong due to duplicate GVKs. So far I'm trying to surface it to the user, but might end up being more code than it's worth.

A couple open questions:

  1. Should aggregated APIs like metrics-server even include definitions for things like APIResourceList in their OpenAPI doc? afaik, that type isn't directly used in any of the API's user-facing functions. Maybe it's for Kubernetes' benefit, doing discovery?
  2. Would constructing proto.Models from an OpenAPI v3 doc make these problems go away? I haven't tested, because I'm not sure how to construct a full OpenAPI v3 doc from the per-type docs currently served by k8s. Even if I did construct that doc, I'm not confident that it wouldn't hit the same duplicate GVK issue when constructing the GVKParser.

crenshaw-dev avatar Jun 26 '24 22:06 crenshaw-dev

The issue seems to still be there for Rollout, analysisRunMetadata field (see the screenshot). Screenshot 2024-06-30 at 8 34 28 AM I've build a custom image from latest master (+1 cherry pick), ArgoCD pods were updated, even Redis ones. Update: I'd wait for a bit in case it's a stale UI issue.

andrii-korotkov-verkada avatar Jun 30 '24 15:06 andrii-korotkov-verkada

The app eventually got in sync.

andrii-korotkov-verkada avatar Jun 30 '24 15:06 andrii-korotkov-verkada

I can also confirm the issue is gone. thanks!

pschichtel avatar Jul 03 '24 22:07 pschichtel