Empty objects vs. null objects cause repeated syncs
Summary
When a config option is not set in a kube manifest (e.g. a statefulset), ArgoCD continually tries to re-sync the object because it sees a diff between an empty object and an undefined one.
Motivation
I have a statefulset that does not define the resources section. ArgoCD continually tries to sync the object because Kubernetes automatically injects an empty object.
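To illustrate (the container name and image below are placeholders, not from the original report): the manifest in Git omits the resources field entirely, while the live object returned by the API server carries an empty map, and that mismatch is what keeps showing up in the diff.
# Desired state in Git: no resources key at all
containers:
- name: app              # placeholder name
  image: example/app:1.0 # placeholder image
# Live object from the API server: an empty map has been injected
containers:
- name: app
  image: example/app:1.0
  resources: {}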
Proposal
The diff calculator should ignore the difference between an empty object and a null object in a Kubernetes resource field.
Related
https://github.com/argoproj/argo-cd/issues/15554
Is this with Server Side Diff on or off? I would imagine looking at https://argo-cd.readthedocs.io/en/stable/user-guide/diff-strategies/#server-side-diff may help with scenarios like this. I don't believe there are plans to make Server Side Diff the default until a 3.x release, but this may be something you want to take a look at...
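For reference, the linked page describes two ways to turn Server-Side Diff on; a minimal sketch, assuming the annotation and ConfigMap keys documented there (per-Application annotation, or controller-wide via argocd-cmd-params-cm):
# Per Application: annotation on the Application resource
metadata:
  annotations:
    argocd.argoproj.io/compare-options: ServerSideDiff=true
# Or controller-wide: argocd-cmd-params-cm ConfigMap
data:
  controller.diff.server.side: "true"
The application controller usually needs to be restarted to pick up argocd-cmd-params-cm changes.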
I just tried enabling that, and it didn't seem to make a difference in my environment.
FWIW, this works for me for the "resources: {}" diff
ignoreDifferences:
- group: apps
  kind: Deployment
  jqPathExpressions:
  - '.spec.template.spec.containers[].resources'
- group: apps
  kind: StatefulSet
  jqPathExpressions:
  - '.spec.template.spec.containers[].resources'
Is this with Server Side Diff on or off?
I haven't set any options for server side diff in the application or the main configmap, so it should be off. On the application I only have these sync options set.
syncOptions:
- CreateNamespace=true
- ServerSideApply=true
@evanrich Thanks, I was hoping I wouldn't have to ignore resources for the cases where there are real differences, but I can probably use that for now.
This is very strange; I have multiple clusters with basically the same configuration: AWS EKS (same region), Kube version 1.28.8, ArgoCD version v2.10.7+b060053.
Same application (external-secrets) deployed to both clusters. One of the clusters shows a diff on the empty vs. null resources and continually tries to resync. The other cluster shows the same difference if you look at the live manifest vs. desired manifest, but does not display anything on the diff tab and does not try to re-sync.
I verified that the cluster which is not showing a diff (and therefore not syncing) has resources: {} in the live manifest and nothing defined for resources in the desired manifest. So for some reason this cluster is successfully ignoring that difference even though I didn't configure any ignoreDifferences.
So the same application running in the same version of Argo and Kubernetes seems to sometimes ignore this difference and sometimes not.
Noticed this problem with version 2.11 quite a lot
Adding this to ignoreDifferences doesn't seem like a good idea to me
This has suddenly emerged in my k3s cluster as well.
Kubernetes version: v1.30.0+k3s1 Argo: v2.11.2
Similar issue: https://github.com/argoproj/argo-cd/issues/13004
Just wanted to note that I also tried enabling server side diff, but it didn't have any effect on this issue.
I haven't seen any updates in the release notes for recent releases that indicate this has been changed, so I'm staying on helm chart version 6.7.18 (argo v2.10.9). Hopefully this gets addressed/commented on soon, because it's not just empty objects; a ton of other resources are also suddenly throwing sync errors in the latest versions.
Same problem here.
Kubernetes: k3s 1.30.0+k3s1
ArgoCD: v2.11.2+25f7504
Like others, I have used ignoreDifferences as a workaround for now, but the fix will be very useful :)
If someone doesn't want to list all kinds, they can use *:
ignoreDifferences:
- group: apps
  kind: "*"
  jqPathExpressions:
  - ".spec.template.spec.containers[].resources"
- group: apps
  kind: "*"
  jqPathExpressions:
  - ".spec.template.spec.initContainers[].resources"
Comparing a working vs. non-working cluster, I found that the gvkParser is nil on the non-working cluster (the one where the empty resources field is detected as a difference). It looks like the gvkParser is not created due to an invalid openapi schema in the cluster. Unfortunately the error message was not being printed correctly, so it took a while to track this down. https://github.com/argoproj/gitops-engine/pull/585
With some changes to the code I see this error message in the logs on the cluster that's not working:
"error creating gvk parser: duplicate entry for /v1, Kind=APIResourceList" server="https://kubernetes.default.svc"
See also: https://github.com/argoproj/gitops-engine/issues/425
This seems to be a known issue in the GVKParser: https://github.com/kubernetes/kubernetes/issues/103597
The issue in my case turned out to be caused by an older version of the Kubernetes metrics server installed in the EKS cluster. First I updated the application controller using a locally built container (see https://github.com/argoproj/gitops-engine/pull/585) to display the parser error.
duplicate entry for /v1, Kind=APIResourceList
This duplication comes from the kube api server having multiple versions of the APIResourceList definition, which can be seen in the raw openapi schema.
kubectl get --raw /openapi/v2 | jq -r '.definitions | keys | .[]' | sort | grep io.k8s.apimachinery.pkg.apis.meta.v1
Then I searched for where APIResourceList_v2 is referenced.
kubectl get --raw /openapi/v2 | jq | grep "io.k8s.apimachinery.pkg.apis.meta.v1.APIResourceList_v2" -B 25
The only place it was referenced was the metrics server api endpoint /apis/metrics.k8s.io/v1beta1/.
After applying a recent metrics server config, the issue seems to be resolved.
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/download/v0.7.1/components.yaml
After the metrics-server update removed APIResourceList_v2, the issue went away and Argo gives a clean diff. Even though the cluster has some other definitions with a _v2 suffix, those don't cause a problem for the parser, probably because they are pruned (https://github.com/kubernetes/kubernetes/pull/110179).
I don't know if this is the same cause for other people that are having this issue, but hopefully these steps can help to debug this.
Edit: also wanted to note that after applying the newer metrics server it takes a few minutes before the openapi schema is updated, and then you'll also need to restart the argocd-application-controller pod.
Interesting. I've seen similar duplications in k8s 1.30.0, so I'll be diving into the problem more. Glad you were able to sort it in your instance!
Same here, fresh k3s deployment ships with docker.io/rancher/mirrored-metrics-server:v0.7.0 and still faces this issue.
I'm seeing the same since (I think) I upgraded from k8s 1.29.* to 1.30.1. Most of my helm charts render resources: {} when not specifying any, so it only affects two charts for me: kube-prometheus-stack and mariadb-operator.
I can also confirm: server side diff doesn't change anything.
This disappeared for me with the upgrade to 1.30.2
@pschichtel interesting... was there a change in version of metrics-server? (Or of any other aggregated API?)
@crenshaw-dev metrics-server was updated as part of the 1.29.* -> 1.30.1 upgrade, but it hasn't changed when upgrading to 1.30.2
That's strange... are you able to port-forward your metrics-server service and dump the /openapi/v2 output from that service?
not sure that's helpful
curl -k https://localhost:10250/openapi/v2
{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {},
  "status": "Failure",
  "message": "forbidden: User \"system:anonymous\" cannot get path \"/openapi/v2\"",
  "reason": "Forbidden",
  "details": {},
  "code": 403
}
I just checked argo again and the issue actually came back, sorry for the confusion!
Okay that makes more sense, I'm not seeing any relevant changes in k8s or metrics-server.
I spent a bunch of time today trying to de-dupe OpenAPI models before feeding them to GVKParser. It avoids the NewGVKParser error, but apparently some models (like DeploymentSpec) are disappearing from the GVKParser's schema, so I'm getting other failures during diff construction.
Working on it though.
I'm working on the fix and have written up a description of the issue here: https://github.com/argoproj/gitops-engine/pull/587/files#diff-bbeadede3a86f27922253617b66c49b7672bb52896a10cf5626b398fa0fa9280
I'm trying to decide between just logging the dropped duplicate GVKs or also surfacing to the user that their diff might be wrong due to duplicate GVKs. So far I'm trying to surface it to the user, but it might end up being more code than it's worth.
A couple open questions:
- Should aggregated APIs like metrics-server even include definitions for things like APIResourceList in their OpenAPI doc? afaik, that type isn't directly used in any of the API's user-facing functions. Maybe it's for Kubernetes' benefit, doing discovery?
- Would constructing proto.Models from an OpenAPI v3 doc make these problems go away? I haven't tested, because I'm not sure how to construct a full OpenAPI v3 doc from the per-type docs currently served by k8s. Even if I did construct that doc, I'm not confident that it wouldn't hit the same duplicate GVK issue when constructing the GVKParser.
The issue still seems to be there for the Rollout analysisRunMetadata field (see the screenshot).
I've built a custom image from latest master (+1 cherry-pick); the ArgoCD pods were updated, even the Redis ones.
Update: I'd wait for a bit in case it's a stale UI issue.
The app eventually got in sync.
I can also confirm the issue is gone. Thanks!