ArgoCD 2.11 - Loop of PATCH calls to Application objects
Is it possible that https://github.com/argoproj/argo-cd/pull/18061 has suddenly changed the behavior of the Application Controller so that it isn't comparing things the same way? We are seeing a massive (relatively speaking) jump in the PATCH requests/second that the ArgoCD App controller is making across the ~10 clusters that we did this upgrade on.
I suspect the issue has to do with how the code compares the valueFiles lists from before and after. We had some valueFiles lists that contained duplicate values, and when we run kubectl get applications -w -o json | kubectl grep -w, we see diffs like these:
  revision: "794fcbae2671366663e107cd50079c2e96894591"
  source:
    helm:
      ignoreMissingValueFiles: true
      valueFiles:
      - "values.yaml"
      - "values.staging-xx.yaml"
      - "values.staging-xx.yaml"
      - "values/staging/values.yaml"
      - "values/staging/staging-xx/values.yaml"
      - "values/staging/staging-xx/types/values.otherfile.yaml"
      - "values/staging/staging-xx/values.staging-xx.yaml"
-     -
+     - "values.staging-xx.yaml"
      values: "
        cluster:
          accountId: ".."
Originally posted by @diranged in https://github.com/argoproj/argo-cd/issues/18061#issuecomment-2102964797
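For context on the duplicate-entries suspicion above: an exact slice comparison (which is what reflect.DeepEqual does) treats a repeated entry as significant, so a de-duplicated copy of the list never compares equal to the original. A minimal Go sketch, with illustrative values only (this is not Argo CD code):

// Illustrative only: shows why a duplicated valueFiles entry matters
// to an exact, element-by-element comparison.
package main

import (
    "fmt"
    "reflect"
)

func main() {
    // valueFiles as rendered in spec.source.helm, including the duplicate...
    spec := []string{"values.yaml", "values.staging-xx.yaml", "values.staging-xx.yaml"}
    // ...versus a de-duplicated copy recorded elsewhere.
    compared := []string{"values.yaml", "values.staging-xx.yaml"}

    // false: the lengths differ, so the two lists never compare equal.
    fmt.Println(reflect.DeepEqual(spec, compared))
}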
We're reverting back to 2.10.9
... but this is happening on every cluster we have across the board. We caught this in staging, so ~10 clusters or so.
We tried the upgrade again just for fun (no, actually I forgot about the issue and clicked the upgrade button ... :facepalm:), and it's reliable: it happens again right away. Any suggestions on how to troubleshoot it?
Thank you for reporting it, @diranged! Trying to reproduce and hopefully get a fix soon.
We see something similar to this. The patches only contain the status update, with an updated timestamp for reconciledAt. We have not tried rolling back to 2.10 but will probably give it a shot.
We're only seeing this for applications that use helm charts fetched from our helm repository, though. Helm charts fetched from git do not behave this way. Hope that helps.
So we tried rolling back to 2.10.12, but the issue persisted. I'm guessing our problem is unrelated.
@PaulFarver 2.10.10 onwards has the commit, can u try 2.10.9?
Ah... my bad. We'll try rolling back to 2.10.9
@PaulFarver Did you roll back?
@alexmt Any progress? Can I help in any way to try to reproduce this? This is a consistent problem for us on every cluster (though it seems to be better or worse on different Application objects... I have one right now that updates every single second).
One tip - I noticed that on at least one of our applications this problem is 10x worse (meaning it updates nearly every 300ms) when I have a Lua healthcheck looking at some custom Crossplane resources we manage. If I remove that healthcheck, we go back to a reconciliation update every ~1-3 seconds.
Here's the example Lua:
"*.aws.xxx.com/*":
health.lua: |-
health_status = {
status = "Progressing",
message = "Provisioning ..."
}
if obj.status == nil or obj.status.conditions == nil then
return health_status
end
for i, condition in ipairs(obj.status.conditions) do
if condition.type == "Synced" then
if condition.status == "False" then
health_status.status = "Degraded"
health_status.message = condition.message
return health_status
end
end
if condition.type == "Ready" then
if condition.status == "True" then
health_status.status = "Healthy"
health_status.message = "Resource is up-to-date."
return health_status
else
health_status.status = "Degraded"
health_status.message = condition.message
return health_status
end
end
end
return health_status
@alexmt,
I've been troubleshooting this for a while now and I found a few issues... I'm still digging, but there is definitely an issue with the Lua healthcheck scripts. When we have them in place, we have updates happening every second and sometimes more often:
+ reconciledAt: "2024-06-14T14:39:50Z"
+ reconciledAt: "2024-06-14T14:39:51Z"
+ reconciledAt: "2024-06-14T14:39:52Z"
+ reconciledAt: "2024-06-14T14:39:53Z"
+ reconciledAt: "2024-06-14T14:39:54Z"
+ reconciledAt: "2024-06-14T14:39:55Z"
+ reconciledAt: "2024-06-14T14:39:56Z"
+ reconciledAt: "2024-06-14T14:39:57Z"
+ reconciledAt: "2024-06-14T14:39:58Z"
+ reconciledAt: "2024-06-14T14:39:59Z"
+ reconciledAt: "2024-06-14T14:40:00Z"
+ reconciledAt: "2024-06-14T14:40:01Z"
+ reconciledAt: "2024-06-14T14:40:02Z"
If I disable the various healthcheck scripts, then the update rate drops dramatically:
+ reconciledAt: "2024-06-14T13:37:54Z"
+ reconciledAt: "2024-06-14T13:38:08Z"
+ reconciledAt: "2024-06-14T13:38:22Z"
+ reconciledAt: "2024-06-14T13:38:43Z"
+ reconciledAt: "2024-06-14T13:38:58Z"
This is not the only issue - but it's certainly contributing significantly.
Here's our current config:
k get cm argocd-cm -n argocd-system -o yaml | yq '.data."resource.customizations"'
apps/Deployment:
  health.lua.useOpenLibs: true
  health.lua: |-
    hc = {}
    if obj.spec.paused ~= nil and obj.spec.paused then
      hc.status = "Suspended"
      hc.message = "Deployment is paused"
      return hc
    end
    if obj.metadata.generation <= obj.status.observedGeneration
    then
      if obj.status.conditions ~= nil and obj.status.replicas ~= nil and obj.status.updatedReplicas ~= nil then
        for _, condition in ipairs(obj.status.conditions) do
          if condition.reason == "ProgressDeadlineExceeded" then
            hc.status = "Degraded"
            hc.message = string.format("Deployment %s has exceeded its progress deadline", obj.metadata.name)
            return hc
          end
        end
        if obj.metadata.annotations["app.mydomain.com/deploymentHealthyPercentage"] ~= nil and obj.status.availableReplicas ~= nil then
          current_deploy = (obj.status.availableReplicas/obj.status.replicas) * 100
          threshold = obj.metadata.annotations["app.mydomain.com/deploymentHealthyPercentage"]
          healthy_threshold = tonumber(threshold)
          if (current_deploy ~= nil and healthy_threshold ~= nil and (current_deploy >= healthy_threshold)) then
            hc.status = "Healthy"
            hc.message = string.format("Healthy deployment threshold met with %d percent available replicas / desired replicas", healthy_threshold)
            return hc
          end
        end
        if obj.spec.replicas ~= nil and (obj.status.updatedReplicas < obj.spec.replicas) then
          hc.status = "Progressing"
          hc.message = string.format("Waiting for rollout to finish: %d out of %d new replicas have been updated", obj.status.updatedReplicas, obj.status.replicas)
          return hc
        end
        if (obj.status.replicas > obj.status.updatedReplicas) then
          old_replicas = obj.status.replicas - obj.status.updatedReplicas
          hc.status = "Progressing"
          hc.message = string.format("Waiting for rollout to finish: %d replicas are pending termination", old_replicas)
          return hc
        end
        if (obj.status.availableReplicas ~= nil and obj.status.availableReplicas < obj.status.updatedReplicas) then
          hc.status = "Progressing"
          hc.message = string.format("Waiting for rollout to finish: %d of %d updated replicas are available...", obj.status.availableReplicas, obj.status.replicas)
          return hc
        end
      end
    else
      hc.status = "Progressing"
      hc.message = "Waiting for rollout to finish: observed deployment generation less than desired generation"
      return hc
    end
    hc.status = "Healthy"
    hc.message = ""
    return hc
flink.apache.org/FlinkDeployment:
  health.lua: |-
    health_status = {}
    if obj.status ~= nil and obj.status.reconciliationStatus ~= nil then
      if obj.status.reconciliationStatus.state == "DEPLOYED" then
        health_status.status = "Healthy"
        return health_status
      end
      if obj.status.jobManagerDeploymentStatus == "DEPLOYED_NOT_READY" or obj.status.jobManagerDeploymentStatus == "DEPLOYING" then
        health_status.status = "Progressing"
        health_status.message = "Waiting for deploying"
        return health_status
      end
      if obj.status.jobManagerDeploymentStatus == "ERROR" then
        health_status.status = "Degraded"
        health_status.message = obj.status.reconciliationStatus.error
        return health_status
      end
    end
    health_status.status = "Progressing"
    health_status.message = "Waiting for Flink operator"
    return health_status
"*.upbound.io/*":
  health.lua: |-
    health_status = {
      status = "Progressing",
      message = "Provisioning ..."
    }
    local function contains (table, val)
      for i, v in ipairs(table) do
        if v == val then
          return true
        end
      end
      return false
    end
    local has_no_status = {
      "ProviderConfig",
      "ProviderConfigUsage"
    }
    if obj.status == nil or next(obj.status) == nil and contains(has_no_status, obj.kind) then
      health_status.status = "Healthy"
      health_status.message = "Resource is up-to-date."
      return health_status
    end
    if obj.status == nil or next(obj.status) == nil or obj.status.conditions == nil then
      if obj.kind == "ProviderConfig" and obj.status.users ~= nil then
        health_status.status = "Healthy"
        health_status.message = "Resource is in use."
        return health_status
      end
      return health_status
    end
    for i, condition in ipairs(obj.status.conditions) do
      if condition.type == "LastAsyncOperation" then
        if condition.status == "False" then
          health_status.status = "Degraded"
          health_status.message = condition.message
          return health_status
        end
      end
      if condition.type == "Synced" then
        if condition.status == "False" then
          health_status.status = "Degraded"
          health_status.message = condition.message
          return health_status
        end
      end
      if condition.type == "Ready" then
        if condition.status == "True" then
          health_status.status = "Healthy"
          health_status.message = "Resource is up-to-date."
          return health_status
        end
      end
    end
    return health_status
"*.crossplane.io/*":
  health.lua: |-
    health_status = {
      status = "Progressing",
      message = "Provisioning ..."
    }
    local function contains (table, val)
      for i, v in ipairs(table) do
        if v == val then
          return true
        end
      end
      return false
    end
    local has_no_status = {
      "Composition",
      "CompositionRevision",
      "DeploymentRuntimeConfig",
      "ControllerConfig",
      "ProviderConfig",
      "ProviderConfigUsage"
    }
    if obj.status == nil or next(obj.status) == nil and contains(has_no_status, obj.kind) then
      health_status.status = "Healthy"
      health_status.message = "Resource is up-to-date."
      return health_status
    end
    if obj.status == nil or next(obj.status) == nil or obj.status.conditions == nil then
      if obj.kind == "ProviderConfig" and obj.status.users ~= nil then
        health_status.status = "Healthy"
        health_status.message = "Resource is in use."
        return health_status
      end
      return health_status
    end
    for i, condition in ipairs(obj.status.conditions) do
      if condition.type == "LastAsyncOperation" then
        if condition.status == "False" then
          health_status.status = "Degraded"
          health_status.message = condition.message
          return health_status
        end
      end
      if condition.type == "Synced" then
        if condition.status == "False" then
          health_status.status = "Degraded"
          health_status.message = condition.message
          return health_status
        end
      end
      if contains({"Ready", "Healthy", "Offered", "Established"}, condition.type) then
        if condition.status == "True" then
          health_status.status = "Healthy"
          health_status.message = "Resource is up-to-date."
          return health_status
        end
      end
    end
    return health_status
"*.aws.mydomain.com/*":
  health.lua: |-
    health_status = {
      status = "Progressing",
      message = "Provisioning ..."
    }
    if obj.status == nil or obj.status.conditions == nil then
      return health_status
    end
    for i, condition in ipairs(obj.status.conditions) do
      if condition.type == "Synced" then
        if condition.status == "False" then
          health_status.status = "Degraded"
          health_status.message = condition.message
          return health_status
        end
      end
      if condition.type == "Ready" then
        if condition.status == "True" then
          health_status.status = "Healthy"
          health_status.message = "Resource is up-to-date."
          return health_status
        else
          health_status.status = "Degraded"
          health_status.message = condition.message
          return health_status
        end
      end
    end
    return health_status
argoproj.io/Workflow:
  health.lua: |2-
    hs = {}
    if obj.metadata.annotations["app.mydomain.com/suppressWorkflowHealthCheck"] == "true" then
      hs.status = "Healthy"
      return hs
    end
    if obj.status ~= nil then
      if obj.status.phase ~= nil then
        if obj.status.phase == "Succeeded" then
          hs.status = "Healthy"
          return hs
        end
        if obj.status.phase == "Failed" then
          hs.status = "Degraded"
          hs.message = obj.status.message
          return hs
        end
      end
    end
    hs.status = "Progressing"
    return hs
cert-manager.io/Certificate:
  health.lua: |-
    hs = {}
    if obj.status ~= nil then
      if obj.status.conditions ~= nil then
        for i, condition in ipairs(obj.status.conditions) do
          if condition.type == "Ready" and condition.status == "False" then
            hs.status = "Degraded"
            hs.message = condition.message
            return hs
          end
          if condition.type == "Ready" and condition.status == "True" then
            hs.status = "Healthy"
            hs.message = condition.message
            return hs
          end
        end
      end
    end
    hs.status = "Progressing"
    hs.message = "Waiting for certificate"
    return hs
health.certmanager.k8s.io/Certificate:
  health.lua: |-
    hs = {}
    if obj.status ~= nil then
      if obj.status.conditions ~= nil then
        for i, condition in ipairs(obj.status.conditions) do
          if condition.type == "Ready" and condition.status == "False" then
            hs.status = "Degraded"
            hs.message = condition.message
            return hs
          end
          if condition.type == "Ready" and condition.status == "True" then
            hs.status = "Healthy"
            hs.message = condition.message
            return hs
          end
        end
      end
    end
    hs.status = "Progressing"
    hs.message = "Waiting for certificate"
    return hs
Here's a graph of the log messages indicating the application controller is going to update the app... you can clearly see where I removed the Lua healthchecks.
@alexmt After disabling the Lua healthchecks, I finally discovered a second, more critical issue:
So the first issue above is that our Lua healthcheck scripts trigger some kind of reconciliation. The second issue is that the controller believes there is a "difference" between the desired and current state of the object. The issue starts in the needRefreshAppStatus() function when it calls currentSourceEqualsSyncedSource(), which then calls app.Spec.Source.Equals(&app.Status.Sync.ComparedTo.Source).
The comparison here is really simple: https://github.com/argoproj/argo-cd/blob/03e6f8ae3f19cdd9ba910bc88d421f519763b287/pkg/apis/application/v1alpha1/types.go#L2807-L2825
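To illustrate why a stale field in comparedTo keeps such a check failing forever, here's a rough, self-contained Go sketch (the structs are made up, not the real v1alpha1 types), assuming the equality ultimately reduces to an exact, reflect.DeepEqual-style comparison:

// Minimal sketch: an Equals built on an exact comparison reports
// "different" indefinitely once a field that was removed from
// spec.source lingers in status.sync.comparedTo.source.
package main

import (
    "fmt"
    "reflect"
)

type HelmParameter struct {
    Name  string
    Value string
}

type HelmSource struct {
    ValueFiles []string
    Parameters []HelmParameter
}

type Source struct {
    RepoURL string
    Path    string
    Helm    *HelmSource
}

// Equals is an exact, field-by-field comparison (illustrative only).
func (s *Source) Equals(other *Source) bool {
    return reflect.DeepEqual(s, other)
}

func main() {
    // spec.source after the user removed the helm parameters.
    spec := &Source{RepoURL: "https://example.com/repo.git", Path: "app",
        Helm: &HelmSource{ValueFiles: []string{"values.yaml"}}}

    // status.sync.comparedTo.source still carrying the old parameters.
    comparedTo := &Source{RepoURL: "https://example.com/repo.git", Path: "app",
        Helm: &HelmSource{
            ValueFiles: []string{"values.yaml"},
            Parameters: []HelmParameter{{Name: "resources.requests.memory", Value: "1Gi"}},
        }}

    // false: the stale Parameters field keeps the sources "different",
    // so the controller keeps deciding the app needs a refresh.
    fmt.Println(spec.Equals(comparedTo))
}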
So I went and looked at the status.sync.comparedTo.source field on the objects, and I think I found the smoking gun (using dyff):
% k get application app -o yaml | yq .status.sync.comparedTo.source > comparedTo.yaml
% k get application app -o yaml | yq .spec.source > source.yaml
% dyff between source.yaml comparedTo.yaml

dyff between source.yaml and comparedTo.v1.yaml
returned one difference

helm
  + two map entries added:
    parameters:
    - name: resources.requests.memory
      value: 1Gi
    - name: resources.requests.cpu
      value: 500m
    valuesObject:
      cluster:
        name: ...
    ...
The smoking gun here is that the .status.sync.comparedTo.source.parameters and .status.sync.comparedTo.source.valuesObject fields do not exist in .spec.source... they were fields we set at one point, but they are no longer there. Yet when the Application gets updated by the argocd-controller, those fields are not being purged.
I sanity-checked a bunch of other applications, and in every single case of an application that is in one of these loops, I found some field that exists in .status.sync.comparedTo.source but does not exist in .spec.source.
I believe that the controller is not properly handling removed fields from an application spec, which then puts the applications into a bad state where the controller constantly sees a "diff".
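One plausible mechanism for the fields never being purged (not confirmed anywhere in this thread, so treat it as an assumption): if the comparedTo block is written back with a merge-style patch, keys that are simply absent from the patch are kept, and only keys explicitly set to null are deleted. A simplified, top-level-only Go sketch of JSON merge-patch (RFC 7386) semantics:

// Sketch of (flat, top-level-only) JSON merge-patch behaviour: keys present
// in the patch overwrite, keys set to nil delete, keys absent are left alone.
package main

import "fmt"

func applyMergePatch(doc, patch map[string]any) map[string]any {
    out := map[string]any{}
    for k, v := range doc {
        out[k] = v
    }
    for k, v := range patch {
        if v == nil {
            delete(out, k) // only an explicit null removes a key
            continue
        }
        out[k] = v
    }
    return out
}

func main() {
    stored := map[string]any{
        "valueFiles": []string{"values.yaml"},
        "parameters": []map[string]string{{"name": "resources.requests.cpu", "value": "500m"}},
    }

    // A patch built only from the fields still present in spec.source.helm:
    // "parameters" is simply absent, so the stale value survives the merge.
    patch := map[string]any{"valueFiles": []string{"values.yaml"}}
    fmt.Println(applyMergePatch(stored, patch))

    // Explicitly nulling the key is what actually removes it.
    fix := map[string]any{"parameters": nil}
    fmt.Println(applyMergePatch(stored, fix))
}

That would also line up with the workarounds later in the thread, which explicitly null or remove the stale field from the status.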
@diranged I was actually working on an issue about this diffing behaviour, because I thought it was possibly a separate problem, but I'll put it in here.
We had migrated a bunch of applications to use helm repositories rather than git, which meant that we replaced the .path in source with a .chart, but after reconciling, the .path field was not unset in .status.sync.comparedTo, which caused a massive amount of reconciles. Every application would get reconciled every 3 seconds.
We patched the status to remove the .path field, which helped a bunch.
We still see some odd behaviour if we use .valuesObject to configure our applications. Instead of reconciling once per reconciliation period, we see one reconciliation and then immediately another. When looking in the log we also see the (spec.source differs) statements.
I looked in the code, and I can see that the valuesObject is a RawExtension backed by a byte array, and since the comparison is a reflect.DeepEqual, anything like ordering or structuring could cause a diff.
I'm not sure why it doesn't spin out of control like in the case of the .path diff, though 🤔
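To illustrate that last point, here is a small sketch using a stand-in struct rather than the real k8s.io/apimachinery RawExtension: a byte-for-byte comparison flags a diff even when the documents are semantically identical, e.g. when only key order differs.

// Sketch: comparing valuesObject-style raw bytes vs. comparing the
// unmarshalled data. Only the struct name is made up; the behaviour of
// reflect.DeepEqual is standard Go.
package main

import (
    "encoding/json"
    "fmt"
    "reflect"
)

type RawExt struct {
    Raw []byte
}

func main() {
    a := RawExt{Raw: []byte(`{"cluster":{"name":"staging-xx"},"replicas":2}`)}
    b := RawExt{Raw: []byte(`{"replicas":2,"cluster":{"name":"staging-xx"}}`)}

    // Byte-for-byte: different, even though the documents mean the same thing.
    fmt.Println(reflect.DeepEqual(a, b)) // false

    // Semantic comparison after unmarshalling: equal.
    var ua, ub map[string]any
    _ = json.Unmarshal(a.Raw, &ua)
    _ = json.Unmarshal(b.Raw, &ub)
    fmt.Println(reflect.DeepEqual(ua, ub)) // true
}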
@PaulFarver I couldn't figure out how to make kubectl patch update the status to remove the old data... can you post how you did that?
@diranged You'll have to use a replace operation with a JSON patch, and set the field to null.
I also ran into a race condition, since I was trying to update the resource while Argo was running wild, so sometimes Argo would overwrite my patch. I would suggest scaling down Argo before patching.
(sorry, didn't mean to close this!)
reflect.DeepEqual considers nil slices/maps not equal to empty slices/maps; I wonder if that is the case here.
https://github.com/argoproj/argo-cd/blob/57965576f01f77ade431b10c225e4509dd77b02b/applicationset/controllers/applicationset_controller.go#L1606
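A quick Go snippet confirming that nil-vs-empty behaviour:

// Demonstrates that reflect.DeepEqual distinguishes nil from empty
// slices and maps, which would be enough to register a "diff".
package main

import (
    "fmt"
    "reflect"
)

func main() {
    var nilSlice []string
    emptySlice := []string{}
    fmt.Println(reflect.DeepEqual(nilSlice, emptySlice)) // false

    var nilMap map[string]string
    emptyMap := map[string]string{}
    fmt.Println(reflect.DeepEqual(nilMap, emptyMap)) // false
}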
@alexmt, @rumstead, What's the right way to get some attention on this from the development team to find a fix?
I would try Slack or join one of the community meetings. GitHub is swamped.
Yup, Slack or come to the office hours. https://github.com/argoproj/argo-cd?tab=readme-ov-file#contribution-discussion-and-support
I think this is similar to an issue we are starting to see - was there any discussion on this outside of GitHub that can be shared?
I believe that the controller is not properly handling removed fields from an application spec, which then puts the applications into a bad state where the controller constantly sees a "diff".
That's my observation as well. I'm seeing that .status.sync.comparedTo.source.directory is sitting around even though it should be deleted.
@alexmt it's a bug in the Application controller, not the ApplicationSet controller.
Is the patch just... backwards? Like, are we diffing the live source and the comparedTo source and just getting the difference, with no indication of whether the field is meant to be removed or added?
Yes, we are seeing excessive CPU usage on the application controller post-upgrade, and a consistently high reconciliation rate.
We have the same problem after upgrading to 2.11.3. We can easily reproduce it using "parameter override" in the UI. We use an app spec like this:
spec:
  ...
  source:
    helm:
      releaseName: something
      valueFiles:
      - ../../values/file1.yaml
    path: charts/istio-something
  ...
After a parameter override, the "parameters" field is added:
source:
  helm:
    parameters:
    - name: 'myparam'
      value: '5001'
    releaseName: something
If you now remove the parameter override in the UI, the "parameters:" section is removed from the app spec, but it remains in the "comparedTo" section --> the ApplicationController enters an infinite reconcile loop :-(
Is there any workaround or fix already known?
Try either 1) setting parameters to an empty array ([]), or 2) re-creating the argocd app resource.
I think you could clear the loop by blowing away the status field and letting Argo CD repopulate it.
Thanks for your ideas. Unfortunately, we will hit the loop again every time our developers override some parameters (in the UI) and later remove them. Setting parameters to an empty array ([]) may be okay, but it is a lot of work and generates tons of changes across all (production) clusters (we are using the "app of apps" pattern), and we would probably need the same for "values", "valueFiles", and other helm: fields, too...
100%, it's not a permanent fix.
The problem seems to be how the controller constructs the status patch, particularly the part of the patch that contains a copy of the source field.
To fix the bug, we'll have to fix how the patch works.
Unfortunately I don't have time to dig in right now. It'll probably involve running a debugger and inspecting the JSON patch in the app controller.
We downgraded to 2.10.9 - now I cannot reproduce the problem anymore. It was still there in 2.10.10, so the assumption that #18061 is responsible is probably correct.
Hit the same issue when changing the destination from server to name. See #19194 for details.
Patching the apps seemed to resolve the issue for me in the end. As a one-liner (please verify before patching anything):
kubectl get app -o=jsonpath='{.items[?(@.status.sync.comparedTo.destination.server)].metadata.name}' | xargs -n1 -I{} kubectl patch app {} --type=json -p='[{"op": "remove", "path": "/status/sync/comparedTo/destination/server"}]'
Hey folks, just chiming in here to say we saw some similar issues on 2.11.7: looping of app reconciliation calls, etc. Downgrading to 2.10.9 has resolved the issue for now, but it would be great to see this fixed :D