
ArgoCD 2.11 - Loop of PATCH calls to Application objects

Open · diranged opened this issue 9 months ago

Is it possible that https://github.com/argoproj/argo-cd/pull/18061 has suddenly changed the behavior of the Application Controller so that it isn't comparing things the same way? We are seeing a massive (relatively speaking) jump in the PATCH requests/second that the ArgoCD App controller is making now across ~10 clusters that we did this upgrade on:

[screenshot: graph of PATCH requests/second across the upgraded clusters]

I suspect the issue has to do with how the code compares the valueFiles lists from before and after. We had some valueFiles lists with duplicate values, and when we run kubectl get applications -w -o json | kubectl grep -w, we see diffs like these:

        revision: "794fcbae2671366663e107cd50079c2e96894591"
        source:
          helm:
            ignoreMissingValueFiles: true
            valueFiles:
              - "values.yaml"
              - "values.staging-xx.yaml"
              - "values.staging-xx.yaml"
              - "values/staging/values.yaml"
              - "values/staging/staging-xx/values.yaml"
              - "values/staging/staging-xx/types/values.otherfile.yaml"
              - "values/staging/staging-xx/values.staging-xx.yaml"
-             - 
+             - "values.staging-xx.yaml"
            values: "
cluster:
  accountId: ".."

Originally posted by @diranged in https://github.com/argoproj/argo-cd/issues/18061#issuecomment-2102964797

diranged avatar May 09 '24 16:05 diranged

We're reverting back to 2.10.9... but this is happening on every cluster we have across the board. We caught this in staging, so ~10 clusters or so.

diranged avatar May 09 '24 16:05 diranged

We just tried the upgrade again, just for fun (no, I actually forgot about the issue and clicked the upgrade button... :facepalm:), and it's reliably reproducible: it happens again right away. Any suggestions on how to troubleshoot it?

diranged avatar May 23 '24 15:05 diranged

Thank you for reporting it, @diranged! Trying to reproduce it and hopefully get a fix out soon.

alexmt avatar Jun 03 '24 17:06 alexmt

We see something similar to this. The patches only contain the status update, with an updated timestamp for reconciledAt. We have not tried rolling back to 2.10 but will probably give it a shot.

We're only seeing this for applications that use helm charts fetched from our helm repository, though. Helm charts fetched from git do not behave this way. Hope that helps

PaulFarver avatar Jun 06 '24 06:06 PaulFarver

So we tried rolling back to 2.10.12, but the issue persisted. I'm guessing our problem is unrelated

PaulFarver avatar Jun 07 '24 10:06 PaulFarver

@PaulFarver 2.10.10 onwards has the commit, can u try 2.10.9?

tooptoop4 avatar Jun 08 '24 04:06 tooptoop4

@PaulFarver 2.10.10 onwards has the commit, can u try 2.10.9?

Ah... my bad. We'll try rolling back to 2.10.9

PaulFarver avatar Jun 10 '24 04:06 PaulFarver

@PaulFarver Did you roll back? @alexmt Any progress? Can I help in any way to reproduce this? This is a consistent problem for us on every cluster (though it seems to be better or worse on different Application objects; I have one right now that updates every single second).

One tip: I noticed that on at least one of our applications, this problem is 10x worse (meaning it updates nearly every 300ms) when I have a Lua healthcheck looking at some custom Crossplane resources we manage. If I remove that healthcheck, we go back to a reconciliation update every ~1-3 seconds.

Here's the example Lua:

      "*.aws.xxx.com/*":
        health.lua: |-
          health_status = {
            status = "Progressing",
            message = "Provisioning ..."
          }
          
          if obj.status == nil or obj.status.conditions == nil then
            return health_status
          end
            
          for i, condition in ipairs(obj.status.conditions) do
            if condition.type == "Synced" then
              if condition.status == "False" then
                health_status.status = "Degraded"
                health_status.message = condition.message
                return health_status 
              end
            end
                
            if condition.type == "Ready" then
              if condition.status == "True" then
                health_status.status = "Healthy"
                health_status.message = "Resource is up-to-date."
                return health_status
              else
                health_status.status = "Degraded"
                health_status.message = condition.message
                return health_status
              end
            end
          end

          return health_status

diranged avatar Jun 14 '24 13:06 diranged

@alexmt,

I've been troubleshooting this for a while now and I found a few issues... I'm still digging, but there is definitely an issue with the Lua healthcheck scripts. When we have them in place, we have updates happening every second and sometimes more often:

+   reconciledAt: "2024-06-14T14:39:50Z"
+   reconciledAt: "2024-06-14T14:39:51Z"
+   reconciledAt: "2024-06-14T14:39:52Z"
+   reconciledAt: "2024-06-14T14:39:53Z"
+   reconciledAt: "2024-06-14T14:39:54Z"
+   reconciledAt: "2024-06-14T14:39:55Z"
+   reconciledAt: "2024-06-14T14:39:56Z"
+   reconciledAt: "2024-06-14T14:39:57Z"
+   reconciledAt: "2024-06-14T14:39:58Z"
+   reconciledAt: "2024-06-14T14:39:59Z"
+   reconciledAt: "2024-06-14T14:40:00Z"
+   reconciledAt: "2024-06-14T14:40:01Z"
+   reconciledAt: "2024-06-14T14:40:02Z"

If I disable the various healthcheck scripts, then the update rate drops dramatically:

+   reconciledAt: "2024-06-14T13:37:54Z"
+   reconciledAt: "2024-06-14T13:38:08Z"
+   reconciledAt: "2024-06-14T13:38:22Z"
+   reconciledAt: "2024-06-14T13:38:43Z"
+   reconciledAt: "2024-06-14T13:38:58Z"

This is not the only issue - but it's certainly contributing significantly.

Here's our current config:

k get cm argocd-cm -n argocd-system -o yaml | yq '.data."resource.customizations"'
apps/Deployment:
  health.lua.useOpenLibs: true
  health.lua: |-
    hc = {}

    if obj.spec.paused ~= nil and obj.spec.paused then
      hc.status = "Suspended"
      hc.message = "Deployment is paused"
      return hc
    end

    if obj.metadata.generation <= obj.status.observedGeneration
    then
      if obj.status.conditions ~= nil and obj.status.replicas ~= nil and obj.status.updatedReplicas ~= nil then

        for _, condition in ipairs(obj.status.conditions) do
          if condition.reason == "ProgressDeadlineExceeded" then
            hc.status = "Degraded"
            hc.message = string.format("Deployment %s has exceeded its progress deadline", obj.metadata.name)
            return hc
          end
        end

        if obj.metadata.annotations["app.mydomain.com/deploymentHealthyPercentage"] ~= nil and obj.status.availableReplicas ~= nil then
          current_deploy = (obj.status.availableReplicas/obj.status.replicas) * 100
          threshold = obj.metadata.annotations["app.mydomain.com/deploymentHealthyPercentage"]
          healthy_threshold = tonumber(threshold)
          if (current_deploy ~= nil and healthy_threshold ~= nil and (current_deploy >= healthy_threshold)) then
            hc.status = "Healthy"
            hc.message = string.format("Healthy deployment threshold met with %d percent available replicas / desired replicas", healthy_threshold)
            return hc
          end
        end

        if obj.spec.replicas ~= nil and (obj.status.updatedReplicas < obj.spec.replicas) then
          hc.status = "Progressing"
          hc.message = string.format("Waiting for rollout to finish: %d out of %d new replicas have been updated",  obj.status.updatedReplicas, obj.status.replicas)
          return hc
        end

        if (obj.status.replicas > obj.status.updatedReplicas) then
          old_replicas = obj.status.replicas - obj.status.updatedReplicas
          hc.status= "Progressing"
          hc.message = string.format("Waiting for rollout to finish: %d replicas are pending termination", old_replicas)
          return hc
        end

        if (obj.status.availableReplicas ~= nil and obj.status.availableReplicas < obj.status.updatedReplicas) then
          hc.status= "Progressing"
          hc.message = string.format("Waiting for rollout to finish: %d of %d updated replicas are available...",  obj.status.availableReplicas, obj.status.replicas)
          return hc
        end

      end
    else
      hc.status = "Progressing"
      hc.message = "Waiting for rollout to finish: observed deployment generation less than desired generation"
      return hc
    end

    hc.status = "Healthy"
    hc.message = ""
    return hc
flink.apache.org/FlinkDeployment:
  health.lua: |-
    health_status = {}

    if obj.status ~= nil and obj.status.reconciliationStatus ~= nil then
      if obj.status.reconciliationStatus.state == "DEPLOYED" then
        health_status.status = "Healthy"
        return health_status
      end

      if obj.status.jobManagerDeploymentStatus == "DEPLOYED_NOT_READY" or obj.status.jobManagerDeploymentStatus == "DEPLOYING" then
        health_status.status = "Progressing"
        health_status.message = "Waiting for deploying"
        return health_status
      end

      if obj.status.jobManagerDeploymentStatus == "ERROR" then
        health_status.status = "Degraded"
        health_status.message = obj.status.reconciliationStatus.error
        return health_status
      end
    end

    health_status.status = "Progressing"
    health_status.message = "Waiting for Flink operator"
    return health_status
"*.upbound.io/*":
  health.lua: |-
    health_status = {
      status = "Progressing",
      message = "Provisioning ..."
    }

    local function contains (table, val)
      for i, v in ipairs(table) do
        if v == val then
          return true
        end
      end
      return false
    end

    local has_no_status = {
      "ProviderConfig",
      "ProviderConfigUsage"
    }

    if obj.status == nil or next(obj.status) == nil and contains(has_no_status, obj.kind) then
      health_status.status = "Healthy"
      health_status.message = "Resource is up-to-date."
      return health_status
    end

    if obj.status == nil or next(obj.status) == nil or obj.status.conditions == nil then
      if obj.kind == "ProviderConfig" and obj.status.users ~= nil then
        health_status.status = "Healthy"
        health_status.message = "Resource is in use."
        return health_status
      end
      return health_status
    end

    for i, condition in ipairs(obj.status.conditions) do
      if condition.type == "LastAsyncOperation" then
        if condition.status == "False" then
          health_status.status = "Degraded"
          health_status.message = condition.message
          return health_status
        end
      end

      if condition.type == "Synced" then
        if condition.status == "False" then
          health_status.status = "Degraded"
          health_status.message = condition.message
          return health_status
        end
      end

      if condition.type == "Ready" then
        if condition.status == "True" then
          health_status.status = "Healthy"
          health_status.message = "Resource is up-to-date."
          return health_status
        end
      end
    end

    return health_status
"*.crossplane.io/*":
  health.lua: |-
    health_status = {
      status = "Progressing",
      message = "Provisioning ..."
    }

    local function contains (table, val)
      for i, v in ipairs(table) do
        if v == val then
          return true
        end
      end
      return false
    end

    local has_no_status = {
      "Composition",
      "CompositionRevision",
      "DeploymentRuntimeConfig",
      "ControllerConfig",
      "ProviderConfig",
      "ProviderConfigUsage"
    }
    if obj.status == nil or next(obj.status) == nil and contains(has_no_status, obj.kind) then
      health_status.status = "Healthy"
      health_status.message = "Resource is up-to-date."
      return health_status
    end

    if obj.status == nil or next(obj.status) == nil or obj.status.conditions == nil then
      if obj.kind == "ProviderConfig" and obj.status.users ~= nil then
        health_status.status = "Healthy"
        health_status.message = "Resource is in use."
        return health_status
      end
      return health_status
    end

    for i, condition in ipairs(obj.status.conditions) do
      if condition.type == "LastAsyncOperation" then
        if condition.status == "False" then
          health_status.status = "Degraded"
          health_status.message = condition.message
          return health_status
        end
      end

      if condition.type == "Synced" then
        if condition.status == "False" then
          health_status.status = "Degraded"
          health_status.message = condition.message
          return health_status
        end
      end

      if contains({"Ready", "Healthy", "Offered", "Established"}, condition.type) then
        if condition.status == "True" then
          health_status.status = "Healthy"
          health_status.message = "Resource is up-to-date."
          return health_status
        end
      end
    end

    return health_status
"*.aws.mydomain.com/*":
  health.lua: |-
    health_status = {
      status = "Progressing",
      message = "Provisioning ..."
    }

    if obj.status == nil or obj.status.conditions == nil then
      return health_status
    end

    for i, condition in ipairs(obj.status.conditions) do
      if condition.type == "Synced" then
        if condition.status == "False" then
          health_status.status = "Degraded"
          health_status.message = condition.message
          return health_status
        end
      end

      if condition.type == "Ready" then
        if condition.status == "True" then
          health_status.status = "Healthy"
          health_status.message = "Resource is up-to-date."
          return health_status
        else
          health_status.status = "Degraded"
          health_status.message = condition.message
          return health_status
        end
      end
    end

    return health_status
argoproj.io/Workflow:
  health.lua: |2-
     hs = {}
     if obj.metadata.annotations["app.mydomain.com/suppressWorkflowHealthCheck"] == "true" then
       hs.status = "Healthy"
       return hs
     end
     if obj.status ~= nil then
       if obj.status.phase ~= nil then
         if obj.status.phase == "Succeeded" then
           hs.status = "Healthy"
           return hs
         end
         if obj.status.phase == "Failed" then
           hs.status = "Degraded"
           hs.message = obj.status.message
           return hs
         end
       end
     end
     hs.status = "Progressing"
     return hs
cert-manager.io/Certificate:
  health.lua: |-
    hs = {}
    if obj.status ~= nil then
      if obj.status.conditions ~= nil then
        for i, condition in ipairs(obj.status.conditions) do
          if condition.type == "Ready" and condition.status == "False" then
            hs.status = "Degraded"
            hs.message = condition.message
            return hs
          end
          if condition.type == "Ready" and condition.status == "True" then
            hs.status = "Healthy"
            hs.message = condition.message
            return hs
          end
        end
      end
    end
    hs.status = "Progressing"
    hs.message = "Waiting for certificate"
    return hs
health.certmanager.k8s.io/Certificate:
  health.lua: |-
    hs = {}
    if obj.status ~= nil then
      if obj.status.conditions ~= nil then
        for i, condition in ipairs(obj.status.conditions) do
          if condition.type == "Ready" and condition.status == "False" then
            hs.status = "Degraded"
            hs.message = condition.message
            return hs
          end
          if condition.type == "Ready" and condition.status == "True" then
            hs.status = "Healthy"
            hs.message = condition.message
            return hs
          end
        end
      end
    end
    hs.status = "Progressing"
    hs.message = "Waiting for certificate"
    return hs

Here's a graph of the log messages indicating the application controller is going to update the app; you can clearly see where I removed the Lua healthchecks: [graph of app-update log message rate]

diranged avatar Jun 14 '24 14:06 diranged

@alexmt After disabling the Lua healthchecks, I finally discovered a second, more critical issue:

[screenshot]

So the first issue above is that we have a Lua healthcheck script triggering some kind of reconciliation; the second issue is that the controller believes there is a "difference" between the desired and current status of the object. The issue starts in the needRefreshAppStatus() function when it calls currentSourceEqualsSyncedSource(), which then calls app.Spec.Source.Equals(&app.Status.Sync.ComparedTo.Source).

The comparison here is really simple https://github.com/argoproj/argo-cd/blob/03e6f8ae3f19cdd9ba910bc88d421f519763b287/pkg/apis/application/v1alpha1/types.go#L2807-L2825

So I went and looked at the status.sync.comparedTo.source field on the objects, and I think I found the smoking gun (using dyff):

% k get application app -o yaml | yq .status.sync.comparedTo.source > comparedTo.yaml
% k get application app -o yaml | yq .spec.source > source.yaml
% dyff between source.yaml comparedTo.yaml
     _        __  __
   _| |_   _ / _|/ _|  between source.yaml
 / _' | | | | |_| |_       and comparedTo.v1.yaml
| (_| | |_| |  _|  _|
 \__,_|\__, |_| |_|   returned one difference
        |___/

helm
  + two map entries added:
    parameters:
    - name: resources.requests.memory
    │ value: 1Gi
    - name: resources.requests.cpu
    │ value: 500m
    valuesObject:
    │ cluster:
    │ │ name: ...
    ...

The smoking gun here is that the object's .status.sync.comparedTo.source.parameters and .status.sync.comparedTo.source.valuesObject fields do not exist in .spec.source... they were fields we set at one point, but they are no longer there. Yet when the Application gets updated by the argocd-controller, those fields are not being purged.

I sanity-checked a bunch of other applications, and in every single case of an application that is in one of these loops, I found some field that exists in .status.sync.comparedTo.source but does not exist in .spec.source.

I believe that the controller is not properly handling removed fields from an application spec, which then puts the applications into a bad state where the controller constantly sees a "diff".
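
To make the failure mode concrete, here is a minimal, self-contained Go sketch (using hypothetical, simplified types, not the actual Argo CD structs) of why a stale field left in comparedTo makes a deep comparison report a difference on every reconcile:

package main

import (
	"fmt"
	"reflect"
)

// Hypothetical, simplified stand-ins for the source types; the real ones
// live in pkg/apis/application/v1alpha1/types.go.
type helmSource struct {
	ValueFiles []string
	Parameters []string
}

type appSource struct {
	RepoURL string
	Helm    helmSource
}

func main() {
	// spec.source after the Helm parameter overrides were removed ...
	spec := appSource{
		RepoURL: "https://git.example.com/repo.git",
		Helm:    helmSource{ValueFiles: []string{"values.yaml"}},
	}
	// ... but status.sync.comparedTo.source still carries the stale field.
	comparedTo := appSource{
		RepoURL: "https://git.example.com/repo.git",
		Helm: helmSource{
			ValueFiles: []string{"values.yaml"},
			Parameters: []string{"resources.requests.cpu=500m"},
		},
	}

	// The deep comparison sees a difference on every reconcile, so the
	// controller keeps refreshing and patching the Application.
	fmt.Println(reflect.DeepEqual(spec, comparedTo)) // false
}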

diranged avatar Jun 14 '24 14:06 diranged

@diranged I was actually working on an issue talking about this diffing issue, because I thought this was possibly a separate issue, but I'll put it in here.

We had migrated a bunch of applications to use helm repositories rather than git, which meant that we replaced the .path in source with a .chart, but after reconciling the .path field was not unset in .status.sync.comparedTo, which caused a massive amount of reconciles. Every application would get reconciled every 3 seconds.

We patched the status to remove the .path field, which helped a bunch.

We still see some odd behaviour if we use .valuesObject to configure our applications. Instead of reconciling once per reconciliation period, we see one reconciliation and then immediately another. When looking at the logs we also see the (spec.source differs) statements.

I looked in the code, and I can see that the valuesObject is a RawExtension backed by a byte array, and since the comparison is a reflect.DeepEqual, anything like ordering or structuring could cause a diff.

I'm not sure why it doesn't spin out of control like in case of the .path diff, though 🤔
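
To illustrate that point with a hedged, stand-alone example (plain byte slices standing in for RawExtension's raw bytes), two encodings of the same logical valuesObject can differ byte-for-byte:

package main

import (
	"fmt"
	"reflect"
)

func main() {
	// Two JSON encodings of the same logical object, differing only in key order.
	a := []byte(`{"cluster":{"accountId":"123","name":"staging"}}`)
	b := []byte(`{"cluster":{"name":"staging","accountId":"123"}}`)

	// A byte-level deep comparison reports a difference even though the
	// semantic content is identical.
	fmt.Println(reflect.DeepEqual(a, b)) // false
}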

PaulFarver avatar Jun 14 '24 16:06 PaulFarver

@PaulFarver I couldn't figure out how to make kubectl patch update the status to remove the old data... can you post how you did that?

diranged avatar Jun 14 '24 18:06 diranged

@diranged You'll have to use a replace operation with a JSON patch, and set the field to null.

I also ran into a race condition, since I was trying to update the resource while Argo was running wild, so sometimes Argo would overwrite my patch.

I would suggest scaling down Argo before patching
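
For example, a sketch of such a patch (the application name, namespace, and field path are placeholders; check which stale field your own Application carries before running anything):

kubectl patch app my-app -n argocd --type=json \
  -p='[{"op": "replace", "path": "/status/sync/comparedTo/source/helm/parameters", "value": null}]'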

PaulFarver avatar Jun 14 '24 20:06 PaulFarver

(sorry, didn't mean to close this!)

diranged avatar Jun 14 '24 20:06 diranged

@diranged I was actually working on an issue talking about this diffing issue, because I thought this was possibly a separate issue, but I'll put it in here.

We had migrated a bunch of applications to use helm repositories rather than git, which meant that we replaced the .path in source with a .chart, but after reconciling the .path field was not unset in .status.sync.comparedTo, which caused a massive amount of reconciles. Every application would get reconciled every 3 seconds.

We patched the status to remove the .path field, which helped a bunch.

We still see some odd behaviour if we use .valuesObject to configure our applications. Instead of reconciling once per reconciliation period, we see one reconciliation and then immediately another. When looking at the logs we also see the (spec.source differs) statements.

I looked in the code, and I can see that the valuesObject is a RawExtension backed by a byte array, and since the comparison is a reflect.DeepEqual, anything like ordering or structuring could cause a diff.

I'm not sure why it doesn't spin out of control like in case of the .path diff, though 🤔

reflect.DeepEqual considers nil slices/maps not equal to empty slices/maps; I wonder if that is the case here.

https://github.com/argoproj/argo-cd/blob/57965576f01f77ade431b10c225e4509dd77b02b/applicationset/controllers/applicationset_controller.go#L1606
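
A quick self-contained check of that behavior:

package main

import (
	"fmt"
	"reflect"
)

func main() {
	var nilSlice []string
	emptySlice := []string{}
	fmt.Println(reflect.DeepEqual(nilSlice, emptySlice)) // false

	var nilMap map[string]string
	emptyMap := map[string]string{}
	fmt.Println(reflect.DeepEqual(nilMap, emptyMap)) // false
}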

rumstead avatar Jun 14 '24 21:06 rumstead

@alexmt, @rumstead, What's the right way to get some attention on this from the development team to find a fix?

diranged avatar Jun 22 '24 09:06 diranged

@alexmt, @rumstead, What's the right way to get some attention on this from the development team to find a fix?

I would try Slack or join one of the community meetings. GitHub is swamped.

rouke-broersma avatar Jun 22 '24 10:06 rouke-broersma

@alexmt, @rumstead, What's the right way to get some attention on this from the development team to find a fix?

Yup, Slack or come to the office hours. https://github.com/argoproj/argo-cd?tab=readme-ov-file#contribution-discussion-and-support

rumstead avatar Jun 22 '24 10:06 rumstead

I think this is similar to an issue we are starting to see. Was there any discussion on this elsewhere, outside of GitHub, that can be shared?

baileydoestech avatar Jul 05 '24 15:07 baileydoestech

I believe that the controller is not properly handling removed fields from an application spec, which then puts the applications into a bad state where the controller constantly sees a "diff".

That's my observation as well. I'm seeing that .status.sync.comparedTo.source.directory is sitting around even though it should be deleted.

crenshaw-dev avatar Jul 05 '24 15:07 crenshaw-dev

@alexmt it's a bug in the Application controller, not the ApplicationSet controller.

Is the patch just... backwards? Like, are we diffing the live source and the comparedTo source and just getting the difference, with no indication of whether the field is meant to be removed or added?

crenshaw-dev avatar Jul 05 '24 16:07 crenshaw-dev

Yes, we are seeing excessive CPU usage on the application controller post-upgrade, and a consistently high reconciliation rate.

baileydoestech avatar Jul 05 '24 16:07 baileydoestech

We have the same problem after upgrading to 2.11.3. We can easily reproduce it using "parameter override" in the UI. We use an app spec like this:

spec:
  ...
  source:
    helm:
      releaseName: something
      valueFiles:
        - ../../values/file1.yaml      
    path: charts/istio-something
    ...

After a parameter override, the field "parameters" is added:

     source:
        helm:
          parameters:
            - name: 'myparam'
              value: '5001'
          releaseName: something

If you now remove the parameter override in the UI, the "parameters:" section is removed from the app spec but remains in the "comparedTo" section, and the Application controller enters an infinite reconcile loop :-(

Is there any known workaround or fix?

Sven1410 avatar Jul 15 '24 08:07 Sven1410

We have the same problem after upgrading to 2.11.3. We can easily reproduce it using "parameter override" in the UI. We use an app spec like this:

spec:
  ...
  source:
    helm:
      releaseName: something
      valueFiles:
        - ../../values/file1.yaml      
    path: charts/istio-something
    ...

After a parameter override, the field "parameters" is added:

     source:
        helm:
          parameters:
            - name: 'myparam'
              value: '5001'
          releaseName: something

If you now remove the parameter override in the UI, the "parameters:" section is removed from the app spec but remains in the "comparedTo" section, and the Application controller enters an infinite reconcile loop :-(

Is there any known workaround or fix?

Try either 1) setting parameters to an empty array ([]), or 2) re-creating the Argo CD app resource.
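
For option 1, based on the example spec above, that would look something like this (the values are placeholders taken from that example):

spec:
  source:
    helm:
      parameters: []
      releaseName: something
      valueFiles:
        - ../../values/file1.yaml
    path: charts/istio-something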

SamuraiPrinciple avatar Jul 15 '24 08:07 SamuraiPrinciple

I think you could clear the loop by blowing away the status field and letting Argo CD repopulate it.
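
One possible way to do that (application name and namespace are placeholders; as noted above, it may help to scale the controller down first so it doesn't immediately overwrite the change):

kubectl patch app my-app -n argocd --type=json -p='[{"op": "remove", "path": "/status"}]'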

crenshaw-dev avatar Jul 15 '24 13:07 crenshaw-dev

Thanks for your ideas. Unfortunately, we will hit the loop again every time our developers override some parameters (in the UI) and remove them later. Setting parameters to an empty array ([]) is maybe OK, but it is a lot of work and generates tons of changes across all (production) clusters (we are using the "app of apps" pattern), and we would probably need that for "values", "valueFiles", and other helm: fields, too...

Sven1410 avatar Jul 15 '24 13:07 Sven1410

100%, it's not a permanent fix.

The problem seems to be how the controller constructs the status patch, particularly the part of the patch that contains a copy of the source field.

To fix the bug, we'll have to fix how the patch works.

Unfortunately I don't have time to dig in right now. It'll probably involve running a debugger and inspecting the JSON patch in the app controller.

crenshaw-dev avatar Jul 15 '24 13:07 crenshaw-dev

We downgraded to 2.10.9, and now I cannot reproduce the problem anymore. It was still there in 2.10.10, so the assumption that #18061 is responsible is probably correct.

Sven1410 avatar Jul 17 '24 18:07 Sven1410

Hit the same issue when changing the destination from server to name. See #19194 for details.

Patching the apps seemed to resolve the issue for me in the end. As a one-liner (please verify before patching anything):

kubectl get app -o=jsonpath='{.items[?(@.status.sync.comparedTo.destination.server)].metadata.name}' | xargs -n1 -I{} kubectl patch app {} --type=json -p='[{"op": "remove", "path": "/status/sync/comparedTo/destination/server"}]'

wrdls avatar Jul 24 '24 19:07 wrdls

Hey folks, just chiming in here to say we saw some similar issues on 2.11.7: looping of app reconciliation calls, etc.

Downgrading to 2.10.9 has resolved the issue for now, but it would be great to see this fixed :D

alexstojda avatar Jul 26 '24 17:07 alexstojda