
Resource tree slow refresh

Open klamkma opened this issue 3 years ago • 40 comments

Hello,

Describe the bug

We have a big Kubernetes cluster with almost 3000 Argo CD applications. We are currently running Argo CD 2.2.2. Since upgrading to version 2 we have noticed that the refresh of the resource tree for applications is much slower. For example: when I click "Restart" for a Deployment, the ReplicaSet appears immediately, but the new Pod sometimes appears only after 40 seconds. I've tried increasing the --status-processors, --operation-processors, and --kubectl-parallelism-limit values for the controller, but it does not help. Any idea what we could do? Which component is responsible for this refresh, is it argocd-server?
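
For reference, those controller flags are usually set through the argocd-cmd-params-cm ConfigMap rather than by editing the controller manifest directly. A minimal sketch, assuming the standard key names; the values are only illustrative, not recommendations:

apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cmd-params-cm
  namespace: argocd
data:
  # Equivalent to --status-processors / --operation-processors on the controller
  controller.status.processors: "50"
  controller.operation.processors: "25"
  # Equivalent to --kubectl-parallelism-limit
  controller.kubectl.parallelism.limit: "20"

The application controller has to be restarted to pick up changes to this ConfigMap.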

To Reproduce

I click "Restart" for a Deployment. The ReplicaSet appears immediately, but the new Pod sometimes appears only after 40 seconds.

Expected behavior

Pods should appear faster.

Version

argocd: v2.2.2+03b17e0
  BuildDate: 2022-01-01T06:27:52Z
  GitCommit: 03b17e0233e64787ffb5fcf65c740cc2a20822ba
  GitTreeState: clean
  GoVersion: go1.16.11
  Compiler: gc
  Platform: linux/amd64

Thank you.

klamkma avatar Jan 13 '22 20:01 klamkma

Same here. Previously we had 4800+ applications and Argo CD handled them pretty well, although with some slowness in the application listing. After some re-org we now have 3000+ applications. However, since upgrading to v2, refresh and sync have become very, very slow. The refresh action, which used to finish in a few seconds, can run up to 2 minutes. Waiting for sync is even slower. Compared to the previous experience, I believe there is a lot of room for performance tuning and improvement.

yydzhou avatar Feb 02 '22 19:02 yydzhou

It is really difficult to troubleshoot this remotely. The controller might be CPU throttled, the repo server might need to be scaled up, or the control plane Kubernetes API server might be slow.
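
For anyone checking the same suspects, a rough sketch of what the first two checks can translate to, assuming the standard manifests; the replica count and resource figures below are purely illustrative, not recommendations:

# Partial manifests, illustrative only
apiVersion: apps/v1
kind: Deployment
metadata:
  name: argocd-repo-server
  namespace: argocd
spec:
  replicas: 3   # scale manifest generation horizontally
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: argocd-application-controller
  namespace: argocd
spec:
  template:
    spec:
      containers:
        - name: argocd-application-controller
          resources:
            requests:
              cpu: "4"
              memory: 8Gi
            # deliberately no CPU limit here, so the controller cannot be throttled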

@klamkma, @yydzhou, if possible could we have an interactive session (e.g. a Zoom call) and debug it together? Afterwards we could document the changes we've made to help anyone else who faces this issue.

alexmt avatar Feb 03 '22 06:02 alexmt

Thank you @alexmt. It would be great to have a debug session together with @yydzhou.

yeya24 avatar Feb 03 '22 07:02 yeya24

Hello, I'm available for a session too. Thank you @alexmt.

klamkma avatar Feb 23 '22 10:02 klamkma

Hi again,

I enabled ARGOCD_ENABLE_GRPC_TIME_HISTOGRAM. Could you give me some tips on how to use it to investigate performance issues?
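
For context, that variable is typically set as an environment variable on the Argo CD server and repo server workloads, and once enabled it should expose gRPC handling-time histograms (e.g. grpc_server_handling_seconds) on their Prometheus metrics endpoints, so slow RPCs show up in the histogram buckets. A minimal sketch, assuming the standard Deployment name:

# Partial Deployment spec, illustrative only
apiVersion: apps/v1
kind: Deployment
metadata:
  name: argocd-server
  namespace: argocd
spec:
  template:
    spec:
      containers:
        - name: argocd-server
          env:
            - name: ARGOCD_ENABLE_GRPC_TIME_HISTOGRAM
              value: "true"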

Thank you

klamkma avatar Mar 16 '22 07:03 klamkma

Any update on this? We are experiencing the same issue. It may be a duplicate of this issue.

There is enough RAM, CPU, and disk space, and we tried multiplying the number of replicas of the controller and server pods by 4 just to see if it would help, but it did not help at all.

leotomas837 avatar Mar 15 '23 16:03 leotomas837

I have the same problem: 2.5k apps, Helm, Argo CD v2.6.6, one very big cluster (HML). I can't see any problem like throttling, OOMs, or resource starvation. I did all the recommended tuning for high performance, and Argo CD has a pool of big nodes dedicated just to it. Tomorrow I will try to debug the Kubernetes cluster to see if the control plane is OK.

jujubetsz avatar Mar 31 '23 06:03 jujubetsz

Is there any solution for this? We have somewhere around 6000 applications and the Argo CD version is 2.7.2.

  1. Sync and refresh are very slow.
  2. Restarting or deleting ReplicaSets or Deployments doesn't show up in the Argo CD UI.
  3. Whenever you delete a Deployment, Pod, or ReplicaSet, the Argo CD UI always says it doesn't exist.

AnubhavSabarwal avatar May 24 '23 07:05 AnubhavSabarwal

Hi, for us enabling --enable-gzip gave a huge improvement in the UI, but pod refresh is still very slow.
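
In case it helps others, --enable-gzip maps to a key in the argocd-cmd-params-cm ConfigMap; a sketch, assuming the standard key name:

apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cmd-params-cm
  namespace: argocd
data:
  # Equivalent to passing --enable-gzip to argocd-server
  server.enable.gzip: "true"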

klamkma avatar May 24 '23 09:05 klamkma

Any news? We have the same problem with an even smaller setup of about 1000 apps and 6 clusters. I think it might be related to the fact that we have about 5 or 6 plugins, but that's not a huge cluster. Any thoughts?

evs-ops avatar Jan 08 '24 15:01 evs-ops

@evs-ops, Hi.

I've tried every possible tuning option and version of Argo CD and got no improvements. Since my cluster is running in OpenStack/Rancher inside my company cloud, I'm now improving the cluster itself: upgrading the Kubernetes version, etcd performance, etc. I'm doing this because I'm seeing lots of timeouts to Kubernetes in the application controller, and also because none of the tuning worked. Logs:

time="2024-01-08T15:31:23Z" level=info msg="Failed to watch Deployment.apps on https://x.x.x.x:443: Resyncing Deployment.apps on https://x.x.x.x:443 due to timeout, retrying in 1s" server="https://kubernetes.default.svc"
time="2024-01-08T15:34:15Z" level=info msg="Failed to watch Secret on https://x.x.x.x:443: Resyncing Secret on https://x.x.x.x:443 due to timeout, retrying in 1s" server="https://kubernetes.default.svc"
time="2024-01-08T15:35:20Z" level=info msg="Failed to watch ReplicationController on https://x.x.x.x:443: Resyncing ReplicationController on https://x.x.x.x:443 due to timeout, retrying in 1s" server="https://kubernetes.default.svc"

The symptoms I'm experiencing are:

Navigation in the Argo CD web UI is as fast as expected, but if I delete a pod, for example, nothing happens. The box with that pod persists in the Argo CD frontend, yet if I watch the namespace with kubectl the pod is being killed and a new pod is being scheduled. After several minutes (10-15m) the new pod shows up in the Argo CD frontend. This happens with every object owned by Argo CD.

jujubetsz avatar Jan 08 '24 15:01 jujubetsz

Hi, very similar to my problem. If I delete something it probably takes about 5 to 10 minutes to show, and a refresh takes no less than 2 minutes and up to 5. It's new to me, since in my previous roles I used Argo and it was lightning fast :(

evs-ops avatar Jan 10 '24 11:01 evs-ops

Hi,

Having more clusters to manage is not a bad thing, from my point of view. You can have one replica of the application controller for each cluster. Here are some docs and posts that may help you:

https://www.infracloud.io/blogs/sharding-clusters-across-argo-cd-application-controller-replicas/
https://argo-cd.readthedocs.io/en/stable/operator-manual/high_availability/#argocd-application-controller
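
For reference, a minimal sketch of the sharding setup those links describe, assuming the standard argocd-application-controller StatefulSet; the replica count is illustrative and must match ARGOCD_CONTROLLER_REPLICAS:

# Partial StatefulSet spec, illustrative only
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: argocd-application-controller
  namespace: argocd
spec:
  replicas: 3   # one shard per replica; managed clusters are distributed across shards
  template:
    spec:
      containers:
        - name: argocd-application-controller
          env:
            - name: ARGOCD_CONTROLLER_REPLICAS
              value: "3"   # must match spec.replicas

Note that sharding distributes clusters, not applications, so a single huge cluster still ends up on one shard.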

Did you try that?

Another question: are your clusters managed (GKE, EKS, etc.) or, like mine, self-deployed and managed?

jujubetsz avatar Jan 10 '24 11:01 jujubetsz

@evs-ops,

I bumped my version to v2.10.0-rc4 in order to test the jitter implementation for reconciliation. You can check the proposal and description in issues/14241.

The results so far are incredible: no delay at all in the Argo CD UI. If I delete a pod, the new pod appears instantly, so I recommend trying it if possible. I bumped the version this morning and have had a stable environment so far. I will update this thread if something new happens.

Some general info about my environment:

  • 2.7k apps
  • Lots of monorepos; each team/tribe has one, ranging from 10 to 200 apps
  • Only one cluster
  • Kubernetes v1.25 running on OpenStack/Rancher in a private cloud
  • Argo CD components have tons of resources to use
  • Reconciliation timeout: 600s
  • Reconciliation jitter: 180s (see the sketch below)
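
For anyone who wants to try the same settings, the reconciliation timeout and jitter are keys in the argocd-cm ConfigMap (the jitter key is available from v2.10); a sketch mirroring the values above:

apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
data:
  # How often applications are re-reconciled against Git when no change event arrives
  timeout.reconciliation: 600s
  # Random jitter added on top, to spread reconciliations out and avoid spikes
  timeout.reconciliation.jitter: 180s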

[Screenshots attached: git requests, total CPU usage, total network, reconciliation activity, cluster events, reconciliation performance]

jujubetsz avatar Jan 26 '24 16:01 jujubetsz

Have you found the reason?

machine3 avatar Feb 04 '24 03:02 machine3

+1

When running 3,000 applications and engaging in activities such as syncing 200 applications, clicking "Restart" for a deployment immediately displays the ReplicaSet, but new pods may take up to two minutes to appear.

ritheshgm avatar Feb 14 '24 23:02 ritheshgm

Does anyone have any ideas for solving the problem, or a temporary solution?

machine3 avatar Feb 21 '24 01:02 machine3

Same problem here. The refresh is very slow (~3-5 minutes) per Application, even with version 2.10.2. One Git repo (monorepo) with ~50 applications, and the argocd-vault-plugin CMP deployed as a sidecar.

Tried many things, but nothing helps atm

FuturesTr4der avatar Mar 08 '24 16:03 FuturesTr4der

When running 3,000 applications and engaging in activities such as syncing 200 applications, clicking "Restart" for a deployment immediately displays the ReplicaSet, but new pods may take up to two minutes to appear.

+1

gazidizdaroglu avatar Jun 25 '24 19:06 gazidizdaroglu

We are encountering a similar issue. In large clusters where Argo CD monitors numerous resources, it is significantly slow in processing watches—taking approximately 7 minutes in our case. Consequently, the Argo CD UI displays outdated information and adversely affects several functionalities that depend on sync waves, such as PruneLast. Eventually, the volume of events from the cluster overwhelmed the system, causing Argo CD to stall completely.

To mitigate this, we disabled tracking of Pods and ReplicaSets, which unfortunately diminishes one of the primary advantages of the Argo CD UI. We also disregarded all irrelevant events and attempted to optimize various settings in the application controller. However, scaling the application controller vertically showed no effect, and horizontal scaling is not feasible for a single cluster due to sharding constraints.
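
For anyone weighing the same trade-off, disabling tracking of Pods and ReplicaSets is typically done with resource.exclusions in the argocd-cm ConfigMap; a sketch (the cluster and API group filters here are assumptions, adjust as needed):

apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
data:
  # Stop watching (and showing) Pods and ReplicaSets in the resource tree
  resource.exclusions: |
    - apiGroups:
        - ""
      kinds:
        - Pod
      clusters:
        - "*"
    - apiGroups:
        - apps
      kinds:
        - ReplicaSet
      clusters:
        - "*"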

daftping avatar Jun 25 '24 20:06 daftping

We have removed all Argo CD config management plugins (switched from argocd-vault-plugin to vault-secrets-webhook) and now everything seems to work smoothly.

FuturesTr4der avatar Jun 26 '24 09:06 FuturesTr4der

Hey, this thread can help you as well!

https://cloud-native.slack.com/archives/C01TSERG0KZ/p1721141931660909

gazidizdaroglu avatar Jul 19 '24 06:07 gazidizdaroglu

Hey, this thread can help you as well!

https://cloud-native.slack.com/archives/C01TSERG0KZ/p1721141931660909

I'm sorry, I can't access the link you provided. Could you please share some details with me?

machine3 avatar Jul 22 '24 00:07 machine3

We are encountering a similar issue. In large clusters where Argo CD monitors numerous resources, it is significantly slow in processing watches—taking approximately 7 minutes in our case. Consequently, the Argo CD UI displays outdated information and adversely affects several functionalities that depend on sync waves, such as PruneLast. Eventually, the volume of events from the cluster overwhelmed the system, causing Argo CD to stall completely.

To mitigate this, we disabled tracking of Pods and ReplicaSets, which unfortunately diminishes one of the primary advantages of the Argo CD UI. We also disregarded all irrelevant events and attempted to optimize various settings in the application controller. However, scaling the application controller vertically showed no effect, and horizontal scaling is not feasible for a single cluster due to sharding constraints.

We are observing precisely the same issue you described. ArgoCD v2.10.9. @daftping, did you find a way to resolve the issue without disabling tracking pods and replica sets?

mpelekh avatar Aug 09 '24 09:08 mpelekh

The fix is on master and will be part of v2.13. It optimizes the resource tree DFS from O(tree_size * namespace_resource_count) to O(namespace_resource_count).

andrii-korotkov-verkada avatar Aug 09 '24 13:08 andrii-korotkov-verkada

Hi @andrii-korotkov-verkada, thanks for replying. Do you mean the following fixes?

  • https://github.com/argoproj/gitops-engine/commit/6b2984ebc47085852a7b63a0fd0b73c52e986217
  • https://github.com/argoproj/argo-cd/commit/267f243a899483fea0a4e6a613c18f62bd342c7e

Thanks for your contribution. IterateHierarchyV2 looks promising.

I actually patched v2.10.9 with the above commits. It helped, but did not solve the problem entirely.

Even though the patches significantly improve performance, Argo CD still cannot handle the load from large clusters.

In the screenshots, you can see one of the largest clusters. Here, the v2.10.9 build patched with the above commits is running.

  • till 12:50, pods and replica sets are disabled from tracking
  • from 12:50 to 13:34, pods and replica sets are enabled to be tracked
  • after 13:34, pods and replica sets are disabled from tracking

As can be seen, once pods and rs are enabled to be tracked, the cluster event count falls close to zero, and reconciliation time increases drastically.

[Screenshots attached: 2024-08-09 20:40:44 and 2024-08-09 20:51:00]

Number of pods in the cluster: ~76k
Number of ReplicaSets in the cluster: ~52k

@andrii-korotkov-verkada Do you have any ideas on what can be improved?

mpelekh avatar Aug 09 '24 17:08 mpelekh

Are you hitting CPU throttling?

crenshaw-dev avatar Aug 09 '24 17:08 crenshaw-dev

@crenshaw-dev No, we don't set CPU limits at all and still have plenty of resources on the node.

We found that the potential reason is lock contention.

Here, I added a few more metrics and found out that when the number of events is significant, sometimes it takes ~5 minutes to acquire a lock, which leads to a delay in reconciliation. https://github.com/mpelekh/gitops-engine/commit/560ef00bcce9201083200f906f15bf1716fbfcc0#diff-9c9e197d543705f08c9b1bc2dc404a55506cfc2935a988e6007d248257aadb1aR1372

[Screenshot attached: 2024-08-09 21:11:33]

NOTE: the metrics above were obtained on v2.10.9 patched with the following commits:

  • https://github.com/argoproj/gitops-engine/commit/6b2984ebc47085852a7b63a0fd0b73c52e986217
  • https://github.com/argoproj/argo-cd/commit/267f243a899483fea0a4e6a613c18f62bd342c7e

mpelekh avatar Aug 09 '24 18:08 mpelekh

I made an attempt at this in https://github.com/argoproj/gitops-engine/issues/602, but the benchmark showed neutral-to-regression results in terms of throughput. Maybe average latency can still get better, though, I don't know.

andrii-korotkov-verkada avatar Aug 09 '24 18:08 andrii-korotkov-verkada

I'm curious how much of a performance win you saw from just IterateHierarchy, @mpelekh. Those changes are mostly useful for situations where you have a ton of resources in a single namespace.

Am also super curious if Andrii's locking improvements help with this. If so, that's a strong case for merging those changes.

crenshaw-dev avatar Aug 09 '24 18:08 crenshaw-dev