Very slow performance of newest argocd versions - plugin + monorepo
Checklist:
- [x] I've searched in the docs and FAQ for my answer: https://bit.ly/argocd-faq.
- [x] I've included steps to reproduce the bug.
- [x] I've pasted the output of argocd version.
Describe the bug
Unfortunately, upgrading to the newest versions of ArgoCD (with the cmp migrated to a sidecar) resulted in an order of magnitude degradation of refresh/sync/deployment speeds.
Setup
Monorepo with c. 50 Applications, all defined under a single sub-dir, written in jsonnet + tanka. An Application in our realm is a microservice (deployment + a few helper manifests), nothing extraordinary. We run repo-server with --plugin-tar-exclude set to .git/*.
Bug / Observations
- It takes approximately 40 minutes in total (with some parallelism) to invoke app set (to specify image_tag, a plugin variable) and then app sync for all Applications. It takes c. 3-5 minutes per Application. In contrast, running a diff and applying manually (tk apply . --ext-str=image_tag=MY_TAG) takes a fraction of this time (c. 5 seconds per Application).
- It takes approximately 30-60 seconds to invoke a refresh operation on a single Application (via UI).
- It takes approximately 20-50 seconds to open the Parameters page on a single Application Details view (UI). This is quite interesting, as showing the Diff tab is (usually) almost instantaneous.
- The repo server's CPU usage spikes significantly during the deployment.
Normal operation:
NAME                                          CPU(cores)   MEMORY(bytes)
argo-cd-argocd-application-controller-0       26m          868Mi
argo-cd-argocd-repo-server-77f98c748c-8w7z9   2m           705Mi
During a manually-triggered update (app set) + sync:
NAME                                          CPU(cores)   MEMORY(bytes)
argo-cd-argocd-application-controller-0       456m         930Mi
argo-cd-argocd-repo-server-77f98c748c-8w7z9   5220m        769Mi
Potential Solutions
We see some hope in (and look forward to) the following potential solutions (naturally, it is difficult to gauge a priori to what extent any of them would resolve the issues observed):
- Reintroducing the previous way of configuring CMPs (i.e. not via sidecars), perhaps as an alternative, leaving the choice (including any potential security implications) up to the users.
- Introducing --plugin-tar-include (i.e. include-only manifests dir) - probably of limited benefit, since we are already passing --plugin-tar-exclude, which excludes the bulk of the repo.
- Supporting sparse and/or limited depth checkout (I note the already started, and much appreciated, albeit potentially put on hold, work in https://github.com/argoproj/argo-cd/pull/16064 and https://github.com/argoproj/argo-cd/pull/14272)
- Independent optimisation / debugging of the Parameters tab (surely, merely viewing the existing inputs should not be very time/resource consuming, perhaps indicating some unnecessary steps in the current implementation?)
- ... probably many more, which I cannot think of immediately.
To Reproduce
Store manifests of 50-100 Applications in a mono-repo, together with other code (Go, TS, etc.). Use tanka to apply them, configured via an argocd sidecar plugin (example plugin configuration here).
Expected behavior
A simple update of minimal changes should be relatively fast (taking a bit longer than a manual application, but not 10-50x longer).
Version
argocd-server: v2.10.6+d504d2b
I don't think we have enough information to really brainstorm solutions yet.
The spiked repo-server CPU usage is a good hint. Do you know if any Argo CD component is hitting its CPU limits at any time, i.e. being throttled?
I appreciate I owe you a proper analysis. My apologies - I still haven't had a chance to set up Prometheus etc. to scrape the metrics / traces, set up pprof, etc.
Just to answer your previous question: no throttling is being observed; the nodes are fairly big, with plenty of RAM and CPU headroom (under normal circumstances).
In the meantime, I wanted to share the following, in case useful and perhaps symptomatic of other issues listed above.
When navigating to Application -> Details -> Params tab (which, in theory, should only show two variables as inputs), Argo CD times out (and does not load the tab).
The following three log lines can be observed (note the 35+ seconds duration before the kill):
repo-server (container: repo-server)
"jsonPayload": {
"error": "failed to populate plugin app details: error sending file to cmp-server: error sending generate manifest metadata to cmp-server: EOF",
"grpc.method": "GetAppDetails",
"system": "grpc",
"level": "error",
"msg": "finished unary call with code Unknown",
"span.kind": "server",
"grpc.code": "Unknown",
"grpc.start_time": "2024-05-16T12:06:25Z",
"grpc.time_ms": 35504.46,
"grpc.service": "repository.RepoServerService"
},
repo-server (container: tanka)
"jsonPayload": {
"grpc.code": "Canceled",
"level": "info",
"system": "grpc",
"msg": "finished streaming call with code Canceled",
"span.kind": "server",
"grpc.method": "GetParametersAnnouncement",
"error": "parameters announcement error receiving stream: error receiving stream header: rpc error: code = Canceled desc = context canceled",
"grpc.time_ms": 8257.146,
"grpc.service": "plugin.ConfigManagementPluginService",
"grpc.start_time": "2024-05-16T12:06:47Z"
},
repo-server (container: tanka)
"jsonPayload": {
"span.kind": "server",
"level": "info",
"grpc.service": "plugin.ConfigManagementPluginService",
"grpc.method": "MatchRepository",
"system": "grpc",
"grpc.time_ms": 20714.465,
"grpc.code": "OK",
"msg": "finished streaming call with code OK",
"grpc.start_time": "2024-05-16T12:06:26Z"
},
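To make the ordering of these three calls easier to see, here is a minimal Python sketch that rebuilds an approximate timeline from the grpc.start_time and grpc.time_ms fields. The values are copied from the log entries above; the function names (timeline, share_of) are made up for this illustration, and since the start times only have one-second resolution, the computed end times are approximate.

```python
from datetime import datetime, timedelta

# Values copied from the three log entries above.
calls = [
    {"method": "GetAppDetails", "start": "2024-05-16T12:06:25Z",
     "time_ms": 35504.46, "code": "Unknown"},
    {"method": "MatchRepository", "start": "2024-05-16T12:06:26Z",
     "time_ms": 20714.465, "code": "OK"},
    {"method": "GetParametersAnnouncement", "start": "2024-05-16T12:06:47Z",
     "time_ms": 8257.146, "code": "Canceled"},
]


def timeline(calls):
    """Return (method, start, end, code) tuples ordered by start time."""
    rows = []
    for call in sorted(calls, key=lambda c: c["start"]):
        start = datetime.strptime(call["start"], "%Y-%m-%dT%H:%M:%SZ")
        end = start + timedelta(milliseconds=call["time_ms"])
        rows.append((call["method"], start, end, call["code"]))
    return rows


def share_of(calls, inner, outer):
    """Fraction of the outer call's duration taken up by the inner call."""
    by_method = {c["method"]: c["time_ms"] for c in calls}
    return by_method[inner] / by_method[outer]


for method, start, end, code in timeline(calls):
    print(f"{method:28} {start:%H:%M:%S} -> {end:%H:%M:%S}  ({code})")
```

Read this way, MatchRepository alone accounts for well over half of the GetAppDetails duration, and GetParametersAnnouncement only starts once MatchRepository has finished.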
In case useful:
- total repo size: 320MB (inc. 216M for .git)
- kubernetes folder size (all manifests + libraries): 27M
- single application folder (located in kubernetes/.../someapp) size: 20K
repo-server is run with the following args (the exclusions list all folders other than our kubernetes folder):
containers:
- args:
  - /usr/local/bin/argocd-repo-server
  - --port=8081
  - --metrics-port=8084
  - --plugin-tar-exclude=".git/*"
  - --plugin-tar-exclude="assets/*"
  - --plugin-tar-exclude="bin/*"
  - --plugin-tar-exclude="build/*"
  - --plugin-tar-exclude="docs/*"
  - --plugin-tar-exclude="go/*"
  - --plugin-tar-exclude="js/*"
  - --plugin-tar-exclude="proto/*"
  - --parallelismlimit=10
All applications are generated from AppSets looking at kubernetes/.../mycluster etc.
For the record - running tk show . --ext-str=image_tag=sometag for a given Application (which is what the CMP does for generate; init is just echo "tanka plugin init") takes c. 153ms (I've just checked). All jsonnet libraries are vendored (hence the 27MB size), so nothing is being downloaded on the fly.
It feels suspicious that: (a) sending across 27MB would take over 30 seconds and time out; (b) it is necessary to resolve the manifests to just show the Params (often used e.g. to quickly check the image_tag provided to argo-cd etc.).
(Our main problem is the overall sync-up of 30-50 Applications on deployment, which can take 30min+ just to update image tags, but... perhaps the above issue, which is also quite troublesome, is related).
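One way to check locally whether packing the repository could plausibly take tens of seconds is to gzip-tar the checkout with the same exclusions and time it. The sketch below is only an approximation under stated assumptions: it does not use Argo CD's code, the function name pack_size is invented here, and Python's fnmatch globbing (where * also matches "/") may not match the semantics of --plugin-tar-exclude exactly.

```python
import fnmatch
import io
import os
import tarfile
import time


def pack_size(repo_root, exclude_globs):
    """Gzip-tar repo_root in memory, skipping files whose repo-relative
    path matches any of exclude_globs (fnmatch semantics, an assumption).

    Returns (compressed_bytes, file_count, elapsed_seconds).
    """
    buf = io.BytesIO()
    file_count = 0
    started = time.perf_counter()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for dirpath, _dirnames, filenames in os.walk(repo_root):
            for name in filenames:
                full = os.path.join(dirpath, name)
                rel = os.path.relpath(full, repo_root).replace(os.sep, "/")
                if any(fnmatch.fnmatch(rel, glob) for glob in exclude_globs):
                    continue
                tar.add(full, arcname=rel, recursive=False)
                file_count += 1
    elapsed = time.perf_counter() - started
    return len(buf.getvalue()), file_count, elapsed
```

Calling pack_size("/path/to/checkout", [".git/*", "assets/*", "bin/*"]) with the real exclusion list gives a rough compressed size and packing time. If both are small, the time is more likely spent elsewhere (for example in repeated transfers per call) than in compressing a single copy of the files.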
Regarding sparse/shallow checkout, let's consolidate that part of the conversation here: https://github.com/argoproj/argo-cd/issues/11198
I think performance will be improved by #18053 (to be released in v2.13), where we can skip MatchRepo if we set pluginName explicitly.
I noticed that MatchRepo takes a very long time for large monorepos.
Hi All,
I am experiencing the same issues in ArgoCD 2.12.4. I have an Application with an explicitly set plugin name and a plugin without a discover action. I see in the logs significant time spent in the MatchRepository call. How can this be improved? Thanks
time="2025-01-17T09:57:49Z" level=info msg="finished streaming call with code OK" grpc.code=OK grpc.method=MatchRepository grpc.service=plugin.ConfigManagementPluginService grpc.start_time="2025-01-17T09:57:37Z" grpc.time_ms=11753.513 span.kind=server system=grpc
time="2025-01-17T09:58:01Z" level=info msg="Generating manifests with no request-level timeout"
time="2025-01-17T09:58:01Z" level=info msg="avp-helm init" dir=/tmp/_cmp_server/79dac6cf-02a3-4e00-a51c-34cbfcdaa799/gen/charts/test execID=19416
time="2025-01-17T09:58:04Z" level=warning msg="Plugin command returned zero output" command="{[avp-helm init] []}" execID=19416 stderr=
time="2025-01-17T09:58:04Z" level=info msg="bash -c "eval $(avp-helm generate)"" dir=/tmp/_cmp_server/79dac6cf-02a3-4e00-a51c-34cbfcdaa799/gen/charts/test execID=6811e
time="2025-01-17T09:58:04Z" level=info msg="Plugin command successful" command="{[bash -c eval $(avp-helm generate)] []}" execID=6811e stderr=
time="2025-01-17T09:58:05Z" level=info msg="finished streaming call with code OK" grpc.code=OK grpc.method=GenerateManifest grpc.service=plugin.ConfigManagementPluginService grpc.start_time="2025-01-17T09:57:49Z" grpc.time_ms=15430.835 span.kind=server system=grpc
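For anyone comparing similar logs, a small Python sketch like the following can pull the per-call durations out of cmp-server log output. It is a rough helper, not part of Argo CD: the function name grpc_durations is invented here, the sample lines are abbreviated copies of the entries above, and a regex is used instead of a proper logfmt parser because some msg values contain nested quotes.

```python
import re

# Matches "grpc.method=<name> ... grpc.time_ms=<number>" within one entry.
GRPC_CALL = re.compile(r"grpc\.method=(\S+).*?grpc\.time_ms=([\d.]+)")


def grpc_durations(log_text):
    """Map each gRPC method found in log_text to its duration in ms.

    If a method appears more than once, the last occurrence wins.
    """
    return {m.group(1): float(m.group(2)) for m in GRPC_CALL.finditer(log_text)}


# Abbreviated copies of the two "finished streaming call" entries above.
sample = (
    'time="2025-01-17T09:57:49Z" level=info grpc.code=OK '
    'grpc.method=MatchRepository grpc.time_ms=11753.513 system=grpc\n'
    'time="2025-01-17T09:58:05Z" level=info grpc.code=OK '
    'grpc.method=GenerateManifest grpc.time_ms=15430.835 system=grpc\n'
)

durations = grpc_durations(sample)
for method, ms in durations.items():
    print(f"{method}: {ms / 1000:.1f}s")
```

Going by the timestamps in the log, GenerateManifest started at 09:57:49 but the first plugin command was only logged at 09:58:01, so roughly 12 of its 15 seconds appear to pass before the plugin commands run at all. That would be consistent with time spent receiving files, though the log alone does not prove it.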
Simply upgrade to v2.13 and specify the plugin name explicitly! It works perfectly for us. MatchRepository is gone.
Is there any chance of the fix being backported to 2.12?
Hi! Maybe this issue helps too: https://github.com/argoproj/argo-cd/issues/17951 (available since 2.14-rc2).
@DimitarKapashikov since it's a major feature/change, we won't be backporting #17951.
The change in 2.14 provides a mode to have the repo-server respect the manifest-generate-paths annotation when transferring data to the plugin. So this will help if:
- The main bottleneck is monorepo size - too much data is being transferred from the repo-server to the plugin (this happens as a gzipped grpc stream over a localhost websocket)
- You have configured accurate manifest-generate-paths annotations on the apps using the plugin
- The selected paths represent much less disk space than the full repo would (this can be difficult when using Kustomize, which often requires targeting a large subset of repo files)
- You have upgraded to 2.14
- You have enabled the feature
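To get a feel for the third condition, the following Python sketch estimates what fraction of a checkout a given manifest-generate-paths value would select. It rests on assumptions that should be checked against the Argo CD documentation: that paths in the annotation are separated by ";", that relative paths are resolved against the application's source path, and that paths starting with "/" are resolved against the repo root. The function names are invented for this example.

```python
import os
import posixpath


def resolve_paths(annotation, app_path):
    """Turn an annotation value into repo-relative directory paths.

    Assumed semantics: ';'-separated entries; a leading '/' means
    relative to the repo root, anything else is relative to app_path.
    """
    resolved = []
    for entry in annotation.split(";"):
        entry = entry.strip()
        if not entry:
            continue
        if entry.startswith("/"):
            resolved.append(posixpath.normpath(entry.lstrip("/") or "."))
        else:
            resolved.append(posixpath.normpath(posixpath.join(app_path, entry)))
    return resolved


def tree_bytes(root):
    """Total size in bytes of regular files under root (symlinks skipped)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            if not os.path.islink(full):
                total += os.path.getsize(full)
    return total


def selected_fraction(repo_root, annotation, app_path):
    """Selected bytes divided by total bytes in repo_root.

    Overlapping paths are counted twice, and an empty repo_root raises
    ZeroDivisionError; both are acceptable for a rough estimate.
    """
    selected = sum(
        tree_bytes(os.path.join(repo_root, path))
        for path in resolve_paths(annotation, app_path)
    )
    return selected / tree_bytes(repo_root)
```

For the repository described earlier in this thread (27M of manifests and libraries in a 320MB checkout, 216M of which is .git), the selected fraction would be small. The feature should help most in setups like that and much less when the annotation has to cover most of the repo.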