
Very slow performance of newest argocd versions - plugin + monorepo

Open momilo opened this issue 1 year ago

Checklist:

  • [x] I've searched in the docs and FAQ for my answer: https://bit.ly/argocd-faq.
  • [x] I've included steps to reproduce the bug.
  • [x] I've pasted the output of argocd version.

Describe the bug

Unfortunately, upgrading to the newest versions of ArgoCD (with the cmp migrated to a sidecar) resulted in an order of magnitude degradation of refresh/sync/deployment speeds.

Setup

Monorepo with c. 50 Applications, all defined under a single sub-dir, written in jsonnet + tanka. An Application in our realm is a microservice (deployment + a few helper manifests), nothing extraordinary. We run repo-server with --plugin-tar-exclude set to .git/*.

Bug / Observations

  1. It takes approximately 40 minutes in total (with some parallelism) to invoke app set (to specify image_tag, a plugin variable) and then app sync for all Applications, i.e. c. 3-5 minutes per Application. In contrast, running a diff and applying manually (tk apply . --ext-str=image_tag=MY_TAG) takes a fraction of this time (c. 5 seconds per Application).
  2. It takes approximately 30-60 seconds to invoke a refresh operation on a single Application (via UI).
  3. It takes approximately 20-50 seconds to open the Parameters page on a single Application Details view (UI). This is quite interesting, as showing the Diff tab is (usually) almost instantaneous.
  4. The repo server's CPU usage spikes significantly during deployment.

Normal operation:

NAME                                                        CPU(cores)   MEMORY(bytes)
argo-cd-argocd-application-controller-0                     26m          868Mi
argo-cd-argocd-repo-server-77f98c748c-8w7z9                 2m           705Mi

During a manually-triggered update (app set) + sync:

NAME                                                        CPU(cores)   MEMORY(bytes)
argo-cd-argocd-application-controller-0                     456m         930Mi
argo-cd-argocd-repo-server-77f98c748c-8w7z9                 5220m        769Mi

Potential Solutions

We see some hope in (and look forward to) the following potential solutions, although it is naturally difficult to gauge a priori to what extent any of them would resolve the issues observed:

  1. Reintroducing the previous way of configuring CMPs (i.e. not via sidecars), perhaps as an alternative, leaving the choice (including any potential security implications) up to the users.
  2. Introducing --plugin-tar-include (i.e. an include-only option limited to the manifests dir) - probably of limited benefit, since we are already passing --plugin-tar-exclude, which excludes the bulk of the repo.
  3. Supporting sparse and/or limited depth checkout (I note the already started, and much appreciated, albeit potentially put on hold, work in https://github.com/argoproj/argo-cd/pull/16064 and https://github.com/argoproj/argo-cd/pull/14272)
  4. Independent optimisation / debugging of the Parameters tab (surely, merely viewing the existing inputs should not be very time/resource consuming, perhaps indicating some unnecessary steps in the current implementation?)
  5. ... probably many more, which I cannot think of immediately.

To Reproduce

Store manifests of 50-100 Applications in a mono-repo, together with other code (Go, TS, etc.). Use tanka to apply them, configured via an argocd sidecar plugin (example plugin configuration here).
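
A minimal plugin.yaml for such a tanka sidecar might look roughly like the sketch below (this is illustrative, not our exact configuration; the plugin name, the discover rule, and the image_tag handling are assumptions):

# plugin.yaml, mounted into the tanka sidecar at /home/argocd/cmp-server/config/plugin.yaml
apiVersion: argoproj.io/v1alpha1
kind: ConfigManagementPlugin
metadata:
  name: tanka
spec:
  init:
    # init does no real work in our case
    command: [sh, -c, 'echo "tanka plugin init"']
  generate:
    # image_tag is supplied per Application as a plugin env var and exposed
    # to the command with the ARGOCD_ENV_ prefix; tk may additionally need
    # --dangerous-allow-redirect when its output is piped
    command: [sh, -c, 'tk show . --ext-str=image_tag=${ARGOCD_ENV_image_tag}']
  discover:
    # placeholder discovery rule; see the later comments about skipping
    # discovery by naming the plugin explicitly on the Application
    fileName: "./main.jsonnet"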

Expected behavior

A simple update with minimal changes should be relatively fast (taking a bit longer than a manual apply, but not 10-50x longer).

Version

argocd-server: v2.10.6+d504d2b

momilo avatar Apr 08 '24 12:04 momilo

I don't think we have enough information to really brainstorm solutions yet.

The spiked repo-server CPU usage is a good hint. Do you know if any Argo CD component is hitting its CPU limits at any time, i.e. being throttled?

crenshaw-dev avatar Apr 08 '24 19:04 crenshaw-dev

I appreciate that I owe you a proper analysis. My apologies - I still haven't had a chance to set up Prometheus to scrape the metrics/traces, set up pprof, etc.

To just answer your previous question - there is no throttling observed; the nodes are fairly big, with plenty of RAM and CPU headroom (under normal circumstances).

In the meantime, I wanted to share the following, in case useful and perhaps symptomatic of other issues listed above.

When navigating to Application -> Details -> Params tab (which, in theory, should only show two variables as inputs), Argo CD is timing out (and not loading the tab).

The following three log lines can be observed (note the 35+ seconds duration before the kill):

repo-server (container: repo-server)

"jsonPayload": {
    "error": "failed to populate plugin app details: error sending file to cmp-server: error sending generate manifest metadata to cmp-server: EOF",
    "grpc.method": "GetAppDetails",
    "system": "grpc",
    "level": "error",
    "msg": "finished unary call with code Unknown",
    "span.kind": "server",
    "grpc.code": "Unknown",
    "grpc.start_time": "2024-05-16T12:06:25Z",
    "grpc.time_ms": 35504.46,
    "grpc.service": "repository.RepoServerService"
  },

repo-server (container: tanka)

"jsonPayload": {
    "grpc.code": "Canceled",
    "level": "info",
    "system": "grpc",
    "msg": "finished streaming call with code Canceled",
    "span.kind": "server",
    "grpc.method": "GetParametersAnnouncement",
    "error": "parameters announcement error receiving stream: error receiving stream header: rpc error: code = Canceled desc = context canceled",
    "grpc.time_ms": 8257.146,
    "grpc.service": "plugin.ConfigManagementPluginService",
    "grpc.start_time": "2024-05-16T12:06:47Z"
},

repo-server (container: tanka)

"jsonPayload": {
    "span.kind": "server",
    "level": "info",
    "grpc.service": "plugin.ConfigManagementPluginService",
    "grpc.method": "MatchRepository",
    "system": "grpc",
    "grpc.time_ms": 20714.465,
    "grpc.code": "OK",
    "msg": "finished streaming call with code OK",
    "grpc.start_time": "2024-05-16T12:06:26Z"
  },

In case useful:

  • total repo size: 320MB (inc. 216M for .git)
  • kubernetes folder size (all manifests + libraries): 27M
  • single application folder (located in kubernetes/.../someapp) size: 20K

repo-server is run with the following args (the exclusions list all folders other than our kubernetes folder):

containers:
  - args:
    - /usr/local/bin/argocd-repo-server
    - --port=8081
    - --metrics-port=8084
    - --plugin-tar-exclude=".git/*"
    - --plugin-tar-exclude="assets/*"
    - --plugin-tar-exclude="bin/*"
    - --plugin-tar-exclude="build/*"
    - --plugin-tar-exclude="docs/*"
    - --plugin-tar-exclude="go/*"
    - --plugin-tar-exclude="js/*"
    - --plugin-tar-exclude="proto/*"
    - --parallelismlimit=10

All applications are generated from AppSets looking at kubernetes/.../mycluster etc.
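
As a rough sketch (the paths, names and repo URL below are placeholders, not our real layout), such an AppSet looks approximately like:

apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: mycluster-apps
spec:
  generators:
    - git:
        repoURL: https://git.example.com/our-org/monorepo.git
        revision: HEAD
        directories:
          # one Application per sub-directory under the cluster folder
          - path: kubernetes/environments/mycluster/*
  template:
    metadata:
      name: '{{path.basename}}'
    spec:
      project: default
      source:
        repoURL: https://git.example.com/our-org/monorepo.git
        targetRevision: HEAD
        path: '{{path}}'
        plugin:
          # no plugin name here, so the repo-server relies on discovery (MatchRepository);
          # image_tag is later overridden per deployment via argocd app set
          env:
            - name: image_tag
              value: latest
      destination:
        server: https://kubernetes.default.svc
        namespace: '{{path.basename}}'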

For the record - running tk show . --ext-str=image_tag=sometag for a given Application (which is what the CMP does for generate; init is just echo "tanka plugin init") takes c. 153ms (I've just checked). All jsonnet libraries are vendored (hence the 27MB size), so nothing is being downloaded on the fly.

It feels suspicious that: (a) sending across 27MB would take over 30 seconds and time out; (b) it is necessary to resolve the manifests to just show the Params (often used e.g. to quickly check the image_tag provided to argo-cd etc.).

(Our main problem is the overall sync-up of 30-50 Applications on deployment, which can take 30min+ just to update image tags, but... perhaps the above issue, which is also quite troublesome, is related).

momilo avatar May 16 '24 12:05 momilo

Regarding sparse/shallow checkout, let's consolidate that part of the conversation here: https://github.com/argoproj/argo-cd/issues/11198

crenshaw-dev avatar Oct 09 '24 20:10 crenshaw-dev

I think performance will be improved by #18053 (to be released in v2.13), where we can skip MatchRepo if we set pluginName explicitly. I noticed that MatchRepo takes a very long time for large monorepos.
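
Concretely, that means setting spec.source.plugin.name on the Application, e.g. (a sketch; the app name, plugin name, repo URL and paths are placeholders):

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: someapp
spec:
  project: default
  source:
    repoURL: https://git.example.com/our-org/monorepo.git
    targetRevision: HEAD
    path: kubernetes/environments/mycluster/someapp
    plugin:
      # naming the plugin explicitly lets the repo-server skip discovery (MatchRepository);
      # if the plugin.yaml declares spec.version, the name must be <name>-<version>
      name: tanka
      env:
        - name: image_tag
          value: sometag
  destination:
    server: https://kubernetes.default.svc
    namespace: someapp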

toyamagu-2021 avatar Oct 27 '24 14:10 toyamagu-2021

Hi All,

I am experiencing the same issues in ArgoCD 2.12.4. I have an Application with an explicitly set plugin name and a plugin without a discover action. I see in the logs that significant time is spent in the MatchRepository call; how can this be improved? Thanks

time="2025-01-17T09:57:49Z" level=info msg="finished streaming call with code OK" grpc.code=OK grpc.method=MatchRepository grpc.service=plugin.ConfigManagementPluginService grpc.start_time="2025-01-17T09:57:37Z" grpc.time_ms=11753.513 span.kind=server system=grpc time="2025-01-17T09:58:01Z" level=info msg="Generating manifests with no request-level timeout" time="2025-01-17T09:58:01Z" level=info msg="avp-helm init" dir=/tmp/_cmp_server/79dac6cf-02a3-4e00-a51c-34cbfcdaa799/gen/charts/test execID=19416 time="2025-01-17T09:58:04Z" level=warning msg="Plugin command returned zero output" command="{[avp-helm init] []}" execID=19416 stderr= time="2025-01-17T09:58:04Z" level=info msg="bash -c "eval $(avp-helm generate)"" dir=/tmp/_cmp_server/79dac6cf-02a3-4e00-a51c-34cbfcdaa799/gen/charts/test execID=6811e time="2025-01-17T09:58:04Z" level=info msg="Plugin command successful" command="{[bash -c eval $(avp-helm generate)] []}" execID=6811e stderr= time="2025-01-17T09:58:05Z" level=info msg="finished streaming call with code OK" grpc.code=OK grpc.method=GenerateManifest grpc.service=plugin.ConfigManagementPluginService grpc.start_time="2025-01-17T09:57:49Z" grpc.time_ms=15430.835 span.kind=server system=grpc

DimitarKapashikov avatar Jan 17 '25 10:01 DimitarKapashikov

Simply upgrade to v2.13 and specify the plugin name explicitly! It works perfectly for us. MatchRepository is gone.

toyamagu-2021 avatar Jan 17 '25 10:01 toyamagu-2021

Is there any chance for the fix to be backported to 2.12?

DimitarKapashikov avatar Jan 17 '25 12:01 DimitarKapashikov

Hi! Maybe this issue helps too: https://github.com/argoproj/argo-cd/issues/17951 (available since 2.14-rc2).

jsolana avatar Jan 28 '25 16:01 jsolana

@DimitarKapashikov since it's a major feature/change, we won't be backporting #17951.

crenshaw-dev avatar Feb 26 '25 16:02 crenshaw-dev

The change in 2.14 provides a mode to have the repo-server respect the manifest-generate-paths annotation when transferring data to the plugin. So this will help if:

  1. The main bottleneck is monorepo size - too much data is being transferred from the repo-server to the plugin (this happens as a gzipped grpc stream over a localhost websocket)
  2. You have configured accurate manifest-generate-paths annotations on the apps using the plugin (see the annotation sketch after this list)
  3. The selected paths represent much less disk space than the full repo would (this can be difficult when using Kustomize, which often requires targeting a large subset of repo files)
  4. You have upgraded to 2.14
  5. You have enabled the feature
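
As a sketch of points 2 and 3 (the app name, plugin name, repo URL and paths are placeholders), the annotation lists the paths the app actually needs, either relative to the app path or absolute within the repo:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: someapp
  annotations:
    # "." is the app's own directory; "/kubernetes/lib" is repo-root-relative;
    # with the 2.14 opt-in enabled, only these paths are sent to the plugin
    argocd.argoproj.io/manifest-generate-paths: .;/kubernetes/lib
spec:
  project: default
  source:
    repoURL: https://git.example.com/our-org/monorepo.git
    targetRevision: HEAD
    path: kubernetes/environments/mycluster/someapp
    plugin:
      name: tanka
  destination:
    server: https://kubernetes.default.svc
    namespace: someapp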

crenshaw-dev avatar Feb 26 '25 16:02 crenshaw-dev