
Too many applications causing webhook to timeout

Open KevinM2k opened this issue 1 year ago • 11 comments

Checklist:

  • [x] I've searched in the docs and FAQ for my answer: https://bit.ly/argocd-faq.
  • [x] I've included steps to reproduce the bug.
  • [x] I've pasted the output of argocd version.

Describe the bug

We have over 1500 applications. In GitLab, a configured webhook has a maximum timeout of 10s. Calling the webhook locally takes around 15s, so GitLab times out and disables the webhook.

Is there any way of speeding this up?

To Reproduce

Create over 1500 applications in Argo CD, call the webhook URL from Postman with a valid push event, and time how long the request takes.

Expected behavior

Returns with a response within 10s

Version

v2.6.11+697fd7c

KevinM2k avatar Jun 29 '23 13:06 KevinM2k

We had similar problems in the past. On clusters with ~500 apps, the webhook request took about 10 seconds to process them all. With the latest version, that time has decreased to about 5s.

If I am not mistaken, ArgoCD processes all applications in a webhook request sequentially instead of in parallel.

Is there any reason not to do it in parallel?

hlastras avatar Aug 02 '23 08:08 hlastras

I'm fairly certain it can and should be parallelized.

I'm surprised it takes more than a second. Feels a bit like the processing loop is probably doing some network-bound work as part of each iteration. That would probably be worth some investigation.

crenshaw-dev avatar Aug 02 '23 11:08 crenshaw-dev

Faced the same issue, webhook calls taking >10s to finish.

It is surely doing quite a bit of network-bound work here, potentially calling argo.RefreshApp() or storePreviouslyCachedManifests() in each iteration.

https://github.com/argoproj/argo-cd/blob/eb526ff1bdddea09c8dfd90373968fa85ac48b4f/util/webhook/webhook.go#L291-L311

phanama avatar Aug 26 '23 03:08 phanama

While doing the refreshes concurrently should solve the issue since it will be much faster, I was looking at the code and it seems like the result of the operation is not used in any way: https://github.com/argoproj/argo-cd/blob/f33005b10427c9894d1830476423fc36b412debb/util/webhook/webhook.go#L512

I'm wondering if we should instead return a response as soon as the request is validated and do the actual processing in parallel. I don't see why we need to make the webhook wait for the operation to complete if we don't need the result in the response.

alexymantha avatar Sep 13 '23 13:09 alexymantha

I also imagined something similar and thought that we'd ideally need some kind of queue for the webhook requests, but this would make the argocd-server somewhat stateful. We could optionally use the existing Redis as the queue and run background workers inside the argocd-server. That would not be stateful, but I'm not sure we want to do this (additionally, Redis would then no longer be just a cache).

> I'm wondering if we should instead return a response as soon as the request is validated and do the actual processing in parallel.

Thinking about it again, maybe we could just move all processing to the background and then return 200 immediately to webhook clients. The webhook handler only runs to tell argocd-app-controller to refresh specific apps by adding the refresh-annotation anyway. The actual "queue" is all those apps with their refresh-annotation set. But in this setup, we confidently assume that the argocd-server will always finish putting apps into the "queue". Not sure if this is ok, especially if webhook clients might want to do something (e.g. retry or notify/alert) in case the webhook server crashes while still adding refresh-annotations to apps.

Also, I think we still want the would-be-background app-refreshes to finish quickly (with concurrent processing).

phanama avatar Sep 13 '23 16:09 phanama

I'm running into this too. I don't know that much about ArgoCD internals, but I think refreshing the apps in a background thread and returning a response immediately could be a decent solution - even if the processing fails to complete, it's not the end of the world - apps get refreshed periodically anyway, so "best effort" seems alright (to me anyway).

taliastocks avatar Nov 29 '23 21:11 taliastocks

We are running into this too: our webhook requests are timing out at Argo CD. Has there been any update on the fix? I see the PR https://github.com/argoproj/argo-cd/pull/15326 has had no updates for 4 months.

Savasw avatar Feb 15 '24 05:02 Savasw

Any update on this bug? Or is there any workaround, like increasing the timeout?

morawat avatar May 02 '24 22:05 morawat

I have raised a PR to do the processing in the background: https://github.com/argoproj/argo-cd/pull/18173

dhruvang1 avatar May 12 '24 18:05 dhruvang1

Hello everyone, just to give an update: we are seeing the same situation with GitHub Enterprise, but it seems that the webhook events are processed by ArgoCD within a few seconds. We can see the commits a few seconds after pushing them. In this case I think it's just that the response takes longer than the 10 seconds allowed to reach GitHub or GitLab, so the event is considered not delivered, although the payload is processed. Thanks for the PR!

ricardojdsilva87 avatar May 14 '24 08:05 ricardojdsilva87

We are also seeing the problem on GitLab.com. We have over 500 apps and the webhook mostly takes between 10.1 and 10.2 seconds. It would be nice to have this implemented upstream.

The end result is that GitLab disables our webhook integration.

solidnerd avatar May 21 '24 06:05 solidnerd

We don't have that many applications (76) and are also hitting the timeout (on a private GitLab instance). We do have a complex application set, with several matrix generators (combining git + clusters) nested in a merge generator, so I don't know if this is a factor.

VannTen avatar May 31 '24 10:05 VannTen