argo-cd
Too many applications causing webhook to timeout
Checklist:
- [x] I've searched in the docs and FAQ for my answer: https://bit.ly/argocd-faq.
- [x] I've included steps to reproduce the bug.
- [x] I've pasted the output of `argocd version`.
Describe the bug
We have over 1500 applications. A webhook configured in GitLab has a maximum timeout of 10s. Calling the webhook locally takes around 15s, which means GitLab times out and disables the webhook.
Is there any way of speeding this up?
To Reproduce
Create over 1500 applications in Argo CD, call the webhook URL from Postman with a valid push event, and time how long it takes.
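For anyone who wants to reproduce the timing without Postman, here is a minimal Go sketch. It assumes the server accepts a GitLab-style push event (header `X-Gitlab-Event: Push Hook`) on `/api/webhook`. The helper names `timeWebhook` and `withinBudget` are made up for illustration, and the local stub server only stands in for a real Argo CD instance.

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// timeWebhook POSTs payload to url as a GitLab push event and returns how long
// the round trip took, together with the HTTP status code.
func timeWebhook(client *http.Client, url string, payload []byte) (time.Duration, int, error) {
	req, err := http.NewRequest(http.MethodPost, url, bytes.NewReader(payload))
	if err != nil {
		return 0, 0, err
	}
	req.Header.Set("Content-Type", "application/json")
	req.Header.Set("X-Gitlab-Event", "Push Hook")

	start := time.Now()
	resp, err := client.Do(req)
	if err != nil {
		return time.Since(start), 0, err
	}
	defer resp.Body.Close()
	return time.Since(start), resp.StatusCode, nil
}

// withinBudget (hypothetical helper) times a call against a local stub that
// sleeps for delayMs and reports whether the response arrived within budgetMs.
func withinBudget(delayMs, budgetMs int) (bool, error) {
	stub := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		time.Sleep(time.Duration(delayMs) * time.Millisecond)
		w.WriteHeader(http.StatusOK)
	}))
	defer stub.Close()

	elapsed, _, err := timeWebhook(stub.Client(), stub.URL+"/api/webhook", []byte(`{}`))
	if err != nil {
		return false, err
	}
	return elapsed <= time.Duration(budgetMs)*time.Millisecond, nil
}

func main() {
	// 10000 ms mirrors the 10s GitLab timeout described in this issue.
	ok, err := withinBudget(50, 10000)
	fmt.Println("within budget:", ok, "error:", err)
}
```

To test a real instance, call `timeWebhook` with your own server URL and a recorded push payload instead of the stub.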
Expected behavior
Returns with a response within 10s
Version
v2.6.11+697fd7c
We had similar problems in the past. On clusters with ~500 apps, the webhook request took about 10 seconds to process them all. With the latest version, that time has decreased to about 5s.
If I am not mistaken, Argo CD processes all applications sequentially in a webhook request instead of parallelising the work.
Is there any reason not to do it in parallel?
I'm fairly certain it can and should be parallelised.
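To make the parallelisation idea concrete, here is a rough Go sketch of a bounded-concurrency loop. It is not the Argo CD implementation: `refreshFunc`, `refreshAll`, and `errSimulated` are hypothetical names, and the per-app work is simulated with a sleep rather than a real refresh call.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// errSimulated is a placeholder error for demonstrations and tests.
var errSimulated = errors.New("simulated refresh failure")

// refreshFunc stands in for whatever per-application work the handler does.
type refreshFunc func(app string) error

// refreshAll calls refresh once per app, running at most `workers` calls at
// the same time, and returns how many calls failed.
func refreshAll(apps []string, workers int, refresh refreshFunc) int {
	if workers < 1 {
		workers = 1
	}
	var (
		wg     sync.WaitGroup
		mu     sync.Mutex
		failed int
	)
	sem := make(chan struct{}, workers) // counting semaphore
	for _, app := range apps {
		wg.Add(1)
		sem <- struct{}{} // blocks while `workers` calls are in flight
		go func(app string) {
			defer wg.Done()
			defer func() { <-sem }()
			if err := refresh(app); err != nil {
				mu.Lock()
				failed++
				mu.Unlock()
			}
		}(app)
	}
	wg.Wait()
	return failed
}

func main() {
	apps := make([]string, 200)
	for i := range apps {
		apps[i] = fmt.Sprintf("app-%d", i)
	}
	// Pretend each refresh costs 10ms of network time.
	slow := func(string) error { time.Sleep(10 * time.Millisecond); return nil }

	start := time.Now()
	failed := refreshAll(apps, 20, slow)
	fmt.Printf("refreshed %d apps, %d failed, took %s\n",
		len(apps), failed, time.Since(start).Round(time.Millisecond))
}
```

Capping the number of workers keeps a single webhook call from flooding the API server, which is the main reason to prefer a bounded pool over one goroutine per application.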
I'm surprised it takes more than a second. Feels a bit like the processing loop is probably doing some network-bound work as part of each iteration. That would probably be worth some investigation.
Faced the same issue, webhook calls taking >10s to finish.
It is surely doing quite some network-bound work here, potentially calling argo.RefreshApp() or storePreviouslyCachedManifests() in each iteration.
https://github.com/argoproj/argo-cd/blob/eb526ff1bdddea09c8dfd90373968fa85ac48b4f/util/webhook/webhook.go#L291-L311
While doing the refreshes concurrently should solve the issue since it will be much faster, I was looking at the code and it seems like the result of the operation is not used in any way: https://github.com/argoproj/argo-cd/blob/f33005b10427c9894d1830476423fc36b412debb/util/webhook/webhook.go#L512
I'm wondering if we should instead return a response as soon as the request is validated and do the actual processing in parallel. I don't see why we need to make the webhook wait for the operation to complete if we don't need the result in the response.
I imagined something similar and thought that we'd ideally need some kind of queue for the webhook requests, but this would make the argocd-server somewhat stateful. We could optionally use the existing redis as the queue and run background workers inside the argocd-server. That would not be stateful, but I'm not sure we want to do this (it would also mean redis is no longer just a cache).
> I'm wondering if we should instead return a response as soon as the request is validated and do the actual processing in parallel.
Thinking about it again, maybe we could just put all processing to the background and then return 200 immediately to webhook clients. The webhook handler anyway just runs to tell argocd-app-controller to refresh specific apps by adding the refresh-annotation. The actual "queue" is all those apps with their refresh-annotation set. But in this setup, we confidently assume that the argocd-server will always finish putting apps into the "queue". Not sure if this is ok, especially if webhook clients might want to do something (e.g. retry or notify/alerts) in case the webhook server crashes while still adding refresh-annotations to apps.
Also, I think we still want the would-be-background app-refreshes to finish quickly (with concurrent processing).
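A rough Go sketch of this "respond immediately, process in the background" idea follows. It is only an illustration under stated assumptions, not the code from either PR: `backgroundHandler`, `newBackgroundHandler`, and `postStatus` are hypothetical names, payload validation is omitted, and the queue lives in memory, so queued events are lost if the process crashes (the "best effort" trade-off raised in this thread).

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
	"time"
)

// backgroundHandler accepts webhook payloads and hands them to worker
// goroutines through a bounded in-memory queue.
type backgroundHandler struct {
	queue chan []byte
}

// newBackgroundHandler starts `workers` goroutines that call process for
// every queued payload.
func newBackgroundHandler(queueSize, workers int, process func(payload []byte)) *backgroundHandler {
	h := &backgroundHandler{queue: make(chan []byte, queueSize)}
	for i := 0; i < workers; i++ {
		go func() {
			for payload := range h.queue {
				process(payload)
			}
		}()
	}
	return h
}

// ServeHTTP replies as soon as the payload is queued. A real handler would
// validate the secret and parse the event before enqueueing.
func (h *backgroundHandler) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	payload, err := io.ReadAll(io.LimitReader(r.Body, 1<<20))
	if err != nil {
		http.Error(w, "cannot read payload", http.StatusBadRequest)
		return
	}
	select {
	case h.queue <- payload:
		w.WriteHeader(http.StatusOK)
	default:
		// Queue is full: tell the sender instead of blocking past its timeout.
		http.Error(w, "webhook queue full", http.StatusServiceUnavailable)
	}
}

// postStatus sends body to the handler in-process and returns the status code.
func postStatus(h http.Handler, body string) int {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodPost, "/api/webhook", strings.NewReader(body))
	h.ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	h := newBackgroundHandler(100, 4, func(payload []byte) {
		time.Sleep(2 * time.Second) // pretend refreshing the matching apps is slow
	})
	start := time.Now()
	status := postStatus(h, `{"ref":"refs/heads/main"}`)
	fmt.Println("status:", status, "answered in under 1s:", time.Since(start) < time.Second)
}
```

Rejecting requests when the queue is full gives webhook clients a signal they can retry on, which partly addresses the concern about the server silently dropping work.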
I'm running into this too. I don't know that much about ArgoCD internals, but I think refreshing the apps in a background thread and returning a response immediately could be a decent solution - even if the processing fails to complete, it's not the end of the world - apps get refreshed periodically anyway, so "best effort" seems alright (to me anyway).
We are running into this too: our webhook requests to Argo CD are timing out. Has there been any update on the fix? I see the PR https://github.com/argoproj/argo-cd/pull/15326 has had no updates for 4 months.
Any update on this bug? Or is there a workaround, like increasing the timeout?
I have raised a PR to do the processing in the background: https://github.com/argoproj/argo-cd/pull/18173
Hello everyone, just to give an update: we are seeing the same situation with GitHub Enterprise, but it seems that the webhook events are processed by ArgoCD within a few seconds. We can see the commits a few seconds after pushing them. In this case I think the response simply takes longer than the 10 seconds allowed by GitHub or GitLab, so the delivery is considered failed although the payload is processed. Thanks for the PR!
We are also seeing the problem on GitLab.com. We have over 500 apps and the webhook mostly takes between 10.1 and 10.2 seconds. It would be nice to have this implemented upstream.
In the end, this results in GitLab disabling our webhook integration.
We don't have that many applications (76) and are also hitting the timeout, on a private GitLab instance. We do have a complex application set, with several matrix generators (combining git + clusters) nested in a merge generator, so I don't know if this is a factor.