Bitbucket EventSource querying the Bitbucket API too often
After adding more than 20 repos using the same credentials, I start getting 429 Too Many Requests responses, which prevent webhooks from being created.
Environment (please complete the following information):
- Kubernetes: 1.21
- Argo: 3.3.6
- Argo Events: v1.6.3
@daniel-codefresh
Message from the maintainers:
If you wish to see this enhancement implemented please add a 👍 reaction to this issue! We often sort issues this way to know what to prioritize.
@dave08 The reason this is happening is that the Bitbucket EventSource (like the rest of the git-related EventSources) has an “auto recovery” daemon for webhooks that runs every minute, checks whether the webhook already exists, and creates it if not. Bitbucket's API rate limit is 1000 webhook requests per hour: https://support.atlassian.com/bitbucket-cloud/docs/api-request-limits/ 20 repos * 60 mins ~= 1200 requests per hour...
@whynowy I think I have a solution for our race condition issue with deleteHookOnFinish which would help us get rid of the webhooks daemon mitigation once and for all, but by looking at this comment on the PR related to github eventsource webhooks daemon it seems to me that for some users the mitigation has become a feature. I'm wondering if we should keep the daemon anyway and provide an option to enable/configure it in case it is needed, wdyt?
What is your solution?
Making it configurable is also an option, I think.
The solution is quite simple:
The main reason we are facing a race condition when deleteHookOnFinish is enabled is that the new EventSource pod tries to use the same hook that could possibly be deleted by the old EventSource pod. My suggestion is that every new EventSource pod would "recreate" the hook in case it already exists (delete the old hook and create a new one) instead of using the previous one, so that each EventSource pod would work against a different hook subscription with a different ID. This way no "collision" can occur. WDYT? @whynowy
Any progress on this one @daniel-codefresh? It's really frustrating to have to keep removing other event sources to add new ones... and since this is dependent on the Argo Events release cycle, it would be great to have it in the upcoming version already... or maybe there's some way to make a separate plugin for this?
@daniel-codefresh: Wouldn't it be easier to just set the deployment strategy of the Git-like EventSources to Recreate instead of RollingUpdate? There's already support for that.
The RecreateStrategyEventSources was (I think) originally created to handle EventSources that don't support multiple subscriptions at the same time, but I suppose it can be used to eliminate the race condition in the webhook creation/deletion of the github/gitlab/bitbucket EventSources as well. We can then get rid of the goroutines.
Another reason some people like the "auto fixing" goroutine is that it acts as a reconciliation loop, recreating webhooks that are accidentally deleted, for example. But I agree with @whynowy that the EventSource is not the right place to handle this.
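For reference, the Recreate strategy being discussed is the standard Kubernetes Deployment field; a sketch of what the controller would need to set on a generated EventSource Deployment (the name here is illustrative):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: bitbucket-eventsource
spec:
  strategy:
    type: Recreate   # terminate the old pod before starting the new one
  selector:
    matchLabels:
      eventsource-name: bitbucket
  template:
    metadata:
      labels:
        eventsource-name: bitbucket
    spec:
      containers:
        - name: main
          image: quay.io/argoproj/argo-events:latest
```

With `Recreate`, the old pod's webhook cleanup (deleteHookOnFinish) completes before the new pod registers its hook, so the two pods never race over the same hook during a rollout.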
@svmaris I think that Recreate strategy for webhook related eventsources is an idea that might be worth considering, although it has some downsides compared to the first solution:
- It would introduce a bit of downtime to all webhook related eventsources which may lead to loss of webhook events and affect the overall reliability of the solution.
- Although this might be an edge case, the race condition would still exist for users who delete eventsource pods manually instead of managing them through the deployment, as Recreate strategy only applies for deployment upgrades.
In regards to the current "support" of Recreate strategy that you've mentioned, I'll let @whynowy comment on that as he’s way more familiar with that code but AFAIK we only use RollingUpdate strategy in all of our eventsources ATM. This code snippet might be a bit misleading though, and is probably a leftover of some legacy code. This seems to me more like a list of the eventsources for which we run leader elections.
It would introduce a bit of downtime to all webhook related eventsources which may lead to loss of webhook events and affect the overall reliability of the solution.
Agreed. Of course, this also happens in the current situation, where it can take up to a minute for the webhook to be recreated.
Although this might be an edge case, the race condition would still exist for users who delete eventsource pods manually instead of managing them through the deployment, as Recreate strategy only applies for deployment upgrades.
This is also true. Not only with manual Pod deletions, but also with forced evictions like issues with the underlying node, running on pre-emptive nodes, other issues that might occur... so this might not even be an edge case.
Regarding the solution you're proposing: If a new EventSource Pod is starting, it has no idea what id the previous Webhook had. Are you proposing to find the Webhook by its name, deleting it, and creating a new one with the same name?
In regards to the current "support" of Recreate strategy that you've mentioned, I'll let @whynowy comment on that as he’s way more familiar with that code but AFAIK we only use RollingUpdate strategy in all of our eventsources ATM. This code snippet might be a bit misleading though, and is probably a leftover of some legacy code. This seems to me more like a list of the eventsources for which we run leader elections.
Ah, I see now ... it is indeed a bit misleading. I have a File EventSource (which is on the list), but the Deployment is still running with the RollingUpdate strategy.
Regarding the solution you're proposing: If a new EventSource Pod is starting, it has no idea what id the previous Webhook had. Are you proposing to find the Webhook by its name, deleting it, and creating a new one with the same name?
Find the webhook by its repo and URL (not the name), which is exactly what we do today to find out whether a webhook already exists and requires an update; only instead of replacing the whole existing webhook configuration, I propose to just delete and recreate it.
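A minimal sketch of that delete-and-recreate flow, with a hypothetical hook client interface standing in for the real Bitbucket API wrapper (all names here are illustrative, not the actual Argo Events code):

```go
package main

import "fmt"

// Hook is a simplified webhook subscription.
type Hook struct {
	ID  int
	URL string
}

// HookClient abstracts the provider API; a real implementation
// would call Bitbucket's webhook endpoints.
type HookClient interface {
	List(repo string) ([]Hook, error)
	Delete(repo string, id int) error
	Create(repo, url string) (Hook, error)
}

// recreateHook deletes any existing hook pointing at url and creates
// a fresh one, so the new pod never shares a hook ID with the old
// pod (avoiding the deleteHookOnFinish race).
func recreateHook(c HookClient, repo, url string) (Hook, error) {
	hooks, err := c.List(repo)
	if err != nil {
		return Hook{}, err
	}
	for _, h := range hooks {
		if h.URL == url { // match by repo + URL, not by name
			if err := c.Delete(repo, h.ID); err != nil {
				return Hook{}, err
			}
		}
	}
	return c.Create(repo, url)
}

// fakeClient is an in-memory stand-in used for the demo below.
type fakeClient struct {
	hooks  []Hook
	nextID int
}

func (f *fakeClient) List(repo string) ([]Hook, error) { return f.hooks, nil }

func (f *fakeClient) Delete(repo string, id int) error {
	kept := f.hooks[:0]
	for _, h := range f.hooks {
		if h.ID != id {
			kept = append(kept, h)
		}
	}
	f.hooks = kept
	return nil
}

func (f *fakeClient) Create(repo, url string) (Hook, error) {
	f.nextID++
	h := Hook{ID: f.nextID, URL: url}
	f.hooks = append(f.hooks, h)
	return h, nil
}

func main() {
	c := &fakeClient{hooks: []Hook{{ID: 1, URL: "https://example.com/hook"}}, nextID: 1}
	h, _ := recreateHook(c, "my/repo", "https://example.com/hook")
	fmt.Println(h.ID, len(c.hooks)) // new ID 2, still exactly one hook
}
```

The key property is that the replacement hook always has a fresh ID, so a late delete from the old pod can no longer remove the hook the new pod depends on.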
@dave08 @svmaris @whynowy I've created a POC for this solution #2145
@daniel-codefresh Did this make it in the new release 1.7.2? I'm already starting to get desperate trying to juggle around this... thanks for solving this though!
Yes, it did: https://github.com/argoproj/argo-events/releases/tag/v1.7.2
Looking forward to hearing your feedback @dave08!