
Errors with large number of multiple workflows on GitHub (rate limits and timeouts)

fernandrone opened this issue 6 months ago · 3 comments

Component

server

Describe the bug

We've been experiencing two issues with repositories that have a "large" number of workflows: I have a repository with 41 workflows and another with 61. They're certainly outliers... but they exist.

  1. Secondary rate limits. They show up in the server logs as:
{"level":"error","error":"POST https://api.github.com/repos/<org>/<repo>/statuses/<hash>: 403 You have exceeded a secondary rate limit. Please wait a few minutes before you try again. If you reach out to GitHub Support for help, please include the request ID <id>.","time":"2024-08-21T13:31:15Z","message":"error setting commit status for <org>/<repo>/11082"}

There's also a variation: could not get folder from forge.

If the rate-limit error comes back in a timely manner (in less than 10 seconds), the server replies to the webhook with a 400 and a "failure to parse hook" message. We have observed that this is relatively common: repositories with 20 or so workflows receive "a few" of these daily.

  2. A more general issue is timeouts. GitHub enforces a 10-second timeout on webhooks. If the server takes too long to fetch all the workflows (whether or not it is rate limited) and doesn't reply within 10 seconds, the webhook shows up as a "504".

From what I've seen, the two issues are often correlated. A secondary rate limit will likely cause the webhook to time out. That said, a timeout alone is possible and not destructive by itself: if a webhook times out but the Woodpecker server still manages to parse all workflows (even if it takes more than 10 seconds) and process them, the job is created, picked up by an agent, and even updates the GitHub UI.

However, when secondary rate limits happen, it's worse. The server code doesn't handle rate limits. With the could not get folder from forge error, the pipeline job is not created. From the user's perspective this is a silent error: they push, but no pipeline is started, which can be very hard to debug without access to the server logs. There's also error setting commit status; in this case the implication seems to be (I didn't get to track one down) that the success/failure/running status isn't updated on the commit, which could cause issues with pull request validation.

FWIW I'm testing this internally in https://github.com/quintoandar/woodpecker/pull/20, which uses a GitHub secondary rate limit library to try to fix the first issue. Looks promising.
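
For illustration only, here is a minimal sketch of the general idea of absorbing secondary rate limits on the client side: an http.RoundTripper that, on a 403 with a Retry-After header, waits and retries the request. This is not the code from the PR above, and rateLimitTransport is a hypothetical name; it just shows the shape of the fix.

```go
package main

import (
	"net/http"
	"strconv"
	"time"
)

// rateLimitTransport wraps another RoundTripper and retries once after the
// Retry-After delay when GitHub answers 403 (secondary rate limit).
type rateLimitTransport struct {
	base http.RoundTripper
}

func (t *rateLimitTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	resp, err := t.base.RoundTrip(req)
	if err != nil || resp.StatusCode != http.StatusForbidden {
		return resp, err
	}
	secs, perr := strconv.Atoi(resp.Header.Get("Retry-After"))
	if perr != nil {
		return resp, nil // some other 403, pass it through
	}
	resp.Body.Close()
	time.Sleep(time.Duration(secs) * time.Second)

	// Rebuild the request body before retrying (relevant for POSTs such as
	// the commit-status call from the log above).
	retry := req.Clone(req.Context())
	if req.GetBody != nil {
		body, berr := req.GetBody()
		if berr != nil {
			return nil, berr
		}
		retry.Body = body
	}
	return t.base.RoundTrip(retry)
}

func main() {
	client := &http.Client{Transport: &rateLimitTransport{base: http.DefaultTransport}}
	_ = client // hand this client to the GitHub/forge SDK instead of the default one
}
```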

Fixing timeouts seems more complex. Arguably it could just be a hard limit in Woodpecker; maybe it makes no sense to support 40+ workflows. Otherwise the server would need to process them asynchronously through an internal queue (but then there's the scenario where Woodpecker replies 200 to GitHub and only later, asynchronously, finds out that some of the workflows are invalid).
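
To make the asynchronous option concrete, here is a hypothetical sketch of the "ack first, process later" idea: reply to GitHub inside its 10-second window and do the slow forge calls on a background queue. This is not how Woodpecker currently works, and names like fetchAndEnqueuePipeline are assumptions made up for the example.

```go
package main

import (
	"io"
	"log"
	"net/http"
)

type hookJob struct {
	deliveryID string
	payload    []byte
}

var hookQueue = make(chan hookJob, 256)

// hookHandler does only cheap validation before responding; listing 40+
// workflow files from the forge is deferred to the worker.
func hookHandler(w http.ResponseWriter, r *http.Request) {
	payload, err := io.ReadAll(r.Body)
	if err != nil {
		http.Error(w, "bad payload", http.StatusBadRequest)
		return
	}
	select {
	case hookQueue <- hookJob{deliveryID: r.Header.Get("X-GitHub-Delivery"), payload: payload}:
		w.WriteHeader(http.StatusAccepted)
	default:
		http.Error(w, "queue full", http.StatusServiceUnavailable)
	}
}

// worker fetches workflows and creates the pipeline. Invalid workflows can
// only be reported asynchronously (e.g. via a commit status), since GitHub
// has already received the 202.
func worker() {
	for job := range hookQueue {
		if err := fetchAndEnqueuePipeline(job); err != nil {
			log.Printf("webhook %s: %v", job.deliveryID, err)
		}
	}
}

// fetchAndEnqueuePipeline is a placeholder for the forge calls and pipeline creation.
func fetchAndEnqueuePipeline(job hookJob) error { return nil }

func main() {
	go worker()
	http.HandleFunc("/hook", hookHandler)
	log.Fatal(http.ListenAndServe(":8000", nil))
}
```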

Steps to reproduce

  1. Install woodpecker and configure GitHub as the forge
  2. To force the secondary rate limit and/or timeout, create a test repository with 100+ workflows (what they do doesn't matter)

Expected behavior

No response

System Info

Woodpecker 2.7.0, installed on Kubernetes, GitHub forge

Additional context

No response

Validations

  • [X] Read the docs.
  • [X] Check that there isn't already an issue that reports the same bug to avoid creating a duplicate.
  • [X] Checked that the bug isn't fixed in the next version already [https://woodpecker-ci.org/faq#which-version-of-woodpecker-should-i-use]

fernandrone · Aug 21 '24 19:08