Bot account hitting rate limit (again?)
We are seeing more {"message":"API rate limit exceeded for user ID 1617424. If you reach out to GitHub Support for help, please include the request ID 9E6E:0A32:178BB2:2A26F7:65FB3545.","documentation_url":"https://docs.github.com/rest/overview/rate-limits-for-the-rest-api"} errors in CI logs, e.g. https://ossci-raw-job-status.s3.amazonaws.com/log/22898061143. This is a new issue, and we need to take a closer look to find the root cause and potential fixes.
cc @clee2000 @PaliC @kit1980 @malfet
I'm pretty sure this is Suo's token in the github-status-test lambda.
PATs have a rate limit of 5000 requests/user/hr.
The limit resets every hour, so this problem will resolve itself and then may show up again an hour later. A look at Rockset says that we occasionally hit 4900+ workflow jobs on pytorch during peak working hours (this doesn't include all the other repos, but I don't know which ones are sending webhooks here), but this averages to ~1600/hr over the entire week.
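Since the limit resets on a fixed window, a caller can tell from the response headers whether it is out of quota and how long it has to wait. The sketch below is only an illustration, not what the lambda currently does: it assumes the headers are available as a plain dict with lowercase keys, and the function name `seconds_until_reset` is made up for this example. The `x-ratelimit-remaining` and `x-ratelimit-reset` headers are the ones GitHub documents for the REST API.

```python
import time


def seconds_until_reset(headers, now=None):
    """Return how many seconds to wait before retrying, or 0 if quota remains.

    `headers` is assumed to be a dict of response headers with lowercase keys.
    `x-ratelimit-reset` is a UTC epoch timestamp in seconds.
    """
    now = time.time() if now is None else now
    # If a header is missing, assume quota is available rather than blocking.
    remaining = int(headers.get("x-ratelimit-remaining", "1"))
    reset_at = int(headers.get("x-ratelimit-reset", "0"))
    if remaining > 0:
        return 0
    return max(0, reset_at - int(now))
```

Something like this could also be logged on every call, which would make it easier to see how close the token gets to the limit before the errors start.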
Here are some possible solutions:
- Move log download to the pytorchbot app (rate limit of 15000?) - pros: larger limit, cons: pytorchbot also needs the API for other things, and I'm not sure how we would know when the bot hits a rate limit
- Add more tokens (we have at least two bot accounts whose tokens we could use instead of Suo's) - cons: scales only by adding accounts
- Reduce log download in general - pros: permanent decrease to number of log downloads, cons: delayed log downloads
- Some variation of this where we delay log downloads until someone asks for them (e.g. click a button on HUD, maybe you have to be authenticated?), or until the workflow is finished (https://docs.github.com/en/rest/actions/workflow-runs?apiVersion=2022-11-28#download-workflow-run-logs), or slowly backfill the logs
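As a rough illustration of the "delay until finished" variation: if the lambda is triggered by `workflow_job` webhooks (an assumption, I haven't checked how it is wired), it could skip every event except the final one for each job. The helper names below are hypothetical; the `action` and `workflow_job.id` fields are part of the webhook payload.

```python
def should_download_logs(event):
    """Only fetch logs once the job has finished.

    `event` is assumed to be a parsed `workflow_job` webhook payload.
    """
    return event.get("action") == "completed"


def jobs_to_download(events):
    """Return the ids of jobs whose logs should be fetched, one per job."""
    seen = set()
    job_ids = []
    for event in events:
        if not should_download_logs(event):
            continue
        job_id = event["workflow_job"]["id"]
        if job_id not in seen:  # guard against redelivered webhooks
            seen.add(job_id)
            job_ids.append(job_id)
    return job_ids
```

This only reduces API usage if logs are currently fetched more than once per job; if they are already fetched once, the saving would have to come from the on-demand or backfill variations instead.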
I guess we need to stop using Suo's token anyway; that's not a good practice to keep. So we can do the ~~first~~ second point and observe to see if we need a new ~~bot~~ account?
It seems easier to go with the second approach and use our other bot accounts that have PATs. We don't need to go through the pytorchbot app route, which would require an OIDC connection from Vercel to our AWS account, and that is not supported at the moment: https://github.com/pytorch/test-infra/issues/4789#issuecomment-1904720271
Here are some docs on how to authenticate as a GitHub App: https://docs.github.com/en/apps/creating-github-apps/authenticating-with-a-github-app/authenticating-as-a-github-app
Added the ability to use more tokens in https://github.com/pytorch/test-infra/pull/5033. Still need to find another token from a bot to add.
Other options:
- Swap from lambda to gha to take advantage of repo level token limits (and also not use PATs)
- Make another bot at the org level just for this
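For reference, one way spreading requests over several tokens could work is to track each token's remaining quota and always pick the one with the most headroom. This is only a sketch under assumptions, not a description of what the PR above does: `TokenState`, `pick_token`, and the names `bot-a`/`bot-b` are hypothetical, and the 5000/hr figure is the PAT limit mentioned earlier in this thread.

```python
from dataclasses import dataclass

PAT_HOURLY_LIMIT = 5000  # per-user PAT limit discussed above


@dataclass
class TokenState:
    name: str                          # hypothetical label, e.g. "bot-a"
    remaining: int = PAT_HOURLY_LIMIT  # last seen x-ratelimit-remaining
    reset_at: int = 0                  # last seen x-ratelimit-reset (epoch seconds)


def effective_remaining(token, now):
    """Quota we expect the token to have at time `now`."""
    # Once the reset time has passed, the window has rolled over.
    if now >= token.reset_at:
        return PAT_HOURLY_LIMIT
    return token.remaining


def pick_token(tokens, now):
    """Return the token with the most expected quota, or None if all are exhausted."""
    if not tokens:
        return None
    best = max(tokens, key=lambda t: effective_remaining(t, now))
    if effective_remaining(best, now) <= 0:
        return None
    return best
```

After each API call the caller would update `remaining` and `reset_at` from the response headers, so the next call naturally moves to whichever token has quota left.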