Sporadic delays with GitHub runners starting workflow executions
Reports made in the #otel-maintainers Slack channel:
- March 21, 9:37a Pacific Time: https://cloud-native.slack.com/archives/C01NJ7V1KRC/p1742575048528319
- Tue, Apr 8, 9:13a Pacific Time: https://cloud-native.slack.com/archives/C01NJ7V1KRC/p1744128838919859
- Wed, Apr 9, 7:19a Pacific Time: https://cloud-native.slack.com/archives/C01NJ7V1KRC/p1744208353371499?thread_ts=1744128838.919859&cid=C01NJ7V1KRC
Using the data now available from #2606 (thanks @adrielp!), we can get proxy data for these delays by measuring one of the collector contrib workflows that runs often and is normally very fast.
This chart shows the number of executions longer than 2 minutes of the "Add code owners to a PR" job in the collector contrib repo:
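For anyone who wants to pull similar numbers themselves, here is a rough sketch against the GitHub Actions REST API. The workflow file name below is a placeholder (not necessarily the real one), and total run duration is only a proxy for queue time:

```python
# Rough sketch: count recent completed runs of a normally-fast workflow whose
# total wall-clock time exceeded 2 minutes, as a proxy for runner start delays.
import os
from datetime import datetime, timedelta

import requests

OWNER = "open-telemetry"
REPO = "opentelemetry-collector-contrib"
WORKFLOW_FILE = "codeowners.yml"  # placeholder for the "Add code owners to a PR" workflow
THRESHOLD = timedelta(minutes=2)

def parse(ts: str) -> datetime:
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ")

headers = {"Accept": "application/vnd.github+json"}
token = os.environ.get("GITHUB_TOKEN")  # optional for public repos, but avoids low rate limits
if token:
    headers["Authorization"] = f"Bearer {token}"

url = f"https://api.github.com/repos/{OWNER}/{REPO}/actions/workflows/{WORKFLOW_FILE}/runs"
resp = requests.get(url, headers=headers, params={"status": "completed", "per_page": 100})
resp.raise_for_status()

slow = 0
runs = resp.json()["workflow_runs"]
for run in runs:
    # updated_at - created_at covers queue time plus execution time, so a
    # normally-fast job exceeding the threshold suggests it sat in the queue.
    if parse(run["updated_at"]) - parse(run["created_at"]) > THRESHOLD:
        slow += 1

print(f"{slow} of {len(runs)} recent runs took longer than 2 minutes")
```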
I've opened a CNCF service desk ticket asking if they can look into it, since they own the GitHub runner limits; I'm just opening this as a tracking issue.
I added project-infra to this, even though the label description says that it is for 'non-Github' issues only. @trask, do you think there is a better area label for this?
@trask I've picked up your Service Desk request and am looking at this now.
There are basic liveness checks that I can carry out in real time at the enterprise settings level when you are experiencing delays in getting jobs serviced.
I am happy to escalate delays to the team at GitHub in real time to see if we can get more detailed information from them as to why jobs are sitting in pending queues for so long.
Reach out to me the next time this happens, listing the jobs and runners that are pending, and I will see what I can do.
Edit: A reminder that, in real time, status.github.com should be checked to see if there is a known operational issue.
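If it helps, the operational status can also be checked programmatically; a minimal sketch, assuming the standard Statuspage JSON endpoint that status.github.com redirects to:

```python
# Minimal sketch: check GitHub's public status page before escalating a delay.
# Assumes the Statuspage v2 JSON endpoint on githubstatus.com; response fields
# may change, so treat this as illustrative rather than definitive.
import requests

resp = requests.get("https://www.githubstatus.com/api/v2/status.json", timeout=10)
resp.raise_for_status()
status = resp.json()["status"]
# "indicator" is typically one of: none, minor, major, critical
print(f"GitHub status: {status['description']} (indicator: {status['indicator']})")
```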
We haven't seen this lately, so I'm closing this, but definitely let us know if anyone experiences it going forward.