Intermittent GHA Listener Failures
Checks
- [X] I've already read https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners-with-actions-runner-controller/troubleshooting-actions-runner-controller-errors and I'm sure my issue is not covered in the troubleshooting guide.
- [X] I am using charts that are officially provided
Controller Version
0.9.3
Deployment Method
Helm
Checks
- [X] This isn't a question or user support case (For Q&A and community support, go to Discussions).
- [X] I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes
To Reproduce
Deploy helm charts
Wait for the listener to restart
Describe the bug
Intermittently, some of our listener pods become unresponsive for 15-20 minutes. This surfaces as long queue times for Workflows. It occurs ~4 times a day and usually correlates with load on the GHES server. It seems to happen in "waves", impacting roughly 90% of our listeners.
Observed behavior:
1. The listener throws `context deadline exceeded (Client.Timeout exceeded while awaiting headers)`. This error is repeated 3 times with 5 minute pauses between the events.
2. The listener throws `read tcp <REDACTED>:41054-><REDACTED>:443: read: connection timed out`.
3. One of the following occurs:
   - The controller restarts the listener pod and it comes back as healthy.
   - No error message is thrown and the listener continues on as expected.
   - The listener throws `Message queue token is expired during GetNextMessage, refreshing...` and continues on as expected.
During step 1 the listener is not functional, which causes 15-20 minutes of downtime.
Should this timeout be set to 1 minute? Is 5 minutes too long?
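For illustration only, here is a rough Go sketch of what I mean by a shorter per-attempt timeout. This is not the actual listener code; the function name and URL are made up. The idea is that each long-poll attempt is capped at 1 minute so a hung connection surfaces quickly instead of blocking for 5 minutes:

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"time"
)

// getNextMessage is a hypothetical stand-in for the listener's long-poll call.
func getNextMessage(ctx context.Context, client *http.Client, url string) (*http.Response, error) {
	// Cap each long-poll attempt at 1 minute so a silently dropped
	// connection is noticed and retried much sooner.
	attemptCtx, cancel := context.WithTimeout(ctx, 1*time.Minute)
	defer cancel()

	req, err := http.NewRequestWithContext(attemptCtx, http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	return client.Do(req)
}

func main() {
	client := &http.Client{} // per-attempt deadline comes from the context above
	resp, err := getNextMessage(context.Background(), client, "https://ghes.example.com/_services/pipelines/messages")
	if err != nil {
		fmt.Println("long poll failed:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```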
Note: We do not observe any other connectivity issues with our GHES instance. We are investigating our connectivity to GHES, the resiliency of the server, and its compatibility with HTTP long polling. With that said, I think there may be an opportunity here to make the listeners more resilient to networking blips.
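By "more resilient" I mean something like the retry-with-backoff loop below. Again, this is a hypothetical sketch, not ARC's implementation: a failed long-poll attempt is retried after a short, capped backoff rather than waiting out repeated 5-minute timeouts.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// poll is a hypothetical placeholder for one long-poll attempt.
func poll(ctx context.Context) error {
	return errors.New("read: connection timed out") // simulate a network blip
}

// pollWithRetry retries transient failures with a short, capped backoff.
func pollWithRetry(ctx context.Context) error {
	backoff := 2 * time.Second
	const maxBackoff = 30 * time.Second

	for attempt := 1; ; attempt++ {
		err := poll(ctx)
		if err == nil {
			return nil
		}
		fmt.Printf("attempt %d failed: %v; retrying in %s\n", attempt, err, backoff)

		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(backoff):
		}
		if backoff *= 2; backoff > maxBackoff {
			backoff = maxBackoff
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	if err := pollWithRetry(ctx); err != nil {
		fmt.Println("giving up:", err)
	}
}
```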
Describe the expected behavior
The listener recovers from transient network errors quickly, is not restarted by the controller, and does not become unresponsive for 15-20 minutes.
Additional Context
GHES Version: 3.9
Controller Logs
Listener logs: https://gist.github.com/jb-2020/13f246a361f039a54733f90f270eeafa
Controller logs: https://gist.github.com/jb-2020/18c6f276fd351e4f09ac894e545258e6
Runner Pod Logs
N/A