
Intermittent GHA Listener Failures


Checks

  • [X] I've already read https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners-with-actions-runner-controller/troubleshooting-actions-runner-controller-errors and I'm sure my issue is not covered in the troubleshooting guide.
  • [X] I am using charts that are officially provided

Controller Version

0.9.3

Deployment Method

Helm

Checks

  • [X] This isn't a question or user support case (For Q&A and community support, go to Discussions).
  • [X] I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes

To Reproduce

Deploy helm charts
Wait for the listener to restart

Describe the bug

Intermittently, some of our listener pods become unresponsive for 15-20 minutes. This surfaces as long queue times for Workflows. It occurs ~4 times a day and usually correlates with load on the GHES server. It seems to happen in 'waves', impacting roughly 90% of our listeners.

Observed behavior:

  1. The listener throws context deadline exceeded (Client.Timeout exceeded while awaiting headers) - this error is repeated 3 times with 5-minute pauses between the events.
  2. The listener throws: read tcp <REDACTED>:41054-><REDACTED>:443: read: connection timed out
  3. One of the following occurs:
    • The controller restarts the listener pod and it comes back as healthy
    • No error message is thrown and the listener continues on as expected
    • The listener throws: Message queue token is expired during GetNextMessage, refreshing... and it continues on as expected.

During step 1, the listener is not functional, causing 15-20 minutes of downtime.

Should this timeout be set to 1 minute? Is 5 minutes too long?
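
A minimal sketch of what I mean (hypothetical helper name and placeholder URL, not the actual listener code): bounding each long-poll attempt with a shorter per-request context so a dead connection surfaces well before the full 5-minute client timeout, letting the caller retry sooner.

```go
// Sketch only: an HTTP long-poll with a per-request deadline. With client.Timeout
// at ~5 minutes, a dead connection is only noticed after the full timeout; a
// shorter per-attempt deadline surfaces the failure sooner so the caller can retry.
package main

import (
	"context"
	"fmt"
	"net/http"
	"time"
)

// getNextMessage is a hypothetical stand-in for a single long-poll attempt.
func getNextMessage(ctx context.Context, client *http.Client, url string) (*http.Response, error) {
	// Bound this single long-poll attempt; 1 minute is the value suggested above.
	ctx, cancel := context.WithTimeout(ctx, 1*time.Minute)
	defer cancel()

	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	return client.Do(req)
}

func main() {
	client := &http.Client{Timeout: 5 * time.Minute} // overall cap, matching the observed error
	resp, err := getNextMessage(context.Background(), client, "https://ghes.example.com/_apis/queue") // placeholder URL
	if err != nil {
		fmt.Println("long poll failed; the listener could retry now instead of waiting:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```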

Note: We do not observe any other connectivity issues with our GHES instance. We are still investigating our connectivity to GHES, the resiliency of the server, and its compatibility with HTTP long polls. That said, I think there is an opportunity here to make the listeners more resilient to networking blips.
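
For illustration, a sketch of the kind of resilience I mean (hypothetical function names, not the listener's actual code): retrying a failed long-poll with capped exponential backoff instead of waiting out the long client timeout three times in a row.

```go
// Sketch only: retry a long-poll attempt with capped exponential backoff. A real
// implementation would also refresh the message-queue token when the server
// reports it as expired, as the listener already does today.
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

var errTimeout = errors.New("context deadline exceeded")

// poll stands in for a single GetNextMessage-style long-poll attempt.
func poll(ctx context.Context) error {
	// ... perform the HTTP long-poll here ...
	return errTimeout // simulate a networking blip
}

func pollWithRetry(ctx context.Context, attempts int) error {
	backoff := 5 * time.Second
	var err error
	for i := 0; i < attempts; i++ {
		if err = poll(ctx); err == nil {
			return nil
		}
		fmt.Printf("attempt %d failed (%v); retrying in %s\n", i+1, err, backoff)
		select {
		case <-time.After(backoff):
		case <-ctx.Done():
			return ctx.Err()
		}
		if backoff < time.Minute {
			backoff *= 2 // cap the backoff at one minute
		}
	}
	return err
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Minute)
	defer cancel()
	if err := pollWithRetry(ctx, 5); err != nil {
		fmt.Println("giving up:", err)
	}
}
```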

Describe the expected behavior

The listener is not restarted by the controller and doesn't become unresponsive for 15-20 minutes.

Additional Context

GHES Version: 3.9

Controller Logs

Listener logs: https://gist.github.com/jb-2020/13f246a361f039a54733f90f270eeafa

Controller logs: https://gist.github.com/jb-2020/18c6f276fd351e4f09ac894e545258e6

Runner Pod Logs

N/A

jb-2020 · Jul 25 '24 23:07