actions-runner-controller icon indicating copy to clipboard operation
actions-runner-controller copied to clipboard

Min Runners Scaling comparison delay cause additional runner to be created and terminated unnecessarily

Open rekha-prakash-maersk opened this issue 1 year ago • 5 comments

Checks

  • [X] I've already read https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners-with-actions-runner-controller/troubleshooting-actions-runner-controller-errors and I'm sure my issue is not covered in the troubleshooting guide.
  • [X] I am using charts that are officially provided

Controller Version

0.9.3 , 0.10.1

Deployment Method

Helm

Checks

  • [X] This isn't a question or user support case (For Q&A and community support, go to Discussions).
  • [X] I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes

To Reproduce

1. Set minRunners to 1 (Now we have one runner registered - Runner A)
2. Trigger two jobs simultaneously
3. Now we have two more runners registered ( Runners B & C  online - Total 3)
4. The Runners A and B received the jobs and C is waiting for the job maintaining min runner as 1
5. A complete the job and terminates, now we have new runner D spinning up.
6. Now B and C also gets terminated, and D remains..

Here C is unnecessarily terminated because of the latency in the scaling calculation. it was not executing any job, it was on stand by as per minRunners setup.

Describe the bug

when we set up minRunners, the additional runner which is in standby also gets terminated without even executing any job due to delay in scaling comparison of desired runners

Describe the expected behavior

In this case, the Runner C should remain active and Runner D should not have get created.

Additional Context

minRunners:1
maxRunners:150

Controller Logs

START OF THE 2 JOBS:

EphemeralRunnerSet    Scaling comparison    {"version": "0.10.1", "ephemeralrunnerset": {"name":"sh-ubuntu-latest-np-l","namespace":"arc-runners"}, "current": 3, "desired": 3}  │


AFTER ONE JOB COMPLETED, RUNNEr A TERMINATED (note , desired still shows 3)


EphemeralRunnerSet    Scaling comparison    {"version": "0.10.1", "ephemeralrunnerset": {"name":"sh-ubuntu-latest-np-l","namespace":"arc-runners"}, "current": 2, "desired": 3}   │

NOW RUNNER D GET CREATED(UNNECESSARY RUNNER), 

EphemeralRunnerSet    Scaling comparison    {"version": "0.10.1", "ephemeralrunnerset": {"name":"sh-ubuntu-latest-np-l","namespace":"arc-runners"}, "current": 3, "desired": 2}   │

NOW ANOTHER JOB COMPLETED, ONLY RUNNER C and D ARE ACTIVE, STILL DELAY IN UPDATING DESIRED

EphemeralRunnerSet    Scaling comparison    {"version": "0.10.1", "ephemeralrunnerset": {"name":"sh-ubuntu-latest-np-l","namespace":"arc-runners"}, "current": 2, "desired": 2}    │


NOW WE HAVE RUNNER C and D BOTH ARE WAITING FOR NEW JOB

 EphemeralRunnerSet    Scaling comparison    {"version": "0.10.1", "ephemeralrunnerset": {"name":"sh-ubuntu-latest-np-l","namespace":"arc-runners"}, "current": 2, "desired": 1} │


NOW RUNNER C GOT DELETED 

 EphemeralRunnerSet    Scaling comparison    {"version": "0.10.1", "ephemeralrunnerset": {"name":"sh-ubuntu-latest-np-l","namespace":"arc-runners"}, "current": 1, "desired": 1} │

Runner Pod Logs

LOGS OF RUNNER C waiting for job and gets terminated due to scaling comparison


[RUNNER 2025-01-02 16:42:01Z INFO Terminal] WRITE LINE: 

√ Connected to GitHub
[RUNNER 2025-01-02 16:42:01Z INFO Terminal] WRITE LINE: 

[RUNNER 2025-01-02 16:42:01Z INFO RSAFileKeyManager] Loading RSA key parameters from file /home/runner/.credentials_rsaparams
[RUNNER 2025-01-02 16:42:01Z INFO RSAFileKeyManager] Loading RSA key parameters from file /home/runner/.credentials_rsaparams
[RUNNER 2025-01-02 16:42:01Z INFO GitHubActionsService] AAD Correlation ID for this token request: Unknown
[RUNNER 2025-01-02 16:42:02Z INFO MessageListener] Session created.
[RUNNER 2025-01-02 16:42:02Z INFO Terminal] WRITE LINE: Current runner version: '2.321.0'
Current runner version: '2.321.0'
[RUNNER 2025-01-02 16:42:02Z INFO Terminal] WRITE LINE: 2025-01-02 16:42:02Z: Listening for Jobs
2025-01-02 16:42:02Z: Listening for Jobs
[RUNNER 2025-01-02 16:42:02Z INFO JobDispatcher] Set runner/worker IPC timeout to 30 seconds.
[RUNNER 2025-01-02 16:42:02Z INFO RSAFileKeyManager] Loading RSA key parameters from file /home/runner/.credentials_rsaparams
[RUNNER 2025-01-02 16:42:02Z INFO GitHubActionsService] AAD Correlation ID for this token request: Unknown
time="2025-01-02T16:53:24.543067339Z" level=info msg="Processing signal 'terminated'"
time="2025-01-02T16:53:24.544010095Z" level=info msg="stopping event stream following graceful shutdown" error="<nil>" module=libcontainerd namespace=moby
time="2025-01-02T16:53:24.544396972Z" level=info msg="Daemon shutdown complete"
time="2025-01-02T16:53:24.544453030Z" level=info msg="stopping healthcheck following graceful shutdown" module=libcontainerd
time="2025-01-02T16:53:24.544463570Z" level=info msg="stopping event stream following graceful shutdown" error="context canceled" module=libcontainerd namespace=plugins.moby
Exiting runner...
[RUNNER 2025-01-02 16:53:24Z INFO Terminal] WRITE LINE: Exiting...
Exiting...
[RUNNER 2025-01-02 16:53:24Z INFO ConfigurationStore] IsServiceConfigured()
[RUNNER 2025-01-02 16:53:24Z INFO ConfigurationStore] IsServiceConfigured: False
[RUNNER 2025-01-02 16:53:24Z INFO HostContext] Runner will be shutdown for UserCancelled
[RUNNER 2025-01-02 16:53:24Z WARN GitHubActionsService] GET request to https://broker.actions.githubusercontent.com/message?sessionId=ee4b09ac-0196-4a07-b068-f631314baf4b&status=Online&runnerVersion=2.321.0&os=Linux&architecture=X64&disableUpdate=true has been cancelled.
[RUNNER 2025-01-02 16:53:24Z ERR  BrokerServer] Catch exception during request
[RUNNER 2025-01-02 16:53:24Z ERR  BrokerServer] System.Threading.Tasks.TaskCanceledException: The operation was canceled.
[RUNNER 2025-01-02 16:53:24Z ERR  BrokerServer]  ---> System.Threading.Tasks.TaskCanceledException: The operation was canceled.
[RUNNER 2025-01-02 16:53:24Z ERR  BrokerServer]  ---> System.IO.IOException: Unable to read data from the transport connection: Operation canceled.
[RUNNER 2025-01-02 16:53:24Z ERR  BrokerServer]  ---> System.Net.Sockets.SocketException (125): Operation canceled
[RUNNER 2025-01-02 16:53:24Z ERR  BrokerServer]    --- End of inner exception stack trace ---
[RUNNER 2025-01-02 16:53:24Z ERR  BrokerServer]    at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.ThrowException(SocketError error, CancellationToken cancellationToken)
[RUNNER 2025-01-02 16:53:24Z ERR  BrokerServer]    at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.System.Threading.Tasks.Sources.IValueTaskSource<System.Int32>.GetResult(Int16 token)
[RUNNER 2025-01-02 16:53:24Z ERR  BrokerServer]    at System.Net.Security.SslStream.EnsureFullTlsFrameAsync[TIOAdapter](CancellationToken cancellationToken, Int32 estimatedSize)
[RUNNER 2025-01-02 16:53:24Z ERR  BrokerServer]    at System.Runtime.CompilerServices.PoolingAsyncValueTaskMethodBuilder`1.StateMachineBox`1.System.Threading.Tasks.Sources.IValueTaskSource<TResult>.GetResult(Int16 token)
[RUNNER 2025-01-02 16:53:24Z ERR  BrokerServer]    at System.Net.Security.SslStream.ReadAsyncInternal[TIOAdapter](Memory`1 buffer, CancellationToken cancellationToken)
[RUNNER 2025-01-02 16:53:24Z ERR  BrokerServer]    at System.Runtime.CompilerServices.PoolingAsyncValueTaskMethodBuilder`1.StateMachineBox`1.System.Threading.Tasks.Sources.IValueTaskSource<TResult>.GetResult(Int16 token)
[RUNNER 2025-01-02 16:53:24Z ERR  BrokerServer]    at System.Net.Http.HttpConnection.SendAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)

rekha-prakash-maersk avatar Jan 02 '25 17:01 rekha-prakash-maersk

Hello! Thank you for filing an issue.

The maintainers will triage your issue shortly.

In the meantime, please take a look at the troubleshooting guide for bug reports.

If this is a feature request, please review our contribution guidelines.

github-actions[bot] avatar Jan 02 '25 17:01 github-actions[bot]

We are also seeing the same error [ level=info msg="Processing signal 'terminated'" ] but on the dind container, followed by the message level=info msg="Daemon shutdown complete" This causes the the workflows fail with error Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running? as the docker daemon is not running anymore.

Did you manage to fix it @rekha-prakash-maersk ?

vineeth-asml avatar Feb 25 '25 17:02 vineeth-asml

Hi @vineeth-asml I believe both are different issue, as the issue I observed is a provisioning of additional runner pod.

For the docker container startup issue, please ensure you have allocated enough resources, also I would suggest to check on the kubectl logs for both docker and runner container to understand what causing the docker container to fail

rekha-prakash-maersk avatar Feb 25 '25 17:02 rekha-prakash-maersk

Hey! Did you find a solution to the error, @vineeth-asml ? We're seeing the same error. It looks like the runner is being killed from some (still) unknown point in the flow.

javisabalete avatar Mar 12 '25 09:03 javisabalete

Hello @javisabalete We did some customization on the startup script to not to start the runner service unless the docker.sock is not running. In our case we heavily rely on docker based actions.

Previously we were using the default startup script from the runner image with template: https://github.com/actions/runner/blob/main/src/Misc/layoutroot/run-helper.sh.template#L22

We customized out base image to use a custom startup template with similar logic as https://github.com/actions/actions-runner-controller/blob/master/runner/startup.sh#L152. This immediately breaks if the docker.sock has not started within RUNNER_WAIT_FOR_DOCKER_IN_SECONDS.

As an additional measure we have reserved resources for the dind container as well as suggested by rekha

vineeth-asml avatar May 01 '25 08:05 vineeth-asml