
Worker logs show the same jobId for different workflows/jobs, causing runner registration to fail and the job to become orphaned

Open NorseGaud opened this issue 1 year ago • 3 comments

Describe the bug

We install the latest runner into a macOS VM and register it at the org level so that jobs in repos under the org can use it. This works 99% of the time, but occasionally the actions-runner in the VM doesn't fully register and the job runs indefinitely.

The runner logs show:

[2024-12-12 16:30:47Z INFO ProcessInvokerWrapper] Finished process 729 with exit code 102, and elapsed time 00:00:11.4918642.
[2024-12-12 16:30:47Z INFO JobDispatcher] Worker finished for job a4127350-4039-54ee-311c-0227a78ac3cb. Code: 102
[2024-12-12 16:30:47Z INFO JobDispatcher] finish job request for job a4127350-4039-54ee-311c-0227a78ac3cb with result: Failed
[2024-12-12 16:30:47Z INFO Terminal] WRITE LINE: 2024-12-12 16:30:47Z: Job testJob completed with result: Failed
[2024-12-12 16:30:47Z INFO JobDispatcher] Stop renew job request for job a4127350-4039-54ee-311c-0227a78ac3cb.
[2024-12-12 16:30:47Z INFO JobDispatcher] job renew has been cancelled, stop renew job request 7662.
[2024-12-12 16:30:47Z INFO JobNotification] Entering JobCompleted Notification
[2024-12-12 16:30:47Z INFO JobNotification] Entering EndMonitor
[2024-12-12 16:30:47Z INFO MessageListener] Received job status event. JobState: Online
[2024-12-12 16:30:47Z INFO JobDispatcher] Fire signal for one time used runner.
[2024-12-12 16:30:47Z INFO Runner] Job has finished at backend, the runner will exit since it is running under onetime use mode.
[2024-12-12 16:30:47Z INFO Runner] Stop message queue looping.
[2024-12-12 16:30:47Z WARN GitHubActionsService] GET request to https://broker.actions.githubusercontent.com/message?sessionId=90e53455-5052-4e98-8c94-92d824f3f5f0&status=Busy&runnerVersion=2.321.0&os=macOS&architecture=ARM64&disableUpdate=true has been cancelled.
[2024-12-12 16:30:47Z ERR  BrokerServer] Catch exception during request
[2024-12-12 16:30:47Z ERR  BrokerServer] System.Threading.Tasks.TaskCanceledException: The operation was canceled.
 ---> System.Threading.Tasks.TaskCanceledException: The operation was canceled.
 ---> System.IO.IOException: Unable to read data from the transport connection: Operation canceled.
 ---> System.Net.Sockets.SocketException (89): Operation canceled
   --- End of inner exception stack trace ---
   at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.ThrowException(SocketError error, CancellationToken cancellationToken)
   at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.System.Threading.Tasks.Sources.IValueTaskSource<System.Int32>.GetResult(Int16 token)
   at System.Net.Security.SslStream.EnsureFullTlsFrameAsync[TIOAdapter](CancellationToken cancellationToken, Int32 estimatedSize)
   at System.Runtime.CompilerServices.PoolingAsyncValueTaskMethodBuilder`1.StateMachineBox`1.System.Threading.Tasks.Sources.IValueTaskSource<TResult>.GetResult(Int16 token)
   at System.Net.Security.SslStream.ReadAsyncInternal[TIOAdapter](Memory`1 buffer, CancellationToken cancellationToken)
   at System.Runtime.CompilerServices.PoolingAsyncValueTaskMethodBuilder`1.StateMachineBox`1.System.Threading.Tasks.Sources.IValueTaskSource<TResult>.GetResult(Int16 token)
   at System.Net.Http.HttpConnection.SendAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
   --- End of inner exception stack trace ---
   at System.Net.Http.HttpConnection.SendAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
   at System.Net.Http.AuthenticationHelper.SendWithNtAuthAsync(HttpRequestMessage request, Uri authUri, Boolean async, ICredentials credentials, Boolean isProxyAuth, HttpConnection connection, HttpConnectionPool connectionPool, CancellationToken cancellationToken)
   at System.Net.Http.HttpConnectionPool.SendWithVersionDetectionAndRetryAsync(HttpRequestMessage request, Boolean async, Boolean doRequestAuth, CancellationToken cancellationToken)
   at System.Net.Http.AuthenticationHelper.SendWithAuthAsync(HttpRequestMessage request, Uri authUri, Boolean async, ICredentials credentials, Boolean preAuthenticate, Boolean isProxyAuth, Boolean doRequestAuth, HttpConnectionPool pool, CancellationToken cancellationToken)
   at System.Net.Http.RedirectHandler.SendAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
   at GitHub.Services.Common.RawHttpMessageHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
   at GitHub.Services.Common.VssHttpRetryMessageHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
   at System.Net.Http.HttpClient.<SendAsync>g__Core|83_0(HttpRequestMessage request, HttpCompletionOption completionOption, CancellationTokenSource cts, Boolean disposeCts, CancellationTokenSource pendingRequestsCts, CancellationToken originalCancellationToken)
   --- End of inner exception stack trace ---
   at System.Net.Http.HttpClient.HandleFailure(Exception e, Boolean telemetryStarted, HttpResponseMessage response, CancellationTokenSource cts, CancellationToken cancellationToken, CancellationTokenSource pendingRequestsCts)
   at System.Net.Http.HttpClient.<SendAsync>g__Core|83_0(HttpRequestMessage request, HttpCompletionOption completionOption, CancellationTokenSource cts, Boolean disposeCts, CancellationTokenSource pendingRequestsCts, CancellationToken originalCancellationToken)
   at Sdk.WebApi.WebApi.RawHttpClientBase.SendAsync(HttpRequestMessage message, HttpCompletionOption completionOption, Object userState, CancellationToken cancellationToken)
   at Sdk.WebApi.WebApi.RawHttpClientBase.SendAsync[T](HttpRequestMessage message, Boolean readErrorBody, Object userState, CancellationToken cancellationToken)
   at Sdk.WebApi.WebApi.RawHttpClientBase.SendAsync[T](HttpMethod method, IEnumerable`1 additionalHeaders, Uri requestUri, HttpContent content, IEnumerable`1 queryParameters, Boolean readErrorBody, Object userState, CancellationToken cancellationToken)
   at GitHub.Actions.RunService.WebApi.BrokerHttpClient.GetRunnerMessageAsync(Nullable`1 sessionId, String runnerVersion, Nullable`1 status, String os, String architecture, Nullable`1 disableUpdate, CancellationToken cancellationToken)
   at GitHub.Runner.Common.BrokerServer.<>c__DisplayClass7_0.<<GetRunnerMessageAsync>b__0>d.MoveNext()
--- End of stack trace from previous location ---
   at GitHub.Runner.Common.RunnerService.RetryRequest[T](Func`1 func, CancellationToken cancellationToken, Int32 maxRetryAttemptsCount, Func`2 shouldRetry)

Per https://github.com/orgs/community/discussions/146348#discussioncomment-11546552, I notice that the actions-runner logs, even for unrelated jobs, all show the same jobId:

Every 2.0s: grep -r '"jobId":' /tmp/anklet-vm-*                                        Nathans-MacBook-Pro.local: Thu Dec 12 10:34:08 2024

/tmp/anklet-vm-314c9bc4-3a12-479a-8945-fc5750c7574a/Worker_20241212-163036-utc.log:  "jobId": "a4127350-4039-54ee-311c-0227a78ac3cb",
/tmp/anklet-vm-46d212e2-48c1-4c16-9bd3-24af3c0420d4/Worker_20241212-163225-utc.log:  "jobId": "b5f34950-b701-5746-fee6-8f2a0b68b3f1",
/tmp/anklet-vm-9e1ab1c2-3601-4ca8-ab5e-77f3fd2350f6/Worker_20241212-163314-utc.log:  "jobId": "a4127350-4039-54ee-311c-0227a78ac3cb",
/tmp/anklet-vm-ad54e62a-054f-484f-8053-1c15a7d7a4c8/Worker_20241212-163344-utc.log:  "jobId": "a4127350-4039-54ee-311c-0227a78ac3cb",
/tmp/anklet-vm-cc309b23-4f58-49b4-9406-b429a43b6a84/Worker_20241212-163041-utc.log:  "jobId": "a0de1d4e-2247-5d6c-d7e9-3e3aaa8d4597",
/tmp/anklet-vm-e7b1e26a-947b-469a-a305-934aa0cb5a85/Worker_20241212-163149-utc.log:  "jobId": "23d0dcb9-468e-5c44-e20e-7d90f1a0de55",

It looks like the jobId is unique for a while, then starts being reused. You can see a4127350-4039-54ee-311c-0227a78ac3cb appear three times above.
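To spot collisions without eyeballing the grep output, a small helper can extract every jobId and print only the duplicates. This is a sketch, not part of the report: the log paths and JSON shape match the watch output above, but the function name is mine.

```shell
# find_duplicate_job_ids LOGFILE...  — prints each jobId that appears in more
# than one Worker log, once per duplicated id.
find_duplicate_job_ids() {
  # -h suppresses filenames; sed pulls the quoted jobId value out of the line.
  grep -h '"jobId":' "$@" 2>/dev/null \
    | sed -E 's/.*"jobId": *"([^"]+)".*/\1/' \
    | sort | uniq -d
}

# Against the paths from this report:
# find_duplicate_job_ids /tmp/anklet-vm-*/Worker_*.log
```

With the logs above, this would print a4127350-4039-54ee-311c-0227a78ac3cb and nothing else.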

I configure the runners inside the VMs as follows. The token is unique for each runner, as are the name, labels, etc.

./config.sh --name anklet-vm-314c9bc4-3a12-479a-8945-fc5750c7574a --token XXXXX --url https://github.com/veertuinc --no-default-labels --labels anka-template:d792c6f6-198c-470f-9526-9c998efe7ab4,anka-template-tag:vanilla+port-forward-22+brew-git,unique-id:12300647430-2 --disableupdate --unattended --ephemeral
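For reference, the register-runner.bash invoked inside the VM (see the `anka run` command below) is not included in this issue; a wrapper in that spirit would look roughly like the following. This is a hypothetical sketch: the argument order mirrors the `anka run` invocation, and `RUNNER_DIR` is my placeholder for wherever the runner tarball is unpacked.

```shell
# register_and_run NAME TOKEN URL LABELS
# Hypothetical sketch of a register-runner.bash-style wrapper; the real script
# is not shown in the issue.
register_and_run() {
  local name="$1" token="$2" url="$3" labels="$4"
  cd "${RUNNER_DIR:-$HOME/actions-runner}"

  ./config.sh \
    --name "$name" \
    --token "$token" \
    --url "$url" \
    --no-default-labels \
    --labels "$labels" \
    --disableupdate \
    --unattended \
    --ephemeral

  # --ephemeral means run.sh exits after a single job and the runner
  # deregisters itself.
  ./run.sh
}
```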

Here are the two jobs that used the same jobId:

THE JOB THAT NEVER GOT REGISTERED
{
  "time": "2024-12-12T10:30:15.555763-06:00",
  "level": "DEBUG",
  "msg": "executing",
  "command": "anka -j run anklet-vm-314c9bc4-3a12-479a-8945-fc5750c7574a bash -c ./register-runner.bash anklet-vm-314c9bc4-3a12-479a-8945-fc5750c7574a BJ7F3UI2A5Z3EWK5B3R... https://github.com/veertuinc anka-template:d792c6f6-198c-470f-9526-9c998efe7ab4,anka-template-tag:vanilla+port-forward-22+brew-git,unique-id:12300647430-2 ",
  "ankletVersion": "dev",
  "pluginName": "RUNNER1",
  "plugin": "github",
  "owner": "veertuinc",
  "repo": "anklet",
  "workflowJobID": 34329323592,
  "workflowJobName": "t1-with-tag-1-matrix-nodes-2 (2)",
  "workflowJobRunID": 12300647430,
  "workflowName": "t1-with-tag-1-matrix-nodes-2",
  "jobURL": "https://github.com/veertuinc/anklet/actions/runs/12300647430/job/34329323592",
  "ankaTemplate": "d792c6f6-198c-470f-9526-9c998efe7ab4",
  "ankaTemplateTag": "vanilla+port-forward-22+brew-git",
  "vmName": "anklet-vm-314c9bc4-3a12-479a-8945-fc5750c7574a"
}
THE NEXT REGISTRATION THAT HAD THE SAME jobId (and worked)
{
  "time": "2024-12-12T10:30:20.453382-06:00",
  "level": "DEBUG",
  "msg": "executing",
  "command": "anka -j run anklet-vm-cc309b23-4f58-49b4-9406-b429a43b6a84 bash -c ./register-runner.bash anklet-vm-cc309b23-4f58-49b4-9406-b429a43b6a84 BJ7F3UMDJHL2MEQV6... https://github.com/veertuinc anka-template:d792c6f6-198c-470f-9526-9c998efe7ab4,anka-template-tag:vanilla+port-forward-22+brew-git ",
  "ankletVersion": "dev",
  "pluginName": "RUNNER2",
  "plugin": "github",
  "owner": "veertuinc",
  "repo": "anklet",
  "workflowJobID": 34329321785,
  "workflowJobName": "testJob",
  "workflowJobRunID": 12300647002,
  "workflowName": "t1-failure-tag-1-in-vm",
  "jobURL": "https://github.com/veertuinc/anklet/actions/runs/12300647002/job/34329321785",
  "ankaTemplate": "d792c6f6-198c-470f-9526-9c998efe7ab4",
  "ankaTemplateTag": "vanilla+port-forward-22+brew-git",
  "vmName": "anklet-vm-cc309b23-4f58-49b4-9406-b429a43b6a84"
}

Here are the _diags for those runners

anklet-vm-cc309b23-4f58-49b4-9406-b429a43b6a84.zip anklet-vm-314c9bc4-3a12-479a-8945-fc5750c7574a.zip

I'm really unsure where to go with this one. Why is the jobId reused? Is this something I can control, or is it a bug in the actions-runner?

To Reproduce

Run a large number of jobs in the same repo with an ephemeral self-hosted runner that is registered after a queued webhook is received.
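The burst of concurrent registrations can be scripted; here is a sketch of a dispatcher that fires N runs at once. The `gh workflow run` usage and the idea of a `workflow_dispatch`-enabled workflow are my assumptions for generating load, not something the issue specifies.

```shell
# dispatch_many COUNT CMD [ARGS...]
# Run CMD COUNT times concurrently, e.g. to queue many workflow runs at once
# so ephemeral runners register in a burst.
dispatch_many() {
  local runs="$1"; shift
  for _ in $(seq 1 "$runs"); do
    "$@" &
  done
  wait
}

# Assumed usage: gh CLI authenticated, test.yml has a workflow_dispatch trigger.
# dispatch_many 50 gh workflow run test.yml --repo veertuinc/anklet
```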

Runner Version and Platform

v2.321.0 macOS

NorseGaud avatar Dec 12 '24 16:12 NorseGaud

bump

NorseGaud avatar Dec 18 '24 16:12 NorseGaud

bump

NorseGaud avatar Jan 06 '25 14:01 NorseGaud

bummmpppp

NorseGaud avatar Feb 28 '25 02:02 NorseGaud

Any update on this?

dgteixeira avatar Jun 12 '25 12:06 dgteixeira