Worker logs show the same jobId for different workflows/jobs, causing runner registration to fail and the job to become orphaned
Describe the bug
We install the latest runner into a macOS VM and register it at the org level so that jobs in any repo under the org can use it. This works 99% of the time, but occasionally the actions-runner inside the VM doesn't register fully and the job runs indefinitely.
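For context, each VM gets its own freshly minted org-level registration token before the runner is configured inside it. Something along these lines (a hedged example using the gh CLI; the exact mechanism anklet uses to mint the token is an assumption on my part):

# Mint a new org-level registration token for each VM (assumed mechanism;
# anklet may call the REST API directly instead of using gh)
gh api -X POST orgs/veertuinc/actions/runners/registration-token --jq .token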
The runner logs show:
[2024-12-12 16:30:47Z INFO ProcessInvokerWrapper] Finished process 729 with exit code 102, and elapsed time 00:00:11.4918642.
[2024-12-12 16:30:47Z INFO JobDispatcher] Worker finished for job a4127350-4039-54ee-311c-0227a78ac3cb. Code: 102
[2024-12-12 16:30:47Z INFO JobDispatcher] finish job request for job a4127350-4039-54ee-311c-0227a78ac3cb with result: Failed
[2024-12-12 16:30:47Z INFO Terminal] WRITE LINE: 2024-12-12 16:30:47Z: Job testJob completed with result: Failed
[2024-12-12 16:30:47Z INFO JobDispatcher] Stop renew job request for job a4127350-4039-54ee-311c-0227a78ac3cb.
[2024-12-12 16:30:47Z INFO JobDispatcher] job renew has been cancelled, stop renew job request 7662.
[2024-12-12 16:30:47Z INFO JobNotification] Entering JobCompleted Notification
[2024-12-12 16:30:47Z INFO JobNotification] Entering EndMonitor
[2024-12-12 16:30:47Z INFO MessageListener] Received job status event. JobState: Online
[2024-12-12 16:30:47Z INFO JobDispatcher] Fire signal for one time used runner.
[2024-12-12 16:30:47Z INFO Runner] Job has finished at backend, the runner will exit since it is running under onetime use mode.
[2024-12-12 16:30:47Z INFO Runner] Stop message queue looping.
[2024-12-12 16:30:47Z WARN GitHubActionsService] GET request to https://broker.actions.githubusercontent.com/message?sessionId=90e53455-5052-4e98-8c94-92d824f3f5f0&status=Busy&runnerVersion=2.321.0&os=macOS&architecture=ARM64&disableUpdate=true has been cancelled.
[2024-12-12 16:30:47Z ERR BrokerServer] Catch exception during request
[2024-12-12 16:30:47Z ERR BrokerServer] System.Threading.Tasks.TaskCanceledException: The operation was canceled.
---> System.Threading.Tasks.TaskCanceledException: The operation was canceled.
---> System.IO.IOException: Unable to read data from the transport connection: Operation canceled.
---> System.Net.Sockets.SocketException (89): Operation canceled
--- End of inner exception stack trace ---
at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.ThrowException(SocketError error, CancellationToken cancellationToken)
at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.System.Threading.Tasks.Sources.IValueTaskSource<System.Int32>.GetResult(Int16 token)
at System.Net.Security.SslStream.EnsureFullTlsFrameAsync[TIOAdapter](CancellationToken cancellationToken, Int32 estimatedSize)
at System.Runtime.CompilerServices.PoolingAsyncValueTaskMethodBuilder`1.StateMachineBox`1.System.Threading.Tasks.Sources.IValueTaskSource<TResult>.GetResult(Int16 token)
at System.Net.Security.SslStream.ReadAsyncInternal[TIOAdapter](Memory`1 buffer, CancellationToken cancellationToken)
at System.Runtime.CompilerServices.PoolingAsyncValueTaskMethodBuilder`1.StateMachineBox`1.System.Threading.Tasks.Sources.IValueTaskSource<TResult>.GetResult(Int16 token)
at System.Net.Http.HttpConnection.SendAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
--- End of inner exception stack trace ---
at System.Net.Http.HttpConnection.SendAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
at System.Net.Http.AuthenticationHelper.SendWithNtAuthAsync(HttpRequestMessage request, Uri authUri, Boolean async, ICredentials credentials, Boolean isProxyAuth, HttpConnection connection, HttpConnectionPool connectionPool, CancellationToken cancellationToken)
at System.Net.Http.HttpConnectionPool.SendWithVersionDetectionAndRetryAsync(HttpRequestMessage request, Boolean async, Boolean doRequestAuth, CancellationToken cancellationToken)
at System.Net.Http.AuthenticationHelper.SendWithAuthAsync(HttpRequestMessage request, Uri authUri, Boolean async, ICredentials credentials, Boolean preAuthenticate, Boolean isProxyAuth, Boolean doRequestAuth, HttpConnectionPool pool, CancellationToken cancellationToken)
at System.Net.Http.RedirectHandler.SendAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
at GitHub.Services.Common.RawHttpMessageHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
at GitHub.Services.Common.VssHttpRetryMessageHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
at System.Net.Http.HttpClient.<SendAsync>g__Core|83_0(HttpRequestMessage request, HttpCompletionOption completionOption, CancellationTokenSource cts, Boolean disposeCts, CancellationTokenSource pendingRequestsCts, CancellationToken originalCancellationToken)
--- End of inner exception stack trace ---
at System.Net.Http.HttpClient.HandleFailure(Exception e, Boolean telemetryStarted, HttpResponseMessage response, CancellationTokenSource cts, CancellationToken cancellationToken, CancellationTokenSource pendingRequestsCts)
at System.Net.Http.HttpClient.<SendAsync>g__Core|83_0(HttpRequestMessage request, HttpCompletionOption completionOption, CancellationTokenSource cts, Boolean disposeCts, CancellationTokenSource pendingRequestsCts, CancellationToken originalCancellationToken)
at Sdk.WebApi.WebApi.RawHttpClientBase.SendAsync(HttpRequestMessage message, HttpCompletionOption completionOption, Object userState, CancellationToken cancellationToken)
at Sdk.WebApi.WebApi.RawHttpClientBase.SendAsync[T](HttpRequestMessage message, Boolean readErrorBody, Object userState, CancellationToken cancellationToken)
at Sdk.WebApi.WebApi.RawHttpClientBase.SendAsync[T](HttpMethod method, IEnumerable`1 additionalHeaders, Uri requestUri, HttpContent content, IEnumerable`1 queryParameters, Boolean readErrorBody, Object userState, CancellationToken cancellationToken)
at GitHub.Actions.RunService.WebApi.BrokerHttpClient.GetRunnerMessageAsync(Nullable`1 sessionId, String runnerVersion, Nullable`1 status, String os, String architecture, Nullable`1 disableUpdate, CancellationToken cancellationToken)
at GitHub.Runner.Common.BrokerServer.<>c__DisplayClass7_0.<<GetRunnerMessageAsync>b__0>d.MoveNext()
--- End of stack trace from previous location ---
at GitHub.Runner.Common.RunnerService.RetryRequest[T](Func`1 func, CancellationToken cancellationToken, Int32 maxRetryAttemptsCount, Func`2 shouldRetry)
Per https://github.com/orgs/community/discussions/146348#discussioncomment-11546552, I noticed that the actions-runner Worker logs, even for unrelated jobs, all show the same jobId:
Every 2.0s: grep -r '"jobId":' /tmp/anklet-vm-* Nathans-MacBook-Pro.local: Thu Dec 12 10:34:08 2024
/tmp/anklet-vm-314c9bc4-3a12-479a-8945-fc5750c7574a/Worker_20241212-163036-utc.log: "jobId": "a4127350-4039-54ee-311c-0227a78ac3cb",
/tmp/anklet-vm-46d212e2-48c1-4c16-9bd3-24af3c0420d4/Worker_20241212-163225-utc.log: "jobId": "b5f34950-b701-5746-fee6-8f2a0b68b3f1",
/tmp/anklet-vm-9e1ab1c2-3601-4ca8-ab5e-77f3fd2350f6/Worker_20241212-163314-utc.log: "jobId": "a4127350-4039-54ee-311c-0227a78ac3cb",
/tmp/anklet-vm-ad54e62a-054f-484f-8053-1c15a7d7a4c8/Worker_20241212-163344-utc.log: "jobId": "a4127350-4039-54ee-311c-0227a78ac3cb",
/tmp/anklet-vm-cc309b23-4f58-49b4-9406-b429a43b6a84/Worker_20241212-163041-utc.log: "jobId": "a0de1d4e-2247-5d6c-d7e9-3e3aaa8d4597",
/tmp/anklet-vm-e7b1e26a-947b-469a-a305-934aa0cb5a85/Worker_20241212-163149-utc.log: "jobId": "23d0dcb9-468e-5c44-e20e-7d90f1a0de55",
It looks like the jobId is unique for a while, then starts being reused. You can see a4127350-4039-54ee-311c-0227a78ac3cb is reused here.
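To quantify the reuse quickly (same log layout as the watch output above), this one-liner counts how often each jobId appears across the Worker logs; any count above 1 means a jobId was handed to more than one VM, assuming one job message per Worker log:

# Count duplicate jobId lines across all Worker logs
grep -h '"jobId":' /tmp/anklet-vm-*/Worker_*.log | sort | uniq -c | sort -rn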
The way I configure the runners inside the VMs is as follows. The token is unique for each runner, as are the name, labels, etc.
./config.sh --name anklet-vm-314c9bc4-3a12-479a-8945-fc5750c7574a --token XXXXX --url https://github.com/veertuinc --no-default-labels --labels anka-template:d792c6f6-198c-470f-9526-9c998efe7ab4,anka-template-tag:vanilla+port-forward-22+brew-git,unique-id:12300647430-2 --disableupdate --unattended --ephemeral
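For completeness, register-runner.bash inside the VM does roughly the following. This is a minimal sketch, not the exact script: the positional arguments are inferred from the anka run command logged below, and the ~/actions-runner path plus the final ./run.sh are assumptions.

#!/usr/bin/env bash
# Sketch of register-runner.bash (approximate). Positional args: name, token, url, labels.
set -eo pipefail
RUNNER_NAME="$1"
RUNNER_TOKEN="$2"
RUNNER_URL="$3"
RUNNER_LABELS="$4"
cd ~/actions-runner   # install location is an assumption
./config.sh \
  --name "${RUNNER_NAME}" \
  --token "${RUNNER_TOKEN}" \
  --url "${RUNNER_URL}" \
  --no-default-labels \
  --labels "${RUNNER_LABELS}" \
  --disableupdate \
  --unattended \
  --ephemeral
./run.sh              # pick up exactly one job, then exit (ephemeral)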
Here are the two jobs that ended up with the same jobId:
THE JOB THAT NEVER GOT REGISTERED
{
"time": "2024-12-12T10:30:15.555763-06:00",
"level": "DEBUG",
"msg": "executing",
"command": "anka -j run anklet-vm-314c9bc4-3a12-479a-8945-fc5750c7574a bash -c ./register-runner.bash anklet-vm-314c9bc4-3a12-479a-8945-fc5750c7574a BJ7F3UI2A5Z3EWK5B3R... https://github.com/veertuinc anka-template:d792c6f6-198c-470f-9526-9c998efe7ab4,anka-template-tag:vanilla+port-forward-22+brew-git,unique-id:12300647430-2 ",
"ankletVersion": "dev",
"pluginName": "RUNNER1",
"plugin": "github",
"owner": "veertuinc",
"repo": "anklet",
"workflowJobID": 34329323592,
"workflowJobName": "t1-with-tag-1-matrix-nodes-2 (2)",
"workflowJobRunID": 12300647430,
"workflowName": "t1-with-tag-1-matrix-nodes-2",
"jobURL": "https://github.com/veertuinc/anklet/actions/runs/12300647430/job/34329323592",
"ankaTemplate": "d792c6f6-198c-470f-9526-9c998efe7ab4",
"ankaTemplateTag": "vanilla+port-forward-22+brew-git",
"vmName": "anklet-vm-314c9bc4-3a12-479a-8945-fc5750c7574a"
}
THE NEXT REGISTRATION THAT HAD THE SAME jobId (and worked)
{
"time": "2024-12-12T10:30:20.453382-06:00",
"level": "DEBUG",
"msg": "executing",
"command": "anka -j run anklet-vm-cc309b23-4f58-49b4-9406-b429a43b6a84 bash -c ./register-runner.bash anklet-vm-cc309b23-4f58-49b4-9406-b429a43b6a84 BJ7F3UMDJHL2MEQV6... https://github.com/veertuinc anka-template:d792c6f6-198c-470f-9526-9c998efe7ab4,anka-template-tag:vanilla+port-forward-22+brew-git ",
"ankletVersion": "dev",
"pluginName": "RUNNER2",
"plugin": "github",
"owner": "veertuinc",
"repo": "anklet",
"workflowJobID": 34329321785,
"workflowJobName": "testJob",
"workflowJobRunID": 12300647002,
"workflowName": "t1-failure-tag-1-in-vm",
"jobURL": "https://github.com/veertuinc/anklet/actions/runs/12300647002/job/34329321785",
"ankaTemplate": "d792c6f6-198c-470f-9526-9c998efe7ab4",
"ankaTemplateTag": "vanilla+port-forward-22+brew-git",
"vmName": "anklet-vm-cc309b23-4f58-49b4-9406-b429a43b6a84"
}
Here are the _diag logs for those runners:
anklet-vm-cc309b23-4f58-49b4-9406-b429a43b6a84.zip anklet-vm-314c9bc4-3a12-479a-8945-fc5750c7574a.zip
I'm really unsure where to go with this one. Why is the same jobId reused? Is this something I can control, or is it a bug in the actions-runner?
To Reproduce
Run a large number of jobs in the same repo with ephemeral self-hosted runners that are registered only after a queued workflow_job webhook is received.
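As a concrete (hypothetical) way to generate that load, something like the following fires a burst of runs so that several ephemeral runners register at roughly the same time; the workflow file name is an assumption based on the workflow names in the logs above.

# Queue a burst of workflow runs (workflow file name assumed)
for i in $(seq 1 20); do
  gh workflow run t1-with-tag-1-matrix-nodes-2.yml --repo veertuinc/anklet
done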
Runner Version and Platform
v2.321.0 macOS