Scale Set Listener Stops Responding
Checks
- [X] I've already read https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners-with-actions-runner-controller/troubleshooting-actions-runner-controller-errors and I'm sure my issue is not covered in the troubleshooting guide.
- [X] I am using charts that are officially provided
Controller Version
0.7.0
Deployment Method
Helm
Checks
- [X] This isn't a question or user support case (For Q&A and community support, go to Discussions).
- [X] I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes
To Reproduce
1. Install both the ARC controller and scale set Helm charts into a Kubernetes cluster (a hypothetical values sketch is shown below).
2. Run GitHub Actions workflows/jobs normally with a self-hosted runner.
Beyond this we do not have specific reproduction steps, and we would very much appreciate suggestions on how to gather more information or reliably trigger this bug.
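For context, a minimal, hypothetical values sketch for step 1. The value names (`githubConfigUrl`, `githubConfigSecret`, `runnerGroup`, `minRunners`, `maxRunners`) come from the gha-runner-scale-set chart; the org URL, secret name, and limits are placeholders, not our real configuration:

```yaml
# values.yaml passed to helm install of the gha-runner-scale-set chart (sketch only)
githubConfigUrl: "https://github.com/bigcorp"   # organization the runners register against
githubConfigSecret: arc-github-app-secret       # Kubernetes Secret holding the GitHub App credentials
runnerGroup: default
minRunners: 0
maxRunners: 10
```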
Describe the bug
Symptoms:
- Developers reported that jobs were pending for over 20 minutes (in the GitHub workflow job view)
- Restarting the controller had no effect
- Deleting the listener works: once a new listener pod starts, new runner pods start as well. (Alternatively, we have since discovered that after about an hour the listener begins responding again and the logs show "refreshing token".)
Diagnostics:
- The listener stops logging
- The only signal we have found to alert on, using the metrics exposed by the listener, is an expression like `(gha_assigned_jobs > 0) and (rate(gha_assigned_jobs[10m]) == 0)` (see the rule sketch below).
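For reference, a minimal Prometheus alerting-rule sketch built around that expression; the rule name, `for` duration, labels, and annotation are our own choices, not anything shipped with ARC:

```yaml
groups:
  - name: arc-listener
    rules:
      - alert: ArcListenerStalled
        # Fires when jobs are assigned but gha_assigned_jobs has been flat for 10 minutes
        # (expression taken verbatim from the diagnostics above).
        expr: (gha_assigned_jobs > 0) and (rate(gha_assigned_jobs[10m]) == 0)
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "ARC scale set listener appears to have stopped processing assigned jobs"
```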
Another symptom I've noticed: when the controller pod is updated (for instance, from 0.7.0 to 0.8.0), two inconsistencies occur:
- If there are old job pods running, they block the new listener from starting. I need to delete those pods and do a rollout restart of the controller to get the listener going.
- Simply updating the deployment doesn't consistently start the listener either; there is always a chance that a rollout restart is needed to get it going.
A separate instance where we observed the issue
We were running listener gha-runner-scale-set-controller:0.7.0. No logs were being generated, and we had found a workflow job in a pending state (from the listener logs). Ten runners were online, but the runner group showed as offline in our org's Actions runner group settings (a URL like https://github.com/organizations/bigcorp/settings/actions/runner-groups/9).
We verified that the listener metrics had flatlined.
Then the queued job we were watching was assigned without any intervention. Below is the section of logs where the listener stopped and then started working again on its own.
2023-12-28T17:10:14Z INFO service process batched runner scale set job messages. {"messageId": 13462, "batchSize": 4}
2023-12-28T17:10:14Z INFO service job started message received. {"RequestId": 1026947, "RunnerId": 1062477}
2023-12-28T17:10:14Z INFO service update job info for runner {"runnerName": "self-hosted-pqbsj-runner-dkmf2", "ownerName": "bigcorp", "repoName": "cc-api-orc-kt", "workflowRef": "bigcorp/github-actions-shared/.github/workflows/sql-migration-lint.yaml@refs/heads/v1", "workflowRunId": 7349734264, "jobDisplayName": "build / sql-lint / lint-migrations", "requestId": 1026947}
2023-12-28T17:10:14Z INFO KubernetesManager Created merge patch json for EphemeralRunner status update {"json": "{\"status\":{\"jobDisplayName\":\"build / sql-lint / lint-migrations\",\"jobRepositoryName\":\"bigcorp/cc-api-orc-kt\",\"jobRequestId\":1026947,\"jobWorkflowRef\":\"bigcorp/github-actions-shared/.github/workflows/sql-migration-lint.yaml@refs/heads/v1\",\"workflowRunId\":7349734264}}"}
2023-12-28T17:10:14Z INFO service job started message received. {"RequestId": 1026948, "RunnerId": 1062478}
2023-12-28T17:10:14Z INFO service update job info for runner {"runnerName": "self-hosted-pqbsj-runner-gl8pq", "ownerName": "bigcorp", "repoName": "cc-api-orc-kt", "workflowRef": "bigcorp/github-actions-shared/.github/workflows/gradle.yaml@refs/heads/v1", "workflowRunId": 7349734264, "jobDisplayName": "build / build / code build", "requestId": 1026948}
2023-12-28T17:10:14Z INFO KubernetesManager Created merge patch json for EphemeralRunner status update {"json": "{\"status\":{\"jobDisplayName\":\"build / build / code build\",\"jobRepositoryName\":\"bigcorp/cc-api-orc-kt\",\"jobRequestId\":1026948,\"jobWorkflowRef\":\"bigcorp/github-actions-shared/.github/workflows/gradle.yaml@refs/heads/v1\",\"workflowRunId\":7349734264}}"}
2023-12-28T17:10:14Z INFO service job assigned message received. {"RequestId": 1026947}
2023-12-28T17:10:14Z INFO service job assigned message received. {"RequestId": 1026948}
2023-12-28T17:10:14Z INFO auto_scaler acquiring jobs. {"request count": 0, "requestIds": "[]"}
2023-12-28T17:10:15Z INFO auto_scaler deleted message. {"messageId": 13462}
2023-12-28T17:10:15Z INFO service waiting for message...
2023-12-28T18:10:15Z INFO refreshing_client message queue token is expired during GetNextMessage, refreshing...
2023-12-28T18:10:15Z INFO refreshing token {"githubConfigUrl": "https://github.com/bigcorp"}
2023-12-28T18:10:15Z INFO getting access token for GitHub App auth {"accessTokenURL": "https://api.github.com/app/installations/43625644/access_tokens"}
2023-12-28T18:10:15Z INFO getting runner registration token {"registrationTokenURL": "https://api.github.com/orgs/bigcorp/actions/runners/registration-token"}
2023-12-28T18:10:15Z INFO getting Actions tenant URL and JWT {"registrationURL": "https://api.github.com/actions/runner-registration"}
2023-12-28T18:10:16Z INFO service process message. {"messageId": 13463, "messageType": "RunnerScaleSetJobMessages"}
2023-12-28T18:10:16Z INFO service current runner scale set statistics. {"available jobs": 26, "acquired jobs": 0, "assigned jobs": 0, "running jobs": 0, "registered runners": 10, "busy runners": 0, "idle runners": 10}
2023-12-28T18:10:16Z INFO service process batched runner scale set job messages. {"messageId": 13463, "batchSize": 1}
2023-12-28T18:10:16Z INFO service job completed message received. {"RequestId": 1026947, "Result": "succeeded", "RunnerId": 1062477, "RunnerName": "self-hosted-pqbsj-runner-dkmf2"}
2023-12-28T18:10:16Z INFO auto_scaler acquiring jobs. {"request count": 0, "requestIds": "[]"}
Describe the expected behavior
Ideally the listener would never stop responding.
Additional Context
The only additional thing we tried was using the opt-out button under our GitHub App's advanced features. This was something of a Hail Mary, since we saw the logs related to refreshing the token. It seems to have helped, but that may just be a fluke.
Controller Logs
Here are the logs from 10 minutes before through 10 minutes after the listener stops responding.
https://gist.github.com/jameshounshell/597358f9b0b624d1f80f98057ddddcf0
Runner Pod Logs
N/A
We've had a pretty huge break in the case.
It appears that the listener's unresponsiveness is directly linked to when the token is refreshed.
In the attached image, the yellow highlights mark every token refresh (the "refreshing_client" log message). The refresh appears to be reactive, happening in response to a failed HTTP call roughly every 60-65 minutes (which lines up with the 1-hour maximum token lifetime for a GitHub App). But before the outage there is no token refresh; then, an hour later, the token refreshes, the logs resume, and jobs start running again.
Hey @jameshounshell,
Is it possible that network interruptions are causing problems? In version 0.8.0, we added client timeouts to the retryable client, which could help determine whether there are any network issues. Just keep in mind that metrics are missing from the newest listener app (they will be part of 0.8.2; to enable metrics in the meantime, you can use the old listener app as described here).
We can sit tight and wait for 0.8.2 and report back on this issue.
I am seeing this exact same problem with our listener pod. I can also confirm that it appears to be related to the token refresh: it's always the last thing logged by the pod before it stops responding.
Hey @jasonluck,
Could you also please post the log so we can inspect it? It is always better to have more evidence of an issue to help us find the root cause
I am going to close this issue until we hear back from you. Please, let us know if the issue is resolved. If not, we can always re-open it :relaxed:
Now that we've updated to the 0.8.2 runner, we are getting a better indication of what happens when the listener pod stops. Instead of simply no longer polling for jobs, the pod now stops in an error state. The controller does not attempt to kill or restart the listener.
Here are the logs I collected this morning:
{"severity":"info","ts":"2024-02-11T02:27:36Z","logger":"listener-app.listener","message":"Getting next message","lastMessageID":557}
{"severity":"info","ts":"2024-02-11T02:27:57Z","logger":"listener-app.listener","message":"Processing message","messageId":558,"messageType":"RunnerScaleSetJobMessages"}
{"severity":"info","ts":"2024-02-11T02:27:57Z","logger":"listener-app.listener","message":"New runner scale set statistics.","statistics":{"totalAvailableJobs":1,"totalAcquiredJobs":1,"totalAssignedJobs":1,"totalRunningJobs":1,"totalRegisteredRunners":11,"totalBusyRunners":1,"totalIdleRunners":10}}
{"severity":"info","ts":"2024-02-11T02:27:57Z","logger":"listener-app.listener","message":"Job available message received","jobId":1132810}
{"severity":"info","ts":"2024-02-11T02:27:57Z","logger":"listener-app.listener","message":"Acquiring jobs","count":1,"requestIds":"[1132810]"}
{"severity":"info","ts":"2024-02-11T02:27:57Z","logger":"listener-app.listener","message":"Message queue token is expired during GetNextMessage, refreshing..."}
{"severity":"info","ts":"2024-02-11T02:27:57Z","logger":"listener-app","message":"refreshing token","githubConfigUrl":"https://github.com/missionlane"}
{"severity":"info","ts":"2024-02-11T02:27:57Z","logger":"listener-app","message":"getting access token for GitHub App auth","accessTokenURL":"https://api.github.com/app/installations/43625644/access_tokens"}
{"severity":"info","ts":"2024-02-11T02:27:57Z","logger":"listener-app","message":"getting runner registration token","registrationTokenURL":"https://api.github.com/orgs/missionlane/actions/runners/registration-token"}
{"severity":"info","ts":"2024-02-11T02:27:57Z","logger":"listener-app","message":"getting Actions tenant URL and JWT","registrationURL":"https://api.github.com/actions/runner-registration"}
{"severity":"info","ts":"2024-02-11T02:27:58Z","logger":"listener-app.listener","message":"Deleting message session"}
2024/02/11 02:27:58 Application returned an error: failed to handle message: failed to acquire jobs: failed to acquire jobs after session refresh: 400 - had issue communicating with Actions backend: Value cannot be null. (Parameter 'source')
@nikola-jokic Please re-open this ticket :)
We are on 0.8.1 (with the LISTENER_ENTRYPOINT workaround) and I think we saw this today: one of our four listeners just stopped responding. The controller finalised and removed its remaining runner pods as they finished their jobs, but didn't produce any errors or otherwise appear to notice. About 45 minutes later a node scale-down evicted the controller, and it all came back to life after that.
We also saw a token refresh just before this happened, listener logs:
2024-02-14T11:55:41Z INFO refreshing_client message queue token is expired during AcquireJobs, refreshing...
2024-02-14T11:55:41Z INFO refreshing token {"githubConfigUrl": "https://github.com/********"}
2024-02-14T11:55:42Z INFO getting access token for GitHub App auth {"accessTokenURL": "https://api.github.com/app/installations/********/access_tokens"}
2024-02-14T11:55:48Z INFO getting runner registration token {"registrationTokenURL": "https://api.github.com/orgs/********/actions/runners/registration-token"}
2024-02-14T11:55:52Z INFO getting Actions tenant URL and JWT {"registrationURL": "https://api.github.com/actions/runner-registration"}
2024-02-14T11:56:02Z INFO auto_scaler acquired jobs. {"requested": 5, "acquired": 5}
2024-02-14T12:41:20Z INFO Starting metrics server {"address": ":8080"}
We have been using the current version for about a month without seeing it until today, but we are on EKS and upgraded from 1.28 to 1.29 yesterday. If it happens again I may downgrade and see if it goes away.
@RalphSleighK This is the behavior we saw with 0.8.1 as well. If left as it is, it seems to recover after about an hour of being brain-dead. Otherwise we have to delete the pod to make it happy again.
@ccureau What version of K8s are you using out of curiosity? We've just upgraded to 1.29, wondering if it's a coincidence that we saw this issue post-upgrade.
@angusjellis We're on 1.27.10 in a GKE cluster. We did an upgrade from 1.26 to 1.27 while this was happening without any change in listener behavior for what it's worth.
We got a new error this morning...unsure what to make of it:
{"severity":"info","ts":"2024-02-14T11:21:43Z","logger":"listener-app.listener","message":"Deleting last message","lastMessageID":1298}
{"severity":"info","ts":"2024-02-14T11:21:43Z","logger":"listener-app.listener","message":"Deleting message session"}
{"severity":"info","ts":"2024-02-14T11:21:43Z","logger":"listener-app","message":"refreshing token","githubConfigUrl":"https://github.com/missionlane"}
{"severity":"info","ts":"2024-02-14T11:21:43Z","logger":"listener-app","message":"getting access token for GitHub App auth","accessTokenURL":"https://api.github.com/app/installations/43625644/access_tokens"}
{"severity":"info","ts":"2024-02-14T11:21:43Z","logger":"listener-app","message":"getting runner registration token","registrationTokenURL":"https://api.github.com/orgs/missionlane/actions/runners/registration-token"}
{"severity":"info","ts":"2024-02-14T11:21:43Z","logger":"listener-app","message":"getting Actions tenant URL and JWT","registrationURL":"https://api.github.com/actions/runner-registration"}
2024/02/14 11:21:43 Application returned an error: failed to handle message: failed to delete message: failed to delete message: {"$id":"1","innerException":null,"message":"The user 'System:PublicAccess;aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa' is not authorized to access this resource.","typeName":"Microsoft.TeamFoundation.Framework.Server.UnauthorizedRequestException, Microsoft.TeamFoundation.Framework.Server","typeKey":"UnauthorizedRequestException","errorCode":0,"eventId":3000}
With ghcr.io/actions/gha-runner-scale-set-controller:0.8.3 we still get:
2024/03/06 13:19:57 Application returned an error: failed to handle message: failed to delete message: failed to delete message: {"$id":"1","innerException":null,"message":"The user 'System:PublicAccess;aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa' is not authorized to access this resource.","typeName":"Microsoft.TeamFoundation.Framework.Server.UnauthorizedRequestException, Microsoft.TeamFoundation.Framework.Server","typeKey":"UnauthorizedRequestException","errorCode":0,"eventId":3000}
once or twice a day (k8s v1.28.5).
Same here with GKE v1.29.1 and gha-runner-scale-set 0.9.0.
Re-opening this one for https://github.com/actions/actions-runner-controller/issues/3204#issuecomment-2051417111
Also seeing this here, with 0.9.1. The listener seems to be running but we don't see any new jobs being created in the cluster. What's frustrating is that the log level is already debug by default, and there's no way to get any further detail from the listener to see what might be happening.
I did take a quick look at the listener queue-handling code, and while I'm not positive I'm correct, or that the possible problem is even related to this, I wonder if we have a goroutine leak here: https://github.com/actions/actions-runner-controller/blob/master/github/actions/client.go#L610
When a queue message is requested and read, if any of the error paths are followed, we defer the closing of `resp.Body`. On the happy path, we pass `resp.Body` to `json.NewDecoder` (which only accepts an `io.Reader`) and never close it. Possibly we tie up all the client connections within leaked goroutines? I might make a PR to address it.
EDIT: I missed the full response body buffering in `client.Do`, so this is not a problem.
Hey @ohookins,
If your runners are taking a long time to be scheduled on nodes, it may be related to this issue: https://github.com/actions/actions-runner-controller/issues/3450
@nikola-jokic yes, that sounds very much like it! Thanks :)
I added `restartPolicy: Always` to `listenerTemplate.spec` in the `AutoscalingRunnerSet`. It seems like a temporary fix that has helped: the listener container now restarts immediately after it exits with code 1 due to the issue described here, rather than waiting to be deleted while the pod sits in an Error state the whole time.
So far, one day has passed, and I am waiting for more statistical data to accumulate.
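For anyone wanting to try the same thing, a minimal sketch of the workaround, assuming the scale set chart's `listenerTemplate` value; this is not an official fix, just the restart-policy override described above:

```yaml
# gha-runner-scale-set values.yaml (sketch of the workaround above)
listenerTemplate:
  spec:
    # Let the kubelet restart the listener container after ghalistener exits with code 1,
    # instead of leaving the pod in an Error state until it is deleted.
    restartPolicy: Always
```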
Hey @verdel,
When you can, please provide the log when the listener restarts. Since this has been an issue with the older version of the listener, it may be related to the actions server. In the 0.9.1 release, we added the trace ID which should help us track down the issue.
I'll send the logs a bit later. We are staying on version 0.9.0 as we are waiting for a fix that will be added in this PR.
Lately, I've been seeing the following messages in the listener logs:
2024-04-27T07:07:53Z INFO listener-app.listener Job started message received. {"RequestId": 653192, "RunnerId": 76946}
2024-04-27T07:07:53Z INFO listener-app.listener Job started message received. {"RequestId": 653194, "RunnerId": 76949}
2024-04-27T07:07:53Z INFO listener-app.listener Job started message received. {"RequestId": 653193, "RunnerId": 76950}
2024-04-27T07:07:53Z INFO listener-app.listener Deleting last message {"lastMessageID": 180}
2024-04-27T07:07:53Z INFO listener-app.listener Deleting message session
2024-04-27T07:07:53Z INFO listener-app refreshing token {"githubConfigUrl": "https://github.com/<org_name>"}
2024-04-27T07:07:53Z INFO listener-app getting access token for GitHub App auth {"accessTokenURL": "https://api.github.com/app/installations/<installation_id>/access_tokens"}
2024-04-27T07:07:53Z INFO listener-app getting runner registration token {"registrationTokenURL": "https://api.github.com/orgs/<org_name>/actions/runners/registration-token"}
2024-04-27T07:07:53Z INFO listener-app getting Actions tenant URL and JWT {"registrationURL": "https://api.github.com/actions/runner-registration"}
2024/04/27 07:07:54 Application returned an error: failed to handle message: failed to delete message: failed to delete message: {"$id":"1","innerException":null,"message":"The user 'System:PublicAccess;aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa' is not authorized to access this resource.","typeName":"Microsoft.TeamFoundation.Framework.Server.UnauthorizedRequestException, Microsoft.TeamFoundation.Framework.Server","typeKey":"UnauthorizedRequestException","errorCode":0,"eventId":3000}
We use runners that are launched rarely (no more than once or twice a day). This message appears during the first launch of the day.
After that, `ghalistener` terminates with exit code 1 and the container transitions to an Error state. Consequently, jobs stop being processed.
Today, after I added a restart policy, there have been no messages so far. I will continue to observe and provide additional information when it becomes available.
Hey everyone, I believe we found the root cause of this issue, and it is now fixed in 0.9.2. This was the fix: https://github.com/actions/actions-runner-controller/pull/3529
I will leave this issue open for now until someone confirms they are not seeing these problems with the newest release :relaxed:
I think this issue is safe to close now since there are no reports of listener crashes in the latest release. Thank you all for providing more information, and please let us know if you experience any issues like this again!