
Scale Set Listener Stops Responding

jameshounshell opened this issue 1 year ago · 23 comments

Checks

  • [X] I've already read https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners-with-actions-runner-controller/troubleshooting-actions-runner-controller-errors and I'm sure my issue is not covered in the troubleshooting guide.
  • [X] I am using charts that are officially provided

Controller Version

0.7.0

Deployment Method

Helm

Checks

  • [X] This isn't a question or user support case (For Q&A and community support, go to Discussions).
  • [X] I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes

To Reproduce

1. Install both the ARC controller and the runner scale set Helm charts to a Kubernetes cluster.
2. Run GitHub Actions workflows/jobs normally with a self-hosted runner.

Beyond this, we do not have specific reproduction steps, and we would very much appreciate suggestions on how to gather more information to trigger this bug.

Describe the bug

Symptom:

  • Developers reported that jobs were pending for over 20 minutes (in the GitHub workflow job view)
  • Restarting the controller had no effect
  • Deleting the listener worked: once a new listener pod starts, new runner pods start. (Alternatively, we have since discovered that after about an hour the listener begins responding again and the logs say "refreshing token".)

Diagnostics:

  • The listener stops logging
  • The only signal we can alert on, using the metrics exposed by the listener, is an expression like ((gha_assigned_jobs > 0) and (rate(gha_assigned_jobs[10m]) == 0)). (A sketch of an alert rule built on this is shown below.)
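
For anyone who wants to alert on this condition, here is a minimal sketch of a Prometheus alerting rule built around that expression. Only the PromQL comes from the observation above; the rule name, duration, and labels are illustrative and not part of ARC.

```yaml
# Hypothetical alerting rule; only the PromQL expression is taken from the
# observation above, everything else (names, thresholds, labels) is illustrative.
groups:
  - name: arc-listener
    rules:
      - alert: ArcListenerStuck
        # Jobs are assigned, but the assigned-jobs gauge has not moved in 10 minutes.
        expr: (gha_assigned_jobs > 0) and (rate(gha_assigned_jobs[10m]) == 0)
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "ARC scale set listener appears to have stopped responding"
```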

Another symptom I've noticed: when the controller pod is updated (for instance, from 0.7.0 to 0.8.0), two inconsistencies happen:

  • If there are old job pods running, they will block the new listener from starting. I need to delete the pods and do a rollout restart of the controller to get the listener going
  • Just updating the deployment doesn't seem to consistently start the listener either. There is always a chance that a rollout restart is needed to get it going.

A separate instance where we observed the issue

Currently running listener gha-runner-scale-set-controller:0.7.0. No logs were being generated, and we had found a workflow job in a pending state (from the listener logs). Ten runners were online, but the runner group was offline when viewed in our org's Actions runner group settings (a URL like https://github.com/organizations/bigcorp/settings/actions/runner-groups/9).

Verified that the listener metrics had flatlined.

Then the queued job being watched was assigned with no intervention. Below is the section of logs where the listener stopped and then started working again on its own.

2023-12-28T17:10:14Z	INFO	service	process batched runner scale set job messages.	{"messageId": 13462, "batchSize": 4}
2023-12-28T17:10:14Z	INFO	service	job started message received.	{"RequestId": 1026947, "RunnerId": 1062477}
2023-12-28T17:10:14Z	INFO	service	update job info for runner	{"runnerName": "self-hosted-pqbsj-runner-dkmf2", "ownerName": "bigcorp", "repoName": "cc-api-orc-kt", "workflowRef": "bigcorp/github-actions-shared/.github/workflows/sql-migration-lint.yaml@refs/heads/v1", "workflowRunId": 7349734264, "jobDisplayName": "build / sql-lint / lint-migrations", "requestId": 1026947}
2023-12-28T17:10:14Z	INFO	KubernetesManager	Created merge patch json for EphemeralRunner status update	{"json": "{\"status\":{\"jobDisplayName\":\"build / sql-lint / lint-migrations\",\"jobRepositoryName\":\"bigcorp/cc-api-orc-kt\",\"jobRequestId\":1026947,\"jobWorkflowRef\":\"bigcorp/github-actions-shared/.github/workflows/sql-migration-lint.yaml@refs/heads/v1\",\"workflowRunId\":7349734264}}"}
2023-12-28T17:10:14Z	INFO	service	job started message received.	{"RequestId": 1026948, "RunnerId": 1062478}
2023-12-28T17:10:14Z	INFO	service	update job info for runner	{"runnerName": "self-hosted-pqbsj-runner-gl8pq", "ownerName": "bigcorp", "repoName": "cc-api-orc-kt", "workflowRef": "bigcorp/github-actions-shared/.github/workflows/gradle.yaml@refs/heads/v1", "workflowRunId": 7349734264, "jobDisplayName": "build / build / code build", "requestId": 1026948}
2023-12-28T17:10:14Z	INFO	KubernetesManager	Created merge patch json for EphemeralRunner status update	{"json": "{\"status\":{\"jobDisplayName\":\"build / build / code build\",\"jobRepositoryName\":\"bigcorp/cc-api-orc-kt\",\"jobRequestId\":1026948,\"jobWorkflowRef\":\"bigcorp/github-actions-shared/.github/workflows/gradle.yaml@refs/heads/v1\",\"workflowRunId\":7349734264}}"}
2023-12-28T17:10:14Z	INFO	service	job assigned message received.	{"RequestId": 1026947}
2023-12-28T17:10:14Z	INFO	service	job assigned message received.	{"RequestId": 1026948}
2023-12-28T17:10:14Z	INFO	auto_scaler	acquiring jobs.	{"request count": 0, "requestIds": "[]"}
2023-12-28T17:10:15Z	INFO	auto_scaler	deleted message.	{"messageId": 13462}
2023-12-28T17:10:15Z	INFO	service	waiting for message...

2023-12-28T18:10:15Z	INFO	refreshing_client	message queue token is expired during GetNextMessage, refreshing...
2023-12-28T18:10:15Z	INFO	refreshing token	{"githubConfigUrl": "https://github.com/bigcorp"}
2023-12-28T18:10:15Z	INFO	getting access token for GitHub App auth	{"accessTokenURL": "https://api.github.com/app/installations/43625644/access_tokens"}
2023-12-28T18:10:15Z	INFO	getting runner registration token	{"registrationTokenURL": "https://api.github.com/orgs/bigcorp/actions/runners/registration-token"}
2023-12-28T18:10:15Z	INFO	getting Actions tenant URL and JWT	{"registrationURL": "https://api.github.com/actions/runner-registration"}
2023-12-28T18:10:16Z	INFO	service	process message.	{"messageId": 13463, "messageType": "RunnerScaleSetJobMessages"}
2023-12-28T18:10:16Z	INFO	service	current runner scale set statistics.	{"available jobs": 26, "acquired jobs": 0, "assigned jobs": 0, "running jobs": 0, "registered runners": 10, "busy runners": 0, "idle runners": 10}
2023-12-28T18:10:16Z	INFO	service	process batched runner scale set job messages.	{"messageId": 13463, "batchSize": 1}
2023-12-28T18:10:16Z	INFO	service	job completed message received.	{"RequestId": 1026947, "Result": "succeeded", "RunnerId": 1062477, "RunnerName": "self-hosted-pqbsj-runner-dkmf2"}
2023-12-28T18:10:16Z	INFO	auto_scaler	acquiring jobs.	{"request count": 0, "requestIds": "[]"}

Describe the expected behavior

Ideally the listener would never stop responding.

Additional Context

The only additional thing we tried was using the opt-out button in our GitHub App's advanced features. This was kind of a Hail Mary, since we saw the logs related to refreshing the token. It seems to have helped, but maybe that's just a fluke.

Controller Logs

Here are the logs from 10 minutes before through 10 minutes after the listener stops responding.

https://gist.github.com/jameshounshell/597358f9b0b624d1f80f98057ddddcf0

Runner Pod Logs

N/A

jameshounshell avatar Jan 05 '24 19:01 jameshounshell

We've had a pretty huge break in the case.

It appears that the listener's failure to respond is directly linked to when the token is refreshed.

In the image, the yellow highlights mark every time the token refreshes, via the log message "refreshing_client". The refresh appears to be reactive, in response to a failed HTTP call, happening approximately every 60-65 minutes (which seems to jibe with the 1-hour maximum token lifetime for a GitHub App). But before the outage there is no token refresh. Then, one hour later, the token refreshes, the logs begin again, and jobs start running again.

[Screenshot: listener log timeline with the "refreshing_client" entries highlighted in yellow]

jameshounshell avatar Jan 05 '24 20:01 jameshounshell

Hey @jameshounshell,

Is it possible that network interruptions are causing problems? In version 0.8.0, we added client timeouts to the retryable client, which could help determine whether there are any network issues. Just keep in mind that metrics are missing from the newest listener app (they will be part of 0.8.2; to enable metrics in the meantime, you can use the old listener app as described here).

nikola-jokic avatar Jan 08 '24 12:01 nikola-jokic

We can sit tight and wait for 0.8.2 and report back on this issue.

jameshounshell avatar Jan 09 '24 21:01 jameshounshell

I am seeing this exact same problem with our listener pod. I can also confirm it appears to be related to the token refresh. It's always the last thing logged by the pod before it stops responding.

jasonluck avatar Jan 24 '24 16:01 jasonluck

Hey @jasonluck,

Could you also please post the logs so we can inspect them? It is always better to have more evidence of an issue to help us find the root cause.

nikola-jokic avatar Jan 25 '24 12:01 nikola-jokic

I am going to close this issue until we hear back from you. Please, let us know if the issue is resolved. If not, we can always re-open it :relaxed:

nikola-jokic avatar Feb 12 '24 09:02 nikola-jokic

Now that we've updated to the 0.8.2 runner, we are getting a better indication of what happens when the listener pod stops. Instead of simply no longer polling for jobs, the pod now stops in an error state. The controller does not attempt to kill/restart the listener.

Here are the logs I collected this morning:

{"severity":"info","ts":"2024-02-11T02:27:36Z","logger":"listener-app.listener","message":"Getting next message","lastMessageID":557}
{"severity":"info","ts":"2024-02-11T02:27:57Z","logger":"listener-app.listener","message":"Processing message","messageId":558,"messageType":"RunnerScaleSetJobMessages"}
{"severity":"info","ts":"2024-02-11T02:27:57Z","logger":"listener-app.listener","message":"New runner scale set statistics.","statistics":{"totalAvailableJobs":1,"totalAcquiredJobs":1,"totalAssignedJobs":1,"totalRunningJobs":1,"totalRegisteredRunners":11,"totalBusyRunners":1,"totalIdleRunners":10}}
{"severity":"info","ts":"2024-02-11T02:27:57Z","logger":"listener-app.listener","message":"Job available message received","jobId":1132810}
{"severity":"info","ts":"2024-02-11T02:27:57Z","logger":"listener-app.listener","message":"Acquiring jobs","count":1,"requestIds":"[1132810]"}
{"severity":"info","ts":"2024-02-11T02:27:57Z","logger":"listener-app.listener","message":"Message queue token is expired during GetNextMessage, refreshing..."}
{"severity":"info","ts":"2024-02-11T02:27:57Z","logger":"listener-app","message":"refreshing token","githubConfigUrl":"https://github.com/missionlane"}
{"severity":"info","ts":"2024-02-11T02:27:57Z","logger":"listener-app","message":"getting access token for GitHub App auth","accessTokenURL":"https://api.github.com/app/installations/43625644/access_tokens"}
{"severity":"info","ts":"2024-02-11T02:27:57Z","logger":"listener-app","message":"getting runner registration token","registrationTokenURL":"https://api.github.com/orgs/missionlane/actions/runners/registration-token"}
{"severity":"info","ts":"2024-02-11T02:27:57Z","logger":"listener-app","message":"getting Actions tenant URL and JWT","registrationURL":"https://api.github.com/actions/runner-registration"}
{"severity":"info","ts":"2024-02-11T02:27:58Z","logger":"listener-app.listener","message":"Deleting message session"}
2024/02/11 02:27:58 Application returned an error: failed to handle message: failed to acquire jobs: failed to acquire jobs after session refresh: 400 - had issue communicating with Actions backend: Value cannot be null. (Parameter 'source')

ccureau avatar Feb 12 '24 21:02 ccureau

@nikola-jokic Please re-open this ticket :)

ccureau avatar Feb 12 '24 21:02 ccureau

We are on 0.8.1 (with the LISTENER_ENTRYPOINT workaround) and I think we saw this today: one of our four listeners just stopped responding. The controller finalised and removed its remaining runner pods as they finished their jobs, but didn't produce any errors or otherwise appear to notice. About 45 minutes later a node scale-down evicted the controller, and it all came back to life after that.

We also saw a token refresh just before this happened, listener logs:

2024-02-14T11:55:41Z INFO refreshing_client message queue token is expired during AcquireJobs, refreshing...
2024-02-14T11:55:41Z INFO refreshing token {"githubConfigUrl": "https://github.com/********"}
2024-02-14T11:55:42Z INFO getting access token for GitHub App auth {"accessTokenURL": "https://api.github.com/app/installations/********/access_tokens"}
2024-02-14T11:55:48Z INFO getting runner registration token {"registrationTokenURL": "https://api.github.com/orgs/********/actions/runners/registration-token"}
2024-02-14T11:55:52Z INFO getting Actions tenant URL and JWT {"registrationURL": "https://api.github.com/actions/runner-registration"}
2024-02-14T11:56:02Z INFO auto_scaler acquired jobs. {"requested": 5, "acquired": 5}
  
2024-02-14T12:41:20Z INFO Starting metrics server {"address": ":8080"}

We have been using the current version for about a month without seeing it until today, but we are on EKS and upgraded from 1.28 to 1.29 yesterday. If it happens again I may downgrade and see if it goes away.

RalphSleighK avatar Feb 14 '24 14:02 RalphSleighK

@RalphSleighK This is the behavior we saw with 0.8.1 as well. If left alone, it seems to recover after about an hour of being brain-dead. Otherwise, we have to delete the pod to make it happy again.

ccureau avatar Feb 14 '24 14:02 ccureau

@ccureau What version of K8s are you using out of curiosity? We've just upgraded to 1.29, wondering if it's a coincidence that we saw this issue post-upgrade.

angusjellis avatar Feb 14 '24 14:02 angusjellis

@angusjellis We're on 1.27.10 in a GKE cluster. We did an upgrade from 1.26 to 1.27 while this was happening without any change in listener behavior for what it's worth.

ccureau avatar Feb 14 '24 14:02 ccureau

We got a new error this morning...unsure what to make of it:

{"severity":"info","ts":"2024-02-14T11:21:43Z","logger":"listener-app.listener","message":"Deleting last message","lastMessageID":1298}
{"severity":"info","ts":"2024-02-14T11:21:43Z","logger":"listener-app.listener","message":"Deleting message session"}
{"severity":"info","ts":"2024-02-14T11:21:43Z","logger":"listener-app","message":"refreshing token","githubConfigUrl":"https://github.com/missionlane"}
{"severity":"info","ts":"2024-02-14T11:21:43Z","logger":"listener-app","message":"getting access token for GitHub App auth","accessTokenURL":"https://api.github.com/app/installations/43625644/access_tokens"}
{"severity":"info","ts":"2024-02-14T11:21:43Z","logger":"listener-app","message":"getting runner registration token","registrationTokenURL":"https://api.github.com/orgs/missionlane/actions/runners/registration-token"}
{"severity":"info","ts":"2024-02-14T11:21:43Z","logger":"listener-app","message":"getting Actions tenant URL and JWT","registrationURL":"https://api.github.com/actions/runner-registration"}
2024/02/14 11:21:43 Application returned an error: failed to handle message: failed to delete message: failed to delete message: {"$id":"1","innerException":null,"message":"The user 'System:PublicAccess;aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa' is not authorized to access this resource.","typeName":"Microsoft.TeamFoundation.Framework.Server.UnauthorizedRequestException, Microsoft.TeamFoundation.Framework.Server","typeKey":"UnauthorizedRequestException","errorCode":0,"eventId":3000}

ccureau avatar Feb 14 '24 15:02 ccureau

With ghcr.io/actions/gha-runner-scale-set-controller:0.8.3 we still get

2024/03/06 13:19:57 Application returned an error: failed to handle message: failed to delete message: failed to delete message: {"$id":"1","innerException":null,"message":"The user 'System:PublicAccess;aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa' is not authorized to access this resource.","typeName":"Microsoft.TeamFoundation.Framework.Server.UnauthorizedRequestException, Microsoft.TeamFoundation.Framework.Server","typeKey":"UnauthorizedRequestException","errorCode":0,"eventId":3000}

once/twice a day (k8s v1.28.5)

pavelsmolensky avatar Mar 06 '24 14:03 pavelsmolensky

> With ghcr.io/actions/gha-runner-scale-set-controller:0.8.3 we still get
>
> 2024/03/06 13:19:57 Application returned an error: failed to handle message: failed to delete message: failed to delete message: {"$id":"1","innerException":null,"message":"The user 'System:PublicAccess;aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa' is not authorized to access this resource.","typeName":"Microsoft.TeamFoundation.Framework.Server.UnauthorizedRequestException, Microsoft.TeamFoundation.Framework.Server","typeKey":"UnauthorizedRequestException","errorCode":0,"eventId":3000}
>
> once/twice a day (k8s v1.28.5)

Same here with GKE v1.29.1 and gha-runner-scale-set-0.9.0

clementnero avatar Apr 12 '24 09:04 clementnero

Re-opening this one for https://github.com/actions/actions-runner-controller/issues/3204#issuecomment-2051417111

nikola-jokic avatar Apr 12 '24 09:04 nikola-jokic

Also seeing this here, with 0.9.1. The listener seems to be running but we don't see any new jobs being created in the cluster. What's frustrating is that the log level is already debug by default, and there's no way to get any further detail from the listener to see what might be happening.

I did take a quick look at the listener queue handling code, and while I'm not positive I'm either correct or that the possible problem is even related to this, I wonder if we have a goroutine leak here: https://github.com/actions/actions-runner-controller/blob/master/github/actions/client.go#L610

When a queue message is requested and read, if any of the error paths are followed, we defer the closing of resp.Body. If it is the happy path, we pass resp.Body to the json.NewDecoder (which only accepts an io.Reader) and never close it. Possibly we tie up all the client connections within leaked goroutines? I might make a PR to address it.

EDIT: I missed the full response body buffering in client.Do, so this is not a problem.

ohookins avatar Apr 29 '24 00:04 ohookins

Hey @ohookins,

If your runners are taking a long time to be scheduled on nodes, it may be related to this issue: https://github.com/actions/actions-runner-controller/issues/3450

nikola-jokic avatar Apr 29 '24 09:04 nikola-jokic

@nikola-jokic yes, that sounds very much like it! Thanks :)

ohookins avatar Apr 29 '24 10:04 ohookins

I added restartPolicy: Always to listenerTemplate.spec in the AutoscalingRunnerSet. It seems like a temporary fix that has helped: the listener container now restarts immediately after it exits with code 1 due to the issue described here, rather than the pod sitting in an Error state until it is deleted.
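
For reference, a minimal sketch of this workaround expressed as Helm values for the gha-runner-scale-set chart, assuming the chart passes listenerTemplate through to the AutoscalingRunnerSet unchanged:

```yaml
# Sketch of the restartPolicy workaround described above; only the
# listenerTemplate.spec.restartPolicy field comes from this thread.
listenerTemplate:
  spec:
    restartPolicy: Always  # restart the listener container after it exits with code 1
```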

So far, one day has passed, and I am waiting for more statistical data to accumulate.

verdel avatar Apr 29 '24 10:04 verdel

Hey @verdel,

When you can, please provide the log from when the listener restarts. Since this was also an issue with the older version of the listener, it may be related to the Actions server. In the 0.9.1 release, we added a trace ID, which should help us track down the issue.

nikola-jokic avatar Apr 29 '24 10:04 nikola-jokic

I'll send the logs a bit later. We are staying on version 0.9.0 as we are waiting for a fix that will be added in this PR.

verdel avatar Apr 29 '24 10:04 verdel

Lately, I've been seeing the following messages in the listener logs:

2024-04-27T07:07:53Z	INFO	listener-app.listener	Job started message received.	{"RequestId": 653192, "RunnerId": 76946}
2024-04-27T07:07:53Z	INFO	listener-app.listener	Job started message received.	{"RequestId": 653194, "RunnerId": 76949}
2024-04-27T07:07:53Z	INFO	listener-app.listener	Job started message received.	{"RequestId": 653193, "RunnerId": 76950}
2024-04-27T07:07:53Z	INFO	listener-app.listener	Deleting last message	{"lastMessageID": 180}
2024-04-27T07:07:53Z	INFO	listener-app.listener	Deleting message session
2024-04-27T07:07:53Z	INFO	listener-app	refreshing token	{"githubConfigUrl": "https://github.com/<org_name>"}
2024-04-27T07:07:53Z	INFO	listener-app	getting access token for GitHub App auth	{"accessTokenURL": "https://api.github.com/app/installations/<installation_id>/access_tokens"}
2024-04-27T07:07:53Z	INFO	listener-app	getting runner registration token	{"registrationTokenURL": "https://api.github.com/orgs/<org_name>/actions/runners/registration-token"}
2024-04-27T07:07:53Z	INFO	listener-app	getting Actions tenant URL and JWT	{"registrationURL": "https://api.github.com/actions/runner-registration"}
2024/04/27 07:07:54 Application returned an error: failed to handle message: failed to delete message: failed to delete message: {"$id":"1","innerException":null,"message":"The user 'System:PublicAccess;aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa' is not authorized to access this resource.","typeName":"Microsoft.TeamFoundation.Framework.Server.UnauthorizedRequestException, Microsoft.TeamFoundation.Framework.Server","typeKey":"UnauthorizedRequestException","errorCode":0,"eventId":3000}

We use runners that are rarely launched (no more than once or twice a day). Such a message appears during the first launch of the day.

After that, ghalistener terminates with exit code 1 and the container transitions to an Error state. Consequently, jobs stop being processed.

Today, after I added a restart policy, there have been no messages so far. I will continue to observe and provide additional information when it becomes available.

verdel avatar Apr 29 '24 15:04 verdel

Hey everyone, I believe we found the root cause of this issue, and it is now fixed in 0.9.2. This was the fix: https://github.com/actions/actions-runner-controller/pull/3529

I will leave this issue open for now until someone confirms they are not seeing these problems with the newest release :relaxed:

nikola-jokic avatar May 22 '24 08:05 nikola-jokic

I think this issue is safe to close now since there are no reports of listener crashes in the latest release. Thank you all for providing more information, and please let us know if you experience any issues like this again!

nikola-jokic avatar May 24 '24 09:05 nikola-jokic