actions-runner-controller
0.9.1 does not work with on-demand provisioned nodes
Checks
- [X] I've already read https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners-with-actions-runner-controller/troubleshooting-actions-runner-controller-errors and I'm sure my issue is not covered in the troubleshooting guide.
- [X] I am using charts that are officially provided
Controller Version
0.9.1
Deployment Method
Helm
Checks
- [X] This isn't a question or user support case (For Q&A and community support, go to Discussions).
- [X] I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes
To Reproduce
1. Configure Karpenter to schedule workload on a separate node pool
2. Run the pipeline
Describe the bug
We are using Karpenter to dynamically provision worker nodes for our pipelines. It takes around 2 minutes to provision a node and start the runner pod on it. However, the controller appears to scale the EphemeralRunnerSet down after 1 minute, while the pod is still in the Pending status. This leads to the pipeline being stuck.
Might be related to https://github.com/actions/actions-runner-controller/issues/3420 and https://github.com/actions/actions-runner-controller/pull/3426
Describe the expected behavior
The controller does not scale the EphemeralRunnerSet down after 1 minute of the runner pod being in the Pending state.
Additional Context
```yaml
# values.yaml
arcOperator:
  values:
    replicaCount: 3
    affinity:
      podAntiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app.kubernetes.io/name: gha-rs-controller
            topologyKey: kubernetes.io/hostname
arcRunners:
  values:
    githubConfigUrl: "https://github.com/CareerLunch/career-lunch"
    githubConfigSecret:
      github_token: {{ requiredEnv "ARC_CHART_GITHUB_TOKEN" | quote }}
    containerMode:
      type: kubernetes
      kubernetesModeWorkVolumeClaim:
        accessModes: ["ReadWriteOnce"]
        storageClassName: "career-lunch-bootstrap-aws-ebs-gp3"
        resources:
          requests:
            storage: 10Gi
    template:
      metadata:
        annotations:
          karpenter.sh/do-not-evict: "true"
      spec:
        securityContext:
          fsGroup: 123
        nodeSelector:
          workload: arc_runners_v1
        affinity:
          podAntiAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              - labelSelector:
                  matchLabels:
                    actions-ephemeral-runner: "True"
                topologyKey: kubernetes.io/hostname
        tolerations:
          - key: workload
            operator: Equal
            value: arc_runners_v1
            effect: NoSchedule
        containers:
          - name: runner
            image: ghcr.io/actions/actions-runner:latest
            command: ["/home/runner/run.sh"]
            env:
              - name: ACTIONS_RUNNER_REQUIRE_JOB_CONTAINER
                value: "false"
              - name: ACTIONS_RUNNER_CONTAINER_HOOK_TEMPLATE
                value: /mnt/pod-template/pod-template.yml
            volumeMounts:
              - name: pod-template
                mountPath: /mnt/pod-template
                readOnly: true
            resources:
              requests:
                memory: "250Mi"
                cpu: "100m"
              limits:
                memory: "250Mi"
        volumes:
          - name: pod-template
            configMap:
              name: career-lunch-bootstrap-pod-template
```
---
```yaml
# pod-template.yml
{{- $name := ( include "career-lunch-library.fullname" . ) }}
{{- $labels := ( include "career-lunch-library.labels" . ) }}
{{- /* See https://github.com/actions/runner-container-hooks/blob/main/docs/adrs/0096-hook-extensions.md */}}
apiVersion: v1
kind: ConfigMap
metadata:
  name: {{ $name }}-pod-template
  namespace: {{ .Values.arcRunners.namespace }}
  labels:
    {{ $labels | indent 4 }}
data:
  pod-template.yml: |
    {{- /*
    Technically, we do not need a PodTemplate, see https://github.com/actions/runner-container-hooks/blob/main/examples/extension.yaml
    But this communicates the intent better
    */}}
    apiVersion: v1
    kind: PodTemplate
    metadata:
      annotations:
        karpenter.sh/do-not-evict: "true"
    spec:
      serviceAccountName: {{ include "career-lunch-library.fullname" . }}-arc-runners-sa
      securityContext:
        {{- /* Provides access to /home/runner/_work directory in ephemeral volume */}}
        fsGroup: 123
      nodeSelector:
        workload: arc_runners_v1
      tolerations:
        - key: workload
          operator: Equal
          value: arc_runners_v1
          effect: NoSchedule
      containers:
        {{- /* Overrides the job container */}}
        - name: $job
          resources:
            {{- /* Memory values are rather high, but we need them for webpack / Cypress jobs */}}
            requests:
              memory: 4Gi
              cpu: 1
            {{- /*
            Best practice is to add limits too
            However, we run these jobs in isolated instances and want them to use all resources available
            Instance resources are configured via Karpenter
            */}}
```
Controller Logs
https://gist.github.com/wasd171/eae06234a06a26592a264cd617711a85
Look at the logs starting with 2024-04-17T17:32:32Z
Listener logs:
```
2024-04-17T17:32:24Z INFO listener-app.listener Job available message received {"jobId": 25672}
2024-04-17T17:32:24Z INFO listener-app.listener Acquiring jobs {"count": 1, "requestIds": "[25672]"}
2024-04-17T17:32:24Z INFO listener-app.listener Jobs are acquired {"count": 1, "requestIds": "[25672]"}
2024-04-17T17:32:24Z INFO listener-app.listener Deleting last message {"lastMessageID": 10}
2024-04-17T17:32:25Z INFO listener-app.worker.kubernetesworker Created merge patch json for EphemeralRunnerSet update {"json": "{\"spec\":{\"patchID\":0,\"replicas\":null}}"}
2024-04-17T17:32:25Z INFO listener-app.worker.kubernetesworker Scaling ephemeral runner set {"assigned job": 0, "decision": 0, "min": 0, "max": 2147483647, "currentRunnerCount": 0, "jobsCompleted": 0}
2024-04-17T17:32:25Z INFO listener-app.worker.kubernetesworker Ephemeral runner set scaled. {"namespace": "arc-runners", "name": "arc-runners-prod-56cgq", "replicas": 0}
2024-04-17T17:32:25Z INFO listener-app.listener Getting next message {"lastMessageID": 10}
2024-04-17T17:32:31Z INFO listener-app.listener Processing message {"messageId": 11, "messageType": "RunnerScaleSetJobMessages"}
2024-04-17T17:32:31Z INFO listener-app.listener New runner scale set statistics. {"statistics": {"totalAvailableJobs":0,"totalAcquiredJobs":1,"totalAssignedJobs":1,"totalRunningJobs":0,"totalRegisteredRunners":0,"totalBusyRunners":0,"totalIdleRunners":0}}
2024-04-17T17:32:31Z INFO listener-app.listener Job assigned message received {"jobId": 25672}
2024-04-17T17:32:31Z INFO listener-app.listener Deleting last message {"lastMessageID": 11}
2024-04-17T17:32:32Z INFO listener-app.worker.kubernetesworker Created merge patch json for EphemeralRunnerSet update {"json": "{\"spec\":{\"patchID\":1,\"replicas\":1}}"}
2024-04-17T17:32:32Z INFO listener-app.worker.kubernetesworker Scaling ephemeral runner set {"assigned job": 1, "decision": 1, "min": 0, "max": 2147483647, "currentRunnerCount": 0, "jobsCompleted": 0}
2024-04-17T17:32:32Z INFO listener-app.worker.kubernetesworker Ephemeral runner set scaled. {"namespace": "arc-runners", "name": "arc-runners-prod-56cgq", "replicas": 1}
2024-04-17T17:32:32Z INFO listener-app.listener Getting next message {"lastMessageID": 11}
2024-04-17T17:33:22Z INFO listener-app.worker.kubernetesworker Created merge patch json for EphemeralRunnerSet update {"json": "{\"spec\":{\"patchID\":0,\"replicas\":null}}"}
2024-04-17T17:33:22Z INFO listener-app.worker.kubernetesworker Scaling ephemeral runner set {"assigned job": 0, "decision": 0, "min": 0, "max": 2147483647, "currentRunnerCount": 1, "jobsCompleted": 0}
2024-04-17T17:33:22Z INFO listener-app.worker.kubernetesworker Ephemeral runner set scaled. {"namespace": "arc-runners", "name": "arc-runners-prod-56cgq", "replicas": 0}
2024-04-17T17:33:22Z INFO listener-app.listener Getting next message {"lastMessageID": 11}
```
Runner Pod Logs
Unable to provide, since the pod is evicted by the controller.
Hey @wasd171,
Thank you for reporting this! The root cause of the problem is that in 0.9.1, the assumption was that if there is an empty batch, we self-correct the count. My assumption was that 50s would be enough time for the cluster to be ready, but in this case it was obviously wrong.
An important thing to point out is that it should self-correct on the next run on the scale set. However, this should be fixed, so again, thank you for bringing this to our attention! Please use the older version of ARC in the meantime. You can also use the older version of the listener that doesn't propagate the patch ID; it may occasionally create more pods than necessary, but it handles this case appropriately.
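For anyone needing the interim workaround, rolling back is a matter of pinning the chart version with Helm. A minimal sketch, assuming the official OCI charts and example release/namespace names (substitute your own):

```shell
# Roll the controller chart back to 0.9.0 (release and namespace names are examples)
helm upgrade arc oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set-controller \
  --namespace arc-systems \
  --version 0.9.0

# Roll the runner scale set chart back to the matching version
helm upgrade arc-runner-set oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set \
  --namespace arc-runners \
  --version 0.9.0 \
  -f values.yaml
```

Keeping both charts on the same version avoids listener/controller mismatches.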
@nikola-jokic , I want to check if this is the same issue I am having.
I have a listener that scheduled 7 jobs to 7 arc-runner pods, but those pods were pending while waiting for a node to come online. After about a minute, all of those pods went away, and then 1-2 minutes later the node was ready for them to run on. The pods didn't come back up, though.
Meanwhile, in GitHub Actions, the jobs were stuck waiting for a runner group. The listener logs in arc-systems show that those jobs were assigned, but the pods they were assigned to had been terminated (probably by the controller), so the jobs were lost, never to be scheduled.
Does this sound like the same thing, and would rolling back to 0.9.0 address the problem?
Hey @koreyGambill,
I think so. Rolling back to 0.9.0 should fix the problem. I have to mention that in a very unlucky case, 0.9.0 can increase the latency of starting a job. It is a rare case, and if you have a busy cluster, it is very unlikely to happen. However, if you want to be 100% sure everything is executed as quickly as possible in every situation, please roll back to 0.8.3. That version of the controller can create and delete more ephemeral runners than 0.9.0, but it ensures that the runner is created as soon as possible. In the next release (0.9.2), the controller should be able to pick everything up right away and decrease the number of created pods.
If we could set the waiting time for pods in the gha-runner-scale-set-controller values.yaml, that would be great :)
Hey @ywang-psee,
We don't have a waiting time; this was a bug introduced in 0.9.1 :disappointed: This PR should fix it.
I think we are having the same problem. We are also using Karpenter here, and we actually saw the listener create the appropriate EphemeralRunnerSet settings, then the EphemeralRunners started creating, then the actual pods, but they all got killed off very quickly and it was back to 0 before the queued jobs could be picked up.
Have reverted to 0.9.0 and things seem to be working again.
Yep, same here, we reverted to 0.9.0 as well! ...Keeping an eye on this PR #3204
@nikola-jokic Hi, do you know when will release 0.9.2 happen ?
Hey @chenghuang-mdsol,
I cannot promise anything, but hopefully, later this week or next week.
I need to dig into it, but "rolling back" to 0.8.3 results in chart diff errors with regard to the ClusterRole and the namespace it's in. Chart behavior must have changed with regard to which namespace the accounts are created in. I set the namespaces aligned with the quick start docs (arc-runners and arc-systems) - do most people just use kube-system? We are still testing, so I can blow it all away, but I'm just wondering.
Also, for removal - will helm uninstall do this cleanly, or is there CRD cleanup required?
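On the removal question, here is a sketch of what a clean teardown could look like (release and namespace names are assumptions): `helm uninstall` removes the release's resources, but Helm deliberately does not delete CRDs on uninstall, so the ARC CRDs need a manual check:

```shell
# Remove the runner scale set release first, then the controller
helm uninstall arc-runner-set --namespace arc-runners
helm uninstall arc --namespace arc-systems

# Helm leaves CRDs in place by design; list any ARC CRDs that remain
kubectl get crds -o name | grep actions.github.com

# Delete them manually once no AutoscalingRunnerSets depend on them
kubectl delete $(kubectl get crds -o name | grep actions.github.com)
```

Stuck finalizers on AutoscalingRunnerSet resources are the usual cause of an uninstall hanging, so it is worth uninstalling the scale set before the controller that reconciles it.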
Hey everyone :wave:
Would anyone be interested in testing out this fix before we release it please :pray:
To do it, you can follow these steps:
- In the gha-runner-scale-set-controller values.yaml file, update the tag value to: canary-3bda9bb
- Update the field appVersion in the Chart.yaml file for gha-runner-scale-set to be: canary-3bda9bb
- Redeploy ARC using the updated helm chart and values.yaml files
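The tag change above can be sketched as a values.yaml override for the controller chart (the repository value below is an assumption based on the chart defaults; verify it against your own values.yaml):

```yaml
# gha-runner-scale-set-controller values.yaml
image:
  repository: "ghcr.io/actions/gha-runner-scale-set-controller"
  tag: "canary-3bda9bb"
```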
Thank you in advance! :heart:
Thanks @nikola-jokic, I just tested with the canary image tag and the helm chart, and it looks like it fixed the issue:
```
NAME          NAMESPACE    REVISION  UPDATED                               STATUS    CHART                                  APP VERSION
arc           arc-systems  1         2024-05-18 00:57:23.367937 +0900 KST  deployed  gha-runner-scale-set-controller-0.9.1  0.9.1
runner-arm64  arc-runners  1         2024-05-18 00:57:25.176608 +0900 KST  deployed  gha-runner-scale-set-0.9.1             canary-3bda9bb

$ kubectl get pod -n arc-runners
NAME                              READY   STATUS    RESTARTS   AGE
runner-arm64-k9mpd-runner-j7g9n   2/2     Running   0          82s
```
It took roughly 70~80 seconds for the node to be ready and for the runner pod to start running, and the controller tolerated the start time.
However, one thing I want to call out is that when I first tried to install this canary image and chart, the listener failed to boot up with the following missing-role error:
```
2024-05-17T15:44:10Z ERROR AutoscalingRunnerSet Failed to remove finalizers on dependent resources {"version": "canary-3bda9bb", "autoscalingrunnerset": {"name":"runner-arm64","namespace":"arc-runners"}, "error": "failed to patch manager role without finalizer: roles.rbac.authorization.k8s.io \"runner-arm64-gha-rs-manager\" not found"}
github.com/actions/actions-runner-controller/controllers/actions%2egithub%2ecom.(*AutoscalingRunnerSetReconciler).Reconcile
	github.com/actions/actions-runner-controller/controllers/actions.github.com/autoscalingrunnerset_controller.go:139
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
	sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:119
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
	sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:316
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
	sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:227
2024-05-17T15:44:10Z ERROR Reconciler error {"controller": "autoscalingrunnerset", "controllerGroup": "actions.github.com", "controllerKind": "AutoscalingRunnerSet", "AutoscalingRunnerSet": {"name":"runner-arm64","namespace":"arc-runners"}, "namespace": "arc-runners", "name": "runner-arm64", "reconcileID": "1f35d8c0-b14f-471b-bbc0-c132a7a5196b", "error": "failed to patch manager role without finalizer: roles.rbac.authorization.k8s.io \"runner-arm64-gha-rs-manager\" not found"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
	sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:329
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
	sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:227
2024-05-17T15:44:10Z INFO AutoscalingRunnerSet Deleting resources {"version": "canary-3bda9bb", "autoscalingrunnerset": {"name":"runner-arm64","namespace":"arc-runners"}}
2024-05-17T15:44:10Z INFO AutoscalingRunnerSet Cleaning up the listener {"version": "canary-3bda9bb", "autoscalingrunnerset": {"name":"runner-arm64","namespace":"arc-runners"}}
```
Maybe it was just a timing issue where the role/rolebinding was created after the controller, or just a one-off blip, since I couldn't reproduce it afterwards. I thought I would share it just in case.
Hi, I have the same issue in 0.9.3 @nikola-jokic:
```
2024-07-09T07:38:10.595179313Z stdout F [RUNNER 2024-07-09 07:38:10Z ERR  JobDispatcher] Catch exception during renew runner jobrequest 79794.
2024-07-09T07:38:11.246446746Z stdout F [RUNNER 2024-07-09 07:38:11Z ERR  JobDispatcher] System.TimeoutException: The HTTP request timed out after 00:01:00.
2024-07-09T07:38:11.246461895Z stdout F [RUNNER 2024-07-09 07:38:11Z ERR  JobDispatcher] ---> System.Threading.Tasks.TaskCanceledException: A task was canceled.
2024-07-09T07:38:11.246467637Z stdout F [RUNNER 2024-07-09 07:38:11Z ERR  JobDispatcher]    at System.Threading.Tasks.TaskCompletionSourceWithCancellation`1.WaitWithCancellationAsync(CancellationToken cancellationToken)
2024-07-09T07:38:11.246487934Z stdout F [RUNNER 2024-07-09 07:38:11Z ERR  JobDispatcher]    at System.Net.Http.HttpConnectionPool.GetHttp11ConnectionAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
2024-07-09T07:38:11.246493242Z stdout F [RUNNER 2024-07-09 07:38:11Z ERR  JobDispatcher]    at System.Net.Http.HttpConnectionPool.SendWithVersionDetectionAndRetryAsync(HttpRequestMessage request, Boolean async, Boolean doRequestAuth, CancellationToken cancellationToken)
2024-07-09T07:38:11.246497368Z stdout F [RUNNER 2024-07-09 07:38:11Z ERR  JobDispatcher]    at System.Net.Http.AuthenticationHelper.SendWithAuthAsync(HttpRequestMessage request, Uri authUri, Boolean async, ICredentials credentials, Boolean preAuthenticate, Boolean isProxyAuth, Boolean doRequestAuth, HttpConnectionPool pool, CancellationToken cancellationToken)
2024-07-09T07:38:11.246500393Z stdout F [RUNNER 2024-07-09 07:38:11Z ERR  JobDispatcher]    at System.Net.Http.DecompressionHandler.SendAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
2024-07-09T07:38:11.246503597Z stdout F [RUNNER 2024-07-09 07:38:11Z ERR  JobDispatcher]    at GitHub.Services.Common.VssHttpMessageHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
2024-07-09T07:38:11.246506861Z stdout F [RUNNER 2024-07-09 07:38:11Z ERR  JobDispatcher]    --- End of inner exception stack trace ---
2024-07-09T07:38:11.246509682Z stdout F [RUNNER 2024-07-09 07:38:11Z ERR  JobDispatcher]    at GitHub.Services.Common.VssHttpMessageHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
2024-07-09T07:38:11.246512302Z stdout F [RUNNER 2024-07-09 07:38:11Z ERR  JobDispatcher]    at GitHub.Services.Common.VssHttpRetryMessageHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
2024-07-09T07:38:11.246515406Z stdout F [RUNNER 2024-07-09 07:38:11Z ERR  JobDispatcher]    at System.Net.Http.HttpClient.<SendAsync>g__Core|83_0(HttpRequestMessage request, HttpCompletionOption completionOption, CancellationTokenSource cts, Boolean disposeCts, CancellationTokenSource pendingRequestsCts, CancellationToken originalCancellationToken)
2024-07-09T07:38:11.246518675Z stdout F [RUNNER 2024-07-09 07:38:11Z ERR  JobDispatcher]    at GitHub.Services.WebApi.VssHttpClientBase.SendAsync(HttpRequestMessage message, HttpCompletionOption completionOption, Object userState, CancellationToken cancellationToken)
2024-07-09T07:38:11.246521713Z stdout F [RUNNER 2024-07-09 07:38:11Z ERR  JobDispatcher]    at GitHub.DistributedTask.WebApi.TaskAgentHttpClient.SendAsync[T](HttpRequestMessage message, Object userState, CancellationToken cancellationToken, Func`3 processResponse)
2024-07-09T07:38:11.246530257Z stdout F [RUNNER 2024-07-09 07:38:11Z ERR  JobDispatcher]    at GitHub.DistributedTask.WebApi.TaskAgentHttpClient.SendAsync[T](HttpMethod method, IEnumerable`1 additionalHeaders, Guid locationId, Object routeValues, ApiResourceVersion version, HttpContent content, IEnumerable`1 queryParameters, Object userState, CancellationToken cancellationToken, Func`3 processResponse)
2024-07-09T07:38:11.246533658Z stdout F [RUNNER 2024-07-09 07:38:11Z ERR  JobDispatcher]    at GitHub.Runner.Listener.JobDispatcher.RenewJobRequestAsync(IRunnerServer runnerServer, Int32 poolId, Int64 requestId, Guid lockToken, String orchestrationId, TaskCompletionSource`1 firstJobRequestRenewed, CancellationToken token)
2024-07-09T07:38:11.246537357Z stdout F [RUNNER 2024-07-09 07:38:11Z ERR  JobDispatcher] #####################################################
2024-07-09T07:38:11.246539905Z stdout F [RUNNER 2024-07-09 07:38:11Z ERR  JobDispatcher] System.Threading.Tasks.TaskCanceledException: A task was canceled.
2024-07-09T07:38:11.246542555Z stdout F [RUNNER 2024-07-09 07:38:11Z ERR  JobDispatcher]    at System.Threading.Tasks.TaskCompletionSourceWithCancellation`1.WaitWithCancellationAsync(CancellationToken cancellationToken)
2024-07-09T07:38:11.246545359Z stdout F [RUNNER 2024-07-09 07:38:11Z ERR  JobDispatcher]    at System.Net.Http.HttpConnectionPool.GetHttp11ConnectionAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
2024-07-09T07:38:11.246556749Z stdout F [RUNNER 2024-07-09 07:38:11Z ERR  JobDispatcher]    at System.Net.Http.HttpConnectionPool.SendWithVersionDetectionAndRetryAsync(HttpRequestMessage request, Boolean async, Boolean doRequestAuth, CancellationToken cancellationToken)
2024-07-09T07:38:11.246559539Z stdout F [RUNNER 2024-07-09 07:38:11Z ERR  JobDispatcher]    at System.Net.Http.AuthenticationHelper.SendWithAuthAsync(HttpRequestMessage request, Uri authUri, Boolean async, ICredentials credentials, Boolean preAuthenticate, Boolean isProxyAuth, Boolean doRequestAuth, HttpConnectionPool pool, CancellationToken cancellationToken)
2024-07-09T07:38:11.246562315Z stdout F [RUNNER 2024-07-09 07:38:11Z ERR  JobDispatcher]    at System.Net.Http.DecompressionHandler.SendAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
2024-07-09T07:38:11.246565222Z stdout F [RUNNER 2024-07-09 07:38:11Z ERR  JobDispatcher]    at GitHub.Services.Common.VssHttpMessageHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
2024-07-09T07:38:11.246568123Z stdout F [RUNNER 2024-07-09 07:38:11Z INFO JobDispatcher] Retrying lock renewal for jobrequest 79794. Job is valid until 07/09/2024 07:45:56.
2024-07-09T07:38:11.263080643Z stdout F [RUNNER 2024-07-09 07:38:11Z INFO RunnerServer] Refresh JobRequest VssConnection to get on a different AFD node.
2024-07-09T07:38:11.263095261Z stdout F [RUNNER 2024-07-09 07:38:11Z INFO RunnerServer] EstablishVssConnection
2024-07-09T07:38:11.26309838Z stdout F [RUNNER 2024-07-09 07:38:11Z INFO RunnerServer] Establish connection with 30 seconds timeout.
2024-07-09T07:38:11.338495992Z stdout F [RUNNER 2024-07-09 07:38:11Z INFO GitHubActionsService] Starting operation Location.GetConnectionData
{
  "$id": "1",
  "innerException": null,
  "message": "The user 'System:PublicAccess;aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa' is not authorized to access this resource.",
  "typeName": "Microsoft.TeamFoundation.Framework.Server.UnauthorizedRequestException, Microsoft.TeamFoundation.Framework.Server",
  "typeKey": "UnauthorizedRequestException",
  "errorCode": 0,
  "eventId": 3000
}
```
Not sure why.
EKS 1.27 ... the setup is dind.