actions-runner-controller
Runners keep throwing "Is the docker daemon running?" errors
Checks
- [X] I've already read https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners-with-actions-runner-controller/troubleshooting-actions-runner-controller-errors and I'm sure my issue is not covered in the troubleshooting guide.
- [X] I am using charts that are officially provided
Controller Version
actions-runner-controller-0.22.0
Deployment Method
Helm
Checks
- [X] This isn't a question or user support case (For Q&A and community support, go to Discussions).
- [X] I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes
To Reproduce
1) Installed ARC with a GitHub token.
2) Configured a RunnerDeployment with replicas: 10.
3) Configured a HorizontalRunnerAutoscaler with min 10 / max 50 (see the sketch below).
4) Runners start in the Running state, but after some time new runners hit the "Is the docker daemon running?" error and jobs wait in the queue for a runner.
This happens for all new runners, and the pods then go into an Error state.
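For reference, a minimal sketch of the RunnerDeployment and HorizontalRunnerAutoscaler used here. The names, namespace, organization, and min/max come from the logs and steps above; the replicas field and the autoscaler thresholds are illustrative assumptions, not the exact manifests.

```yaml
# Sketch only - names match the setup above; thresholds are assumptions
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: github-action-np
  namespace: actions-runner-systems
spec:
  replicas: 10
  template:
    spec:
      organization: prosperllc
---
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: example-runner-deployment-autoscaler
  namespace: actions-runner-systems
spec:
  scaleTargetRef:
    kind: RunnerDeployment
    name: github-action-np
  minReplicas: 10
  maxReplicas: 50
  metrics:
    - type: PercentageRunnersBusy   # logs show this metric in use
      scaleUpThreshold: "0.75"      # assumed values
      scaleDownThreshold: "0.25"
```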
Describe the bug
Runners are in the Running state, but after some time new runners hit the "Is the docker daemon running?" error and jobs wait in the queue for a runner.
This happens for all new runners, and the pods then go into an Error state.
Describe the expected behavior
Based on the load, the horizontal autoscaler should scale the runners, but instead they throw the "Is the docker daemon running?" error.
Additional Context
❯ helm get values actions-runner-controller
USER-SUPPLIED VALUES:
authSecret:
create: true
github_token: ""  # supplied GitHub token (redacted)
Controller Logs
a2844b0833", "allowed": true}
2024-01-12T21:31:30Z INFO runner Failed to create pod due to AlreadyExists error. Probably this pod has been already created in previous reconcilation but is still not in the informer cache. Will retry on pod created. If it doesn't repeat, there's no problem {"runner": "actions-runner-systems/github-action-np-h6z4z-gf9pg"}
2024-01-12T21:31:31Z DEBUG runner Runner appears to have been registered and running. {"runner": "actions-runner-systems/github-action-np-h6z4z-gf9pg", "podCreationTimestamp": "2024-01-12 21:31:30 +0000 UTC"}
2024-01-12T21:31:36Z INFO runnerpod Failed to delete pod within 1m0s. This is typically the case when a Kubernetes node became unreachable and the kube controller started evicting nodes. Forcefully deleting the pod to not get stuck. {"runnerpod": "actions-runner-systems/github-action-np-h6z4z-qjc6z", "podDeletionTimestamp": "2024-01-12 21:30:25 +0000 UTC", "currentTime": "2024-01-12T21:31:36Z", "configuredDeletionTimeout": "1m0s"}
2024-01-12T21:31:36Z INFO runnerpod Forcefully deleted runner pod {"runnerpod": "actions-runner-systems/github-action-np-h6z4z-qjc6z", "repository": ""}
2024-01-12T21:31:36Z DEBUG events Forcefully deleted pod 'github-action-np-h6z4z-qjc6z' {"type": "Normal", "object": {"kind":"Pod","namespace":"actions-runner-systems","name":"github-action-np-h6z4z-qjc6z","uid":"3be68602-3db3-4803-a2b7-9ae0ec52df94","apiVersion":"v1","resourceVersion":"605720"}, "reason": "PodDeleted"}
2024-01-12T21:31:39Z DEBUG horizontalrunnerautoscaler Suggested desired replicas of 10 by PercentageRunnersBusy {"replicas_desired_before": 10, "replicas_desired": 10, "num_runners": 10, "num_runners_registered": 9, "num_runners_busy": 6, "num_terminating_busy": 0, "namespace": "actions-runner-systems", "kind": "runnerdeployment", "name": "github-action-np", "horizontal_runner_autoscaler": "example-runner-deployment-autoscaler", "enterprise": "", "organization": "prosperllc", "repository": ""}
2024-01-12T21:31:39Z DEBUG horizontalrunnerautoscaler Calculated desired replicas of 10 {"horizontalrunnerautoscaler": "actions-runner-systems/example-runner-deployment-autoscaler", "suggested": 10, "reserved": 0, "min": 10, "max": 20}
2024-01-12T21:32:19Z DEBUG runner Runner appears to have been registered and running. {"runner": "actions-runner-systems/github-action-np-h6z4z-9f5cp", "podCreationTimestamp": "2024-01-12 21:25:58 +0000 UTC"}
2024-01-12T21:32:19Z DEBUG controller-runtime.webhook.webhooks received request {"webhook": "/mutate-actions-summerwind-dev-v1alpha1-runner", "UID": "1e47c97c-345a-4b39-825f-67129cf2201d", "kind": "actions.summerwind.dev/v1alpha1, Kind=Runner", "resource": {"group":"actions.summerwind.dev","version":"v1alpha1","resource":"runners"}}
2024-01-12T21:32:19Z DEBUG controller-runtime.webhook.webhooks wrote response {"webhook": "/mutate-actions-summerwind-dev-v1alpha1-runner", "code": 200, "reason": "", "UID": "1e47c97c-345a-4b39-825f-67129cf2201d", "allowed": true}
2024-01-12T21:32:19Z DEBUG controller-runtime.webhook.webhooks received request {"webhook": "/mutate-actions-summerwind-dev-v1alpha1-runner", "UID": "89c6d115-f817-472c-b629-0489fd90e10e", "kind": "actions.summerwind.dev/v1alpha1, Kind=Runner", "resource": {"group":"actions.summerwind.dev","version":"v1alpha1","resource":"runners"}}
2024-01-12T21:32:19Z DEBUG controller-runtime.webhook.webhooks wrote response {"webhook": "/mutate-actions-summerwind-dev-v1alpha1-runner", "code": 200, "reason": "", "UID": "89c6d115-f817-472c-b629-0489fd90e10e", "allowed": true}
Runner Pod Logs
"https://pipelinesghubeus21.actions.githubusercontent.com/tMTkzAKYleoidiHAI9FjPaHPkEkp2s7TIoUW3BW1740YmeFlFo/",
"gitHubUrl": "https://github.com/prosperllc",
"workFolder": "/runner/_work"
2024-01-12 21:34:47.510 DEBUG --- Docker enabled runner detected and Docker daemon wait is enabled
2024-01-12 21:34:47.512 DEBUG --- Waiting until Docker is available or the timeout of 120 seconds is reached
Failed to initialize: unable to resolve docker endpoint: open /certs/client/ca.pem: no such file or directory
Failed to initialize: unable to resolve docker endpoint: open /certs/client/ca.pem: no such file or directory
Failed to initialize: unable to resolve docker endpoint: open /certs/client/ca.pem: no such file or directory
Failed to initialize: unable to resolve docker endpoint: open /certs/client/ca.pem: no such file or directory
Cannot connect to the Docker daemon at tcp://localhost:2376. Is the docker daemon running?
Hello! Thank you for filing an issue.
The maintainers will triage your issue shortly.
In the meantime, please take a look at the troubleshooting guide for bug reports.
If this is a feature request, please review our contribution guidelines.
This is the message we are seeing:
2024-01-12 22:19:43.778 DEBUG --- Docker enabled runner detected and Docker daemon wait is enabled
2024-01-12 22:19:43.780 DEBUG --- Waiting until Docker is available or the timeout of 120 seconds is reached
Failed to initialize: unable to resolve docker endpoint: open /certs/client/ca.pem: no such file or directory
(the line above repeats many times)
Cannot connect to the Docker daemon at tcp://localhost:2376. Is the docker daemon running?
Hi team, can anyone advise on this issue?
Thanks Sridhar
Hi team,
Any suggestions on the above?
Thanks Sridhar
Hi @nikola-jokic, can you please advise on the above ticket?
We just had a similar issue in one of our clusters today. We tracked the root cause to the latest v25.0.0 release of docker:dind and ended up setting the Helm value image.dindSidecarRepositoryAndTag to docker:24.0.7-dind, which solved the issue. Maybe give that a try.
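For anyone following along, pinning the sidecar via the controller chart would look something like this. This is only a sketch of a custom-values.yaml, assuming the chart's image.dindSidecarRepositoryAndTag value mentioned above:

```yaml
# custom-values.yaml for the actions-runner-controller chart
image:
  dindSidecarRepositoryAndTag: "docker:24.0.7-dind"
```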
@emilwangaa so do I need to upgrade my actions-runner-controller chart to 25.0.0 and update the Docker dind version?
@prosper-sre I don't think you need to update your ARC chart unless it doesn't support setting the dind version. The default setting for the ARC chart is to pull the latest version of dind, which is what caused issues for us.
Hello @emilwangaa, could you share how you changed image.dindSidecarRepositoryAndTag to docker:24.0.7-dind? I tried
helm upgrade --install -f custom-values.yaml --namespace actions-runner-system --create-namespace --wait actions-runner-controller actions-runner-controller/actions-runner-controller --set image.dindSidecarRepositoryAndTag=docker:24.0.7-dind --version ${CHART_VERSION}
but I still see the image dind:dind configured on the re-deployed runners.
We use Terraform to install the chart, but your method looks right. Which version of the chart are you using? And have you tried setting it in the custom-values.yaml file that you specify, instead of via --set?
I have done the upgrade, but I'm still seeing the same issues. It looks like one of the workflows is causing this problem; it's a monorepo, so something is wrong there and I need to figure it out.
I think it is purely a resource issue. Do we have any recommended specifications w.r.t. CPU and memory? (A sketch of where these settings go is below.)
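For illustration only (the numbers are placeholders I picked, not an official recommendation), resource requests/limits can be set on both the runner container and the dind sidecar in the RunnerDeployment template:

```yaml
# Illustrative values only - tune for your workloads
spec:
  template:
    spec:
      resources:
        requests:
          cpu: "1"
          memory: 2Gi
        limits:
          cpu: "2"
          memory: 4Gi
      dockerdContainerResources:
        requests:
          cpu: "1"
          memory: 2Gi
```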
Thanks for the response @emilwangaa. I can now see docker:24.0.7-dind, but the issue still persists. I'm using version 0.23.7.
@emilwangaa I also did a version update and am experiencing the same issue. Maybe it was fixed for you after updating to that version?