Intermittently getting "Cannot connect to the Docker daemon at unix:///var/run/docker.sock"
Checks
- [X] I've already read https://github.com/actions/actions-runner-controller/blob/master/TROUBLESHOOTING.md and I'm sure my issue is not covered in the troubleshooting guide.
- [X] I'm not using a custom entrypoint in my runner image
Controller Version
0.9.3
Helm Chart Version
0.9.3
CertManager Version
1.16.1
Deployment Method
ArgoCD
cert-manager installation
cert-manager is working
Checks
- [X] This isn't a question or user support case (for Q&A and community support, go to Discussions; it might also be a good idea to contact contributors and maintainers directly if your business is critical enough to need priority support).
- [X] I've read the release notes before submitting this issue and I'm sure it's not due to any recently introduced backward-incompatible changes
- [X] My actions-runner-controller version (v0.x.y) does support the feature
- [X] I've already upgraded ARC (including the CRDs, see charts/actions-runner-controller/docs/UPGRADING.md for details) to the latest and it didn't fix the issue
- [X] I've migrated to the workflow job webhook event (if you are using webhook-driven scaling)
Resource Definitions
values.yaml for our ARC runner set Helm installation:
githubConfigUrl: https://github.com/<org>
controllerServiceAccount:
  namespace: arc
  name: arc-gha-rs-controller
githubConfigSecret: arc-runner-set
maxRunners: 2
minRunners: 1
runnerGroup: "default"
runnerScaleSetName: "custom"
containerMode:
  type: dind
template:
  spec:
    hostNetwork: true
    containers:
      - name: runner
        image: some.azurecr.io/custom-actions-runner:latest
        command: ["/home/runner/run.sh"]
    imagePullSecrets:
      - name: acr-connectivity-pull
image:
  actionsRunnerImagePullSecrets:
    - name: acr-connectivity-pull
To Reproduce
Run any action that uses a docker command. The error does not happen every time; I'd say it occurs about 1 in 10 runs, and rerunning the job usually succeeds.
Describe the bug
Running an action that includes a docker command, such as:
docker build . --file Dockerfile --tag $env:FullImageName --secret id=npm_token,env=NPM_TOKEN --build-arg NODE_ENV=production
intermittently results in an error:
ERROR: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
NativeCommandExitException: /home/runner/_work/_temp/52c5c530-065c-45b1-b663-3abe54de30f1.ps1:5
Line |
   5 |  docker build . --file Dockerfile --tag $env:FullImageName --secret id …
     |  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
     | Program "docker" ended with non-zero exit code: 1.
Describe the expected behavior
Being able to connect to unix:///var/run/docker.sock on every run.
Whole Controller Logs
https://gist.github.com/AurimasNav/398f849114ad71860eb0a0fcf465d691
Whole Runner Pod Logs
https://gist.github.com/AurimasNav/0660c09ba17d845591169ddf230dce48
Additional Context
In the dind container log I can see:
failed to start daemon: Error initializing network controller: error obtaining controller instance: failed to register "bridge" driver: failed to create NAT chain DOCKER: iptables failed: iptables --wait -t nat -N DOCKER: iptables: Chain already exists. (exit status 1)
I'm not sure why that happens or how it can be solved. Might it have something to do with this part of my values.yaml config?
template:
  spec:
    hostNetwork: true
(If I don't specify this, the containers in my actions have no internet access.)
@AurimasNav When you tried this without hostNetwork: true, was it in an environment with a service mesh sidecar injection like istio?
I ran into a similar issue with hostNetwork: true when 2 dind runners would come up on the same node at the same time.
One workflow would fail with
ERROR: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
and the dind container logs would have
failed to start daemon: Error initializing network controller: error obtaining controller instance: failed to register "bridge" driver: failed to create NAT chain DOCKER: iptables failed: iptables --wait -t nat -N DOCKER: iptables: Resource temporarily unavailable.
I think this is because both runners were trying to use iptables at the same time for their network configuration. I suspect using hostNetwork: true may result in resource contention on the node.
Anyway, I was also using hostNetwork: true because the containers didn't have internet access without it, which was actually caused by istio sidecar injection. Runners with hostNetwork: true did not receive istio sidecars, while others did. Any runner with an istio sidecar did not have internet access in containers, and removing the sidecars fixed the "no internet access without hostNetwork" issue.
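If that contention hypothesis holds, one possible mitigation (a sketch, not something verified in this thread) is pod anti-affinity in the runner template, so that two dind runner pods never land on the same node. The app.kubernetes.io/component: runner label below is hypothetical; substitute a label your runner pods actually carry (kubectl get pods --show-labels will list them):
template:
  spec:
    hostNetwork: true
    affinity:
      podAntiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app.kubernetes.io/component: runner  # hypothetical label; match your pods' real labels
            topologyKey: kubernetes.io/hostname
Note that on a single-node cluster a required anti-affinity rule leaves the second pod Pending; preferredDuringSchedulingIgnoredDuringExecution is the softer variant.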
There is no service mesh or any kind of sidecar injection; it is a k3s install on a single-node server. But I guess it could still be a problem with two runners: even though I reduced maxRunners to 1, I have another runner scale set instance for a different GitHub org running on the same k3s.
Whenever a pipeline runs, two pods are created in our arc-runner-set (even though maxRunners is set to 1).
Almost every time, the dind container fails in one of the two:
NAME READY STATUS RESTARTS AGE
comp-9wssh-runner-kftrj 2/2 Running 0 12s
comp-9wssh-runner-k8h45 1/2 Error 0 12s
with the error:
failed to start daemon: Error initializing network controller: error obtaining controller instance: failed to register "bridge" driver: failed to create NAT chain DOCKER: iptables failed: iptables --wait -t nat -N DOCKER: iptables: Resource temporarily unavailable. (exit status 4)
If we are lucky, the job runs on the "healthy" runner and everything is fine, but it seems to be 50/50 which one is selected; if we end up on the failed one, the job fails with:
Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running? Error: Docker pull failed with exit code 1
From searching the internet, this appears to be a concurrency problem with iptables configuration.
I added a restartPolicy to values.yaml:
template:
  spec:
    hostNetwork: true
    restartPolicy: OnFailure
So far it seems to restart the failed pod and the jobs are no longer failing, but I wonder whether there is some downside, given that by default it was set to never restart.
We see this too with 0.9.0 and 0.10.1, on self-hosted runners. It comes in waves: most jobs fail, then all is well for a day or so. We are not using hostNetwork: true.
For what it's worth, this completely solved the problem for me.
If you are already customizing the template of your docker-in-docker runner, you can move the dind container from a regular container to an init container and set a restart policy on it so that it behaves as a native sidecar container (see the sketch below).
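A minimal sketch of that layout, assuming Kubernetes 1.28 or newer, where restartPolicy: Always on an init container turns it into a native sidecar that the kubelet restarts on failure regardless of the pod-level restartPolicy. The image, args, volume names, and mount paths below are illustrative assumptions; match them to whatever your current dind container actually uses:
template:
  spec:
    initContainers:
      - name: dind
        image: docker:dind                 # illustrative; use the dind image your template already runs
        args: ["dockerd", "--host=unix:///var/run/docker.sock"]
        restartPolicy: Always              # init container + Always = native sidecar (Kubernetes >= 1.28)
        securityContext:
          privileged: true
        volumeMounts:
          - name: work
            mountPath: /home/runner/_work
          - name: dind-sock
            mountPath: /var/run
    containers:
      - name: runner
        image: ghcr.io/actions/actions-runner:latest  # illustrative; keep your own runner image
        command: ["/home/runner/run.sh"]
        env:
          - name: DOCKER_HOST
            value: unix:///var/run/docker.sock
        volumeMounts:
          - name: work
            mountPath: /home/runner/_work
          - name: dind-sock
            mountPath: /var/run
    volumes:
      - name: work
        emptyDir: {}
      - name: dind-sock
        emptyDir: {}
Because the kubelet restarts the sidecar on its own, the runner container can keep the default restartPolicy: Never semantics instead of switching the whole pod to OnFailure.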
Same issue here, but in my case I have never been able to successfully run a job that requires a container on any of the runners.
/usr/bin/docker build -t 62aa2a:ead5eee1324c4de4b6197fa0eb7dae25 -f "/home/runner/_work/_actions/.../.../v3.1/Dockerfile" "/home/runner/_work/_actions/.../.../v3.1"
ERROR: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
Warning: Docker build failed with exit code 1, back off 9.72 seconds before retry.
/usr/bin/docker build -t 62aa2a:ead5eee1324c4de4b6197fa0eb7dae25 -f "/home/runner/_work/_actions/.../.../v3.1/Dockerfile" "/home/runner/_work/_actions/.../.../v3.1"
ERROR: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
Warning: Docker build failed with exit code 1, back off 3.371 seconds before retry.
/usr/bin/docker build -t 62aa2a:ead5eee1324c4de4b6197fa0eb7dae25 -f "/home/runner/_work/_actions/.../.../v3.1/Dockerfile" "/home/runner/_work/_actions/.../.../v3.1"
ERROR: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
Error: Docker build failed with exit code 1
I'm using AWS EKS; I tried upgrading to the latest version, same issue.