Docker daemon is not responding
Checks
- [X] I've already read https://github.com/actions/actions-runner-controller/blob/master/TROUBLESHOOTING.md and I'm sure my issue is not covered in the troubleshooting guide.
- [X] I'm not using a custom entrypoint in my runner image
Controller Version
0.27.4
Helm Chart Version
0.23.3
CertManager Version
1.10.0
Deployment Method
ArgoCD
cert-manager installation
Installed via Helm
Checks
- [X] This isn't a question or user support case (For Q&A and community support, go to Discussions. It might also be a good idea to contract with any of the contributors and maintainers if your business is critical and you therefore need priority support)
- [X] I've read the release notes before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes
- [X] My actions-runner-controller version (v0.x.y) does support the feature
- [X] I've already upgraded ARC (including the CRDs, see charts/actions-runner-controller/docs/UPGRADING.md for details) to the latest and it didn't fix the issue
- [X] I've migrated to the workflow job webhook event (if you're using webhook-driven scaling)
Resource Definitions
```yaml
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: runners
  namespace: actions-runner-system
spec:
  template:
    metadata:
      annotations:
        cluster-autoscaler.kubernetes.io/safe-to-evict: "true"
    spec:
      dockerMTU: 1400
      image: ghcr.io/actions-runner-controller/actions-runner-controller/actions-runner:v2.304.0-ubuntu-20.04
      imagePullPolicy: "Always"
      ephemeral: true
      organization: "<hidden>"
      labels: [<hidden>]
```
To Reproduce
1. Run a job with Docker interaction
2. Randomly get
Error response from daemon: error creating temporary lease: connection error: desc = "transport: Error while dialing dial unix:///var/run/docker/containerd/containerd.sock: timeout": unavailable
or
Cannot connect to the Docker daemon at unix:///run/docker.sock. Is the docker daemon running?
or some other error indicating that the Docker daemon is not responding
Describe the bug
Docker daemon is not responding
Describe the expected behavior
Docker daemon is responding
Whole Controller Logs
https://gist.github.com/Tarasovych/a888e8c7e1edebc26ca1c547a778a860
Whole Runner Pod Logs
https://gist.github.com/Tarasovych/759f79483e9295ff0ad3fd020bbe8459
Additional Context
Similar issue: https://gitlab.com/gitlab-org/gitlab-runner/-/issues/2408#note_383326688
Disk space is sufficient for the runner, provisioned with 3000 IOPS and 200 throughput.
The /run directory also has enough disk space (~100 KB free at peak load).
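As a stopgap while the root cause is investigated, retrying docker commands at the shell level can paper over the transient failures. This is only an illustrative sketch; the `retry_docker` helper is hypothetical and not part of ARC or the runner image:

```shell
#!/bin/sh
# retry_docker: run a command, retrying with a fixed delay if it fails,
# e.g. because the Docker daemon is not (yet) reachable.
# Hypothetical helper, not part of ARC.
retry_docker() {
  attempts=${RETRY_ATTEMPTS:-5}
  delay=${RETRY_DELAY:-2}
  i=1
  while [ "$i" -le "$attempts" ]; do
    "$@" && return 0                      # success: stop retrying
    echo "attempt $i/$attempts failed; retrying in ${delay}s" >&2
    sleep "$delay"
    i=$((i + 1))
  done
  return 1                                # all attempts exhausted
}
```

Usage in a workflow step would look like `retry_docker docker pull public.ecr.aws/aws-cli/aws-cli:latest`.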
Hi @Tarasovych, I don't want to create a new issue when this one is so similar to mine, but I'm afraid it could be abandoned.
I'm facing the same issue. I'm using summerwind/actions-runner-dind:v2.308.0-ubuntu-22.04 (although I have tried several other images) as the basis for a custom runner image, and I noticed that docker commands stopped working (previously we were using VM-based runners).
So I'm testing it locally:
```shell
docker run -d -t summerwind/actions-runner-dind:v2.308.0-ubuntu-22.04 tail -f /dev/null
docker exec -it {containerID} /bin/bash
```
Once inside, docker ps always returns:
Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
Any advice would be greatly appreciated; I might be a bit lost on how to use docker commands in workflows on k8s self-hosted runners.
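Note that overriding the image's command with `tail -f /dev/null` may prevent the image's default startup from ever launching dockerd, so in that local test the socket will never appear. For the in-cluster case, where the daemon is merely slow to start, waiting for the socket before the first docker call can help. A sketch, assuming dockerd eventually creates `/var/run/docker.sock` (the `wait_for_socket` helper is illustrative, not part of the image):

```shell
#!/bin/sh
# wait_for_socket: block until a unix socket (e.g. /var/run/docker.sock)
# exists, or give up after a timeout in seconds. Illustrative helper.
wait_for_socket() {
  path=$1
  timeout=${2:-30}
  elapsed=0
  while [ ! -S "$path" ]; do              # -S: path exists and is a socket
    [ "$elapsed" -ge "$timeout" ] && return 1
    sleep 1
    elapsed=$((elapsed + 1))
  done
  return 0
}
```

For example, `wait_for_socket /var/run/docker.sock 60 && docker ps` only runs `docker ps` once the daemon's socket is present.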
This is randomly happening to us as well, on action image pulls or in any action using docker:
/usr/local/bin/docker pull public.ecr.aws/aws-cli/aws-cli:latest
Error response from daemon: error creating temporary lease: connection error: desc = "transport: Error while dialing dial unix:///var/run/docker/containerd/containerd.sock: timeout": unavailable
Warning: Docker pull failed with exit code 1, back off 8.208 seconds before retry.
/usr/local/bin/docker pull public.ecr.aws/aws-cli/aws-cli:latest
Error response from daemon: error creating temporary lease: connection error: desc = "transport: Error while dialing dial unix:///var/run/docker/containerd/containerd.sock: timeout": unavailable
Warning: Docker pull failed with exit code 1, back off 3.288 seconds before retry.
/usr/local/bin/docker pull public.ecr.aws/aws-cli/aws-cli:latest
Error response from daemon: error creating temporary lease: connection error: desc = "transport: Error while dialing dial unix:///var/run/docker/containerd/containerd.sock: timeout": unavailable
Experiencing it as well here, but I noticed further up in our over-an-hour-long Unity build that docker seems to die:
2023-10-25T19:54:37.5040463Z [BUSY 3149s] Link_Linux_x64_Clang Library/Bee/artifacts/LinuxPlayerBuildProgram/2ltnt/linker-output/GameAssembly.so
2023-10-25T19:54:47.6554035Z time="2023-10-25T19:54:47Z" level=error msg="error waiting for container: unexpected EOF"
2023-10-25T19:54:47.7414936Z ##[error]The process '/usr/local/bin/docker' failed with exit code 125
... maybe it's running out of space or memory, or timing out? The corresponding build step started at 2023-10-25T18:37:49.8765914Z, so about 1 hour 20 minutes in. I have the builder pods provisioned with 8 CPU and 32 GB RAM just for sanity's sake, and the nodes have 200 GB of disk.
I'm reasonably certain this can happen at any time with the current version of the ARC Helm chart. There is no liveness check on the dind container, so any action starting in the runner container can attempt to contact the Docker daemon before it has come up.
I believe running dind as a sidecar (an init container with a restart policy) plus a configured liveness probe should fix the initial reachability issue. Configuring a restart policy seems essential to recovering from any reachability failure thereafter, although resource constraints may also need to be addressed if the daemon is being killed.
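A minimal sketch of that approach, using the Kubernetes native-sidecar pattern (an init container with `restartPolicy: Always`, available on Kubernetes 1.29+) and a liveness probe that runs `docker info`. Container names, images, and probe timings here are illustrative assumptions, not the ARC chart schema:

```yaml
# Hypothetical pod-template fragment; not generated by the ARC Helm chart.
spec:
  initContainers:
    - name: dind
      image: docker:24-dind
      restartPolicy: Always         # native sidecar: restarted if dockerd dies
      securityContext:
        privileged: true
      livenessProbe:
        exec:
          command: ["docker", "info"]
        initialDelaySeconds: 10
        periodSeconds: 30
        timeoutSeconds: 10
      volumeMounts:
        - name: docker-sock
          mountPath: /run
  containers:
    - name: runner
      image: summerwind/actions-runner:latest
      volumeMounts:
        - name: docker-sock       # share the daemon's socket with the runner
          mountPath: /run
  volumes:
    - name: docker-sock
      emptyDir: {}
```

With a native sidecar, the runner container also does not start until the dind init container has started, which narrows the window where early docker calls can fail.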