
Docker daemon is not responding

Open Tarasovych opened this issue 2 years ago • 4 comments

Checks

  • [X] I've already read https://github.com/actions/actions-runner-controller/blob/master/TROUBLESHOOTING.md and I'm sure my issue is not covered in the troubleshooting guide.
  • [X] I'm not using a custom entrypoint in my runner image

Controller Version

0.27.4

Helm Chart Version

0.23.3

CertManager Version

1.10.0

Deployment Method

ArgoCD

cert-manager installation

Installed via Helm

Checks

  • [X] This isn't a question or user support case (for Q&A and community support, go to Discussions; it might also be a good idea to contract with one of the contributors or maintainers if your business is critical enough to need priority support).
  • [X] I've read the release notes before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes
  • [X] My actions-runner-controller version (v0.x.y) does support the feature
  • [X] I've already upgraded ARC (including the CRDs, see charts/actions-runner-controller/docs/UPGRADING.md for details) to the latest and it didn't fix the issue
  • [X] I've migrated to the workflow job webhook event (if you are using webhook-driven scaling)

Resource Definitions

apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: runners
  namespace: actions-runner-system
spec:
  template:
    metadata:
      annotations:
        cluster-autoscaler.kubernetes.io/safe-to-evict: "true"
    spec:
      dockerMTU: 1400
      image: ghcr.io/actions-runner-controller/actions-runner-controller/actions-runner:v2.304.0-ubuntu-20.04
      imagePullPolicy: "Always"
      ephemeral: true
      organization: "<hidden>"
      labels: [<hidden>]

To Reproduce

1. Run a job with Docker interaction
2. Randomly get

Error response from daemon: error creating temporary lease: connection error: desc = "transport: Error while dialing dial unix:///var/run/docker/containerd/containerd.sock: timeout": unavailable

or

Cannot connect to the Docker daemon at unix:///run/docker.sock. Is the docker daemon running?

or another error indicating that the Docker daemon is not responding.

Describe the bug

Docker daemon is not responding

Describe the expected behavior

Docker daemon is responding

Whole Controller Logs

https://gist.github.com/Tarasovych/a888e8c7e1edebc26ca1c547a778a860

Whole Runner Pod Logs

https://gist.github.com/Tarasovych/759f79483e9295ff0ad3fd020bbe8459

Additional Context

Similar issue: https://gitlab.com/gitlab-org/gitlab-runner/-/issues/2408#note_383326688

Disk space is sufficient for the runner, with 3000 IOPS and 200 throughput. The /run directory also has enough space (~100 KB free at peak load).
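A possible stopgap while the root cause is unclear is to poll the daemon before any Docker-dependent step runs. A minimal sketch of such a workflow step (the step name and the 60-second budget are placeholders, not taken from our workflows):

# Hypothetical workflow step: wait until dockerd answers before any docker-dependent steps.
- name: Wait for Docker daemon
  shell: bash
  run: |
    for i in $(seq 1 60); do            # 60-second budget is an arbitrary choice
      if docker info > /dev/null 2>&1; then
        echo "Docker daemon is up"
        exit 0
      fi
      echo "Still waiting for the Docker daemon ($i/60)..."
      sleep 1
    done
    echo "Docker daemon did not become ready in time" >&2
    exit 1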

Tarasovych avatar Jun 22 '23 06:06 Tarasovych

Hi @Tarasovych, I don't want to create a new issue when this one is so similar to mine, but I'm afraid it could be abandoned.

I'm facing the same issue. I'm using summerwind/actions-runner-dind:v2.308.0-ubuntu-22.04 (although I have tried several other images) as the basis for a custom image for my runners, and I noticed that docker commands stopped working (previously we were using VM-based runners).

So I'm testing it locally:

docker run -d -t summerwind/actions-runner-dind:v2.308.0-ubuntu-22.04 tail -f /dev/null
docker exec -it {containerID} /bin/bash

Once inside, docker ps always returns:

Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?

Any advice would be greatly appreciated; I might be a bit lost on how to use docker commands in workflows inside k8s self-hosted runners.
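One note: dockerd inside a container generally needs --privileged, so the plain docker run above would not get a working daemon even if the image itself is fine. Inside the cluster, the summerwind runner spec exposes a dockerdWithinRunnerContainer flag intended for dind-style images; a rough sketch of how that might look (the name is a placeholder, and the field's behaviour should be checked against your ARC version):

apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: dind-runners              # placeholder name
spec:
  template:
    spec:
      # Run dockerd inside the runner container itself (dind-style image)
      # instead of the default separate docker sidecar container.
      dockerdWithinRunnerContainer: true
      image: summerwind/actions-runner-dind:v2.308.0-ubuntu-22.04
      organization: "<hidden>"    # placeholder, as in the issue body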

robgutsopedra avatar Aug 31 '23 14:08 robgutsopedra

This is randomly happening to us as well, on action image pulls or in any action using docker:

/usr/local/bin/docker pull public.ecr.aws/aws-cli/aws-cli:latest
  Error response from daemon: error creating temporary lease: connection error: desc = "transport: Error while dialing dial unix:///var/run/docker/containerd/containerd.sock: timeout": unavailable
  Warning: Docker pull failed with exit code 1, back off 8.208 seconds before retry.
  /usr/local/bin/docker pull public.ecr.aws/aws-cli/aws-cli:latest
  Error response from daemon: error creating temporary lease: connection error: desc = "transport: Error while dialing dial unix:///var/run/docker/containerd/containerd.sock: timeout": unavailable
  Warning: Docker pull failed with exit code 1, back off 3.288 seconds before retry.
  /usr/local/bin/docker pull public.ecr.aws/aws-cli/aws-cli:latest
  Error response from daemon: error creating temporary lease: connection error: desc = "transport: Error while dialing dial unix:///var/run/docker/containerd/containerd.sock: timeout": unavailable

gaspo53 avatar Sep 18 '23 12:09 gaspo53

Experiencing it as well here, but I noticed further up in our over-an-hour-long Unity build that docker seems to die:

2023-10-25T19:54:37.5040463Z [BUSY    3149s] Link_Linux_x64_Clang Library/Bee/artifacts/LinuxPlayerBuildProgram/2ltnt/linker-output/GameAssembly.so
2023-10-25T19:54:47.6554035Z time="2023-10-25T19:54:47Z" level=error msg="error waiting for container: unexpected EOF"
2023-10-25T19:54:47.7414936Z ##[error]The process '/usr/local/bin/docker' failed with exit code 125

... maybe it's running out of space or memory, or timing out? The corresponding build step started at 2023-10-25T18:37:49.8765914Z, so the failure came about 1 hr 20 minutes in. I have the builder pods provisioned at 8 CPU / 32 GB RAM just for sanity's sake, and the nodes have 200 GB of space on them.
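If memory pressure on the docker sidecar turns out to be the problem, one thing to try is giving that container explicit requests and limits; the legacy runner spec exposes this as dockerdContainerResources. A sketch with placeholder values (not sized recommendations):

# Fragment of a RunnerDeployment template spec; the numbers are placeholders.
spec:
  template:
    spec:
      dockerdContainerResources:
        requests:
          cpu: "2"
          memory: 4Gi
        limits:
          cpu: "4"
          memory: 16Gi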

gotchipete avatar Oct 25 '23 21:10 gotchipete

I'm reasonably certain this can happen at any time with the current version of the ARC Helm chart. There is no liveness check on the dind container, so any action starting in the runner container can attempt to contact the docker daemon before it has finished starting.

I believe running dind as a sidecar (init container + restart policy) with a configured liveness probe should fix the initial docker-daemon reachability issue. Some form of restart policy seems essential to fixing reachability issues after startup as well, although resource constraints may also need to be considered if the daemon is dying.
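A rough sketch of what that could look like in a pod template, assuming a cluster with native sidecar support (restartPolicy: Always on an init container) and a stock docker:dind image; the container name, probe timings, and the shared-socket volume are all illustrative rather than taken from the ARC chart:

# Illustrative pod-template fragment, not the chart's actual wiring.
initContainers:
  - name: dind
    image: docker:24-dind             # assumed image/tag
    securityContext:
      privileged: true                # dockerd needs this
    restartPolicy: Always             # native sidecar: kubelet restarts it if dockerd dies
    startupProbe:
      exec:
        # Assumes the image's default entrypoint exposes the standard unix socket.
        command: ["sh", "-c", "docker info > /dev/null 2>&1"]
      periodSeconds: 1
      failureThreshold: 120
    livenessProbe:
      exec:
        command: ["sh", "-c", "docker info > /dev/null 2>&1"]
      periodSeconds: 10
      failureThreshold: 3
    volumeMounts:
      - name: var-run                 # shared emptyDir, also mounted by the runner container
        mountPath: /var/run
volumes:
  - name: var-run
    emptyDir: {}

The runner container would mount the same var-run volume (or set DOCKER_HOST accordingly) so the CLI only starts talking to a daemon that has passed the startup probe.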

ohookins avatar Mar 26 '24 04:03 ohookins