actions-runner-controller WAIT_FOR_DOCKER is not exiting after timeout

Checks

[X] I've already read https://github.com/actions-runner-controller/actions-runner-controller/blob/master/TROUBLESHOOTING.md and I'm sure my issue is not covered in the troubleshooting guide.

Controller Version

NA

Helm Chart Version

NA

CertManager Version

NA

Deployment Method

Helm

cert-manager installation

yes

Checks

[X] This isn't a question or user support case (For Q&A and community support, go to Discussions. It might also be a good idea to contract with any of the contributors and maintainers if your business is so critical that you need priority support.)
[X] I've read releasenotes before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes
[X] My actions-runner-controller version (v0.x.y) does support the feature
[X] I've already upgraded ARC (including the CRDs, see charts/actions-runner-controller/docs/UPGRADING.md for details) to the latest and it didn't fix the issue

Resource Definitions

NA

To Reproduce

see description

Describe the bug

Sometimes the Docker container (20.10.17-dind-alpine3.16@sha256:e25a101eb5ee4bc8772e862e908a33a133feb067a6d0d4a19cb7753d64596889) does not come up. The runner waits 2 minutes, but then continues and picks up a job even though the Docker container is still not available. We then see this

Cannot connect to the Docker daemon at tcp://localhost:2376. Is the docker daemon running?

in the GitHub workflow logs.

Could it be that an exit is missing in the runner's entrypoint script?

if [[ "${DISABLE_WAIT_FOR_DOCKER}" != "true" ]] && [[ "${DOCKER_ENABLED}" == "true" ]]; then
    log.debug 'Docker enabled runner detected and Docker daemon wait is enabled'
    log.debug 'Waiting until Docker is available or the timeout is reached'
    timeout 120s bash -c 'until docker ps ;do sleep 1; done'

https://github.com/actions-runner-controller/actions-runner-controller/blob/11cb9b78829f8640ceb3bcb677e5d608dc3299ea/runner/entrypoint.sh
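The behavior described above can be reproduced in isolation. This is a minimal sketch, assuming GNU coreutils `timeout` and a script that does not enable `set -e`. `false` stands in for a `docker ps` that never succeeds, the limit is shortened to 2s, and the variable names are illustrative only, not taken from entrypoint.sh.

```shell
#!/usr/bin/env bash
# Stand-in for the wait loop: `false` plays the role of a `docker ps`
# that never succeeds, and the limit is 2s instead of 120s.
timeout 2s bash -c 'until false; do sleep 1; done'
wait_status=$?

# GNU timeout reports a non-zero status (124) when the limit was reached,
# but nothing in the script acts on it.
echo "wait loop exit status: ${wait_status}"

# Execution simply continues past the failed wait.
reached_end=yes
echo "script continued: ${reached_end}"
```

Because the exit status of `timeout` is never checked, the script proceeds as if Docker were available.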

Describe the expected behavior

The runner container should not pick up a job when Docker has not started. Ideally, K8s would kill that pod.

Controller Logs

NA

Runner Pod Logs

Runner pod log: https://gist.github.com/erichorwath/26be5fb65eb98b42a6b3eb868a27c3e0
Workflow log: https://gist.github.com/erichorwath/6a3fd5a976dc75f34e8e40e853a6b4cf

Additional Context

No response

Sep 22 '22 10:09 erichorwath

@erichorwath Thanks for reporting! Good catch... Sounds like you're correct. Would you mind modifying it to timeout 120s bash -c 'until docker ps ;do sleep 1; done' || exit 1 and confirming whether it works?
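A rough sketch of how the `|| exit 1` variant behaves, under the same assumption of GNU `timeout`. The helper `wait_or_exit` and its parameters are hypothetical and exist only to make the behavior easy to try outside the runner image. In the real script the probe would be `docker ps` and the limit 120s.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of entrypoint.sh): waits for a probe command
# to succeed, and exits non-zero if the limit is reached first.
# $1 = time limit for timeout(1), $2 = probe command.
wait_or_exit() {
  local limit="$1" probe="$2"
  timeout "${limit}" bash -c "until ${probe}; do sleep 1; done" || exit 1
}

# Run inside a command substitution so that `exit 1` ends only that subshell.
# `false` simulates a Docker daemon that never becomes available.
output="$( wait_or_exit 2s false; echo 'runner would start here' )"
rc=$?
echo "rc=${rc} output='${output}'"
```

With the non-zero exit in place, the lines after the wait loop are never reached when the probe keeps failing. If the probe succeeds (for example `wait_or_exit 2s true`), the script continues as before.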

Sep 22 '22 11:09 mumoshu

Hey @mumoshu, @erichorwath; was there any confirmation that this update to the entrypoint.sh was a good solution?

Oct 13 '22 17:10 rxa313

I have not tested it yet. Would you mind giving it a shot if you are affected by the said issue? Thanks!

Oct 14 '22 02:10 mumoshu

We are also facing the same issue.

Oct 14 '22 10:10 GopikaV24

@GopikaV24 Thanks for reporting! Would you mind trying the proposed fix by building a custom runner image, and submitting a PR if it works?

Oct 14 '22 10:10 mumoshu

Sure..

Oct 14 '22 10:10 GopikaV24