Runner Container Spinning Up Faster than Docker Daemon can be Ready.
ARC version: v0.24.1
Chart version: 0.18.0
Note: Using RunnerSet configuration.
I'm noticing an issue with some runners after they restart post-execution. These runners are ephemeral, and sporadically (not every time a runner starts) the runner container is spun up before the Docker daemon is ready. When this happens and a job lands on that runner and tries to use Docker, it throws an "is docker daemon running" error in the GitHub Actions log. When I check the runner logs, the runner prints "is docker daemon running" a handful of times and then says it is listening for jobs, but the docker container is showing red. I'm assuming the error was never resolved in that case.
I tried some configuration to make the runner container wait for the daemon to be up before it starts, but I still see this issue from time to time.
What I'm using in my runner container:
containers:
- name: runner
  imagePullPolicy: IfNotPresent
  env:
  - name: NODE_EXTRA_CA_CERTS
    value: /usr/local/share/ca-certificates/<root.crt>
  - name: STARTUP_DELAY_IN_SECONDS
    value: "2"
  - name: DISABLE_WAIT_FOR_DOCKER
    value: "false"
Here's the start of the logs of the runner container of one of my runners:
2022-09-14 18:01:48.735 NOTICE --- Delaying startup by 2 seconds
2022-09-14 18:01:50.738 DEBUG --- Github endpoint URL https://github.com/
2022-09-14 18:01:51.344 DEBUG --- Passing --ephemeral to config.sh to enable the ephemeral runner.
2022-09-14 18:01:51.348 DEBUG --- Configuring the runner.
And after the runner is successfully configured (before the "connected to GitHub" check):
2022-09-14 18:11:09.189 DEBUG --- Docker enabled runner detected and Docker daemon wait is enabled
2022-09-14 18:11:09.191 DEBUG --- Waiting until Docker is available or the timeout is reached
Any additional recommendations to avoid the "is docker daemon running?" issue?
Thanks!
@rxa313 Hey! Does your dockerd take longer than 2 minutes to start? We recently discovered https://github.com/actions-runner-controller/actions-runner-controller/issues/1830, where the runner container does not fail when the Docker wait fails. We'll be updating it to fail in that case. If it does take longer than 2 minutes, how often does it happen for you? Do you need to tweak the Docker wait timeout? It's currently hard-coded to 2 minutes, so if you need that, we'd have to update the entrypoint to accept another environment variable for the timeout.
@mumoshu
My error happens so sporadically that I'd have to monitor the frequency to really tell. I looked at the issue you linked and it's the same thing I'm experiencing. I think a longer timeout might help in my case, maybe 10-30 seconds or so, just to give the daemon more time to spin up. So far I haven't heard any complaints from my users since I added those parameters, but I've been keeping an eye on it because we want optimal stability. I think the ability to customize the timeout would help mitigate this issue further.
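For illustration, the configurable wait discussed above could look roughly like the sketch below. This is not the actual ARC entrypoint code: the helper name `wait_for` and the environment variable `WAIT_FOR_DOCKER_TIMEOUT` are assumptions made up for this example, and the timeout is approximated by counting one-second sleeps rather than by reading a clock.

```shell
# Hypothetical helper (not taken from the ARC entrypoint): retry a command
# once per second until it succeeds or roughly N seconds have passed.
# Usage: wait_for <timeout_seconds> <command> [args...]
# Returns 0 as soon as the command succeeds, 1 if the timeout is reached.
wait_for() {
  wait_for_timeout="$1"
  shift
  wait_for_elapsed=0
  until "$@" >/dev/null 2>&1; do
    if [ "$wait_for_elapsed" -ge "$wait_for_timeout" ]; then
      return 1
    fi
    sleep 1
    wait_for_elapsed=$((wait_for_elapsed + 1))
  done
  return 0
}

# Example use in an entrypoint. WAIT_FOR_DOCKER_TIMEOUT is an assumed
# variable name; 120 mirrors the 2-minute default mentioned in this thread.
# if ! wait_for "${WAIT_FOR_DOCKER_TIMEOUT:-120}" docker info; then
#   echo "Docker daemon did not become ready in time" >&2
#   exit 1
# fi
```

Exiting non-zero on timeout would let the container fail and be restarted instead of registering a runner that cannot use Docker, which is the behavior change described for issue 1830.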