add option to exit runner if docker isn't available
This PR adds an option, RUNNER_WAIT_FOR_DOCKER_EXIT_ON_FAILURE, that makes the runner exit when Docker is not available. The runner can already wait a set amount of time for Docker to become available (RUNNER_WAIT_FOR_DOCKER_IN_SECONDS), but if Docker is still not ready after that time the runner simply ignores it and trucks along. For my use-case that is not the desired behavior, and I'd rather the runner exit with an error instead.
I'd argue that a runner starting without Docker is faulty given the many GitHub Actions features depending on it (container jobs, Docker Container Actions, service containers, ...), or at the very least that there should be an option to prevent it from starting in such a state. The runner already has a similar mechanism for sudo, so it's not a stretch to do the same here.
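To make the intended behavior concrete, here is a minimal sketch of the wait-then-exit logic in Python. It is not the code from this PR: the `docker info` probe, the one-second poll interval, the `"true"` value for the flag, and the `0` default for the timeout are all assumptions made for illustration. Only the two environment variable names come from the description above.

```python
import os
import subprocess
import sys
import time


def docker_available() -> bool:
    """Assumed readiness probe: True if `docker info` exits with status 0."""
    try:
        result = subprocess.run(
            ["docker", "info"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=10,
        )
    except (OSError, subprocess.TimeoutExpired):
        # docker binary missing, not executable, or the probe hung
        return False
    return result.returncode == 0


def wait_for_docker(timeout_seconds, exit_on_failure,
                    probe=docker_available, sleep=time.sleep):
    """Poll `probe` roughly once per second for up to `timeout_seconds`.

    Returns True as soon as Docker is available. On timeout it either
    returns False (the existing "continue without Docker" behavior) or
    exits with status 1 when `exit_on_failure` is set.
    """
    for _ in range(timeout_seconds):
        if probe():
            return True
        sleep(1)
    # One last check after the final sleep (also covers a timeout of 0).
    if probe():
        return True
    if exit_on_failure:
        print("Docker is not available, exiting", file=sys.stderr)
        sys.exit(1)
    print("Docker is not available, continuing without it", file=sys.stderr)
    return False


def main():
    # Hypothetical parsing: the accepted values and defaults are assumptions.
    timeout = int(os.environ.get("RUNNER_WAIT_FOR_DOCKER_IN_SECONDS", "0"))
    flag = os.environ.get("RUNNER_WAIT_FOR_DOCKER_EXIT_ON_FAILURE", "")
    wait_for_docker(timeout, flag.lower() == "true")
    # ... the runner would be started here ...
```

With the flag unset, `main()` behaves like the runner does today. With it set, a non-zero exit lets the surrounding supervisor (Kubernetes, in my case) restart the pod.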
My use-case - Actions Runner Controller in AKS
While running ARC in an AKS cluster I've noticed intermittent issues with starting the docker:dind sidecar container on new nodes during the first few minutes of a node's lifecycle. The issue resolves itself after a couple of minutes, but not before the initial set of runners has started without Docker, resulting in crashes for workflows depending on it. I'd rather have the runner exit with an error, which in the Kubernetes world would mean a retry of the pod, which (eventually) resolves the issue. This is the timeline of events as of now:
- New node starting up
- New runner starting on the new node
- Error starting docker:dind. It is not retried
- Runner waits for Docker for RUNNER_WAIT_FOR_DOCKER_IN_SECONDS seconds, but when the timer runs out it continues without it
- Workflows depending on Docker start crashing (container job, Docker actions, ...)
Notably, I've also tried bumping RUNNER_WAIT_FOR_DOCKER_IN_SECONDS to a higher number, but the creation of the docker:dind container is not retried automatically, meaning that once a runner has encountered this error it will eventually start without Docker available. It might be possible to configure AKS to do the retry, but in either case I believe it should be a supported use-case to simply kill the runner if it's faulty.