actions-runner-controller
Support DinD in user-specified "container:" in jobs
What would you like added?
I'm able to run ARC with Docker fine using containerMode: dind.
But I'd also like to enable my users to specify container: in Actions workflow jobs, and start Docker builds etc. from there.
My understanding is that specifying container: on an Actions workflow job will cause the actions-runner container in the pod to spin up another Docker container. But in there, there's no socket:
docker: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
Why is this needed?
Users want to run their own containers, and performing Docker builds etc. is a common CI/CD task.
Today, DevOps has to be a central bottleneck, configuring runner sets with these specific containers and making sure they work with containerMode: dind.
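For illustration, a minimal workflow of the kind described above might look like this (a sketch only; the runner set name, job container image, and build context are made up):

jobs:
  build:
    runs-on: my-arc-runner-set       # hypothetical ARC runner scale set
    container: my-build-image:latest # user-specified container, assumed to contain the docker CLI
    steps:
      - uses: actions/checkout@v4
      # This is the step that currently fails with "Cannot connect to the
      # Docker daemon at unix:///var/run/docker.sock" because no socket is
      # available inside the job container.
      - run: docker build -t example/app:latest .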
Hello! Thank you for filing an issue.
The maintainers will triage your issue shortly.
In the meantime, please take a look at the troubleshooting guide for bug reports.
If this is a feature request, please review our contribution guidelines.
Hey @zxti,
Please correct me if I misunderstood, but you can provide a volume in your workflow which mounts the Docker socket into your container.
I believe the runner hardcodes a specific path to the Docker socket when running a container job. As far as I can tell they don't let us easily override this without writing our own container hook, which is very heavy.
Is the alternative to change every one of our workflow files to manually mount the docker.sock into the container (by default the dind container mode seems to mount it from the host into /run/docker/docker.sock in the runner container instead of /var/run/docker.sock)? For example:
container:
  image: <image>
  options: -v "/run/docker/docker.sock:/var/run/docker.sock"
To me that seems like quite the leakage of implementation details to the consumer of the scale sets. Perhaps I'm misunderstanding the suggestion?
EDIT: To be clear, I am able to launch the job with a Docker container just fine; what I'm not able to do is run a docker build within that container job, because it can't find the Docker socket.
Also interested in this, as I'm facing a similar issue. I use a shared GitHub Action in my workflow:
- name: Run Trivy vulnerability scanner with exclusions
  uses: aquasecurity/[email protected]
  with:
    image-ref: ${{ inputs.docker_image }}
    format: 'table'
    exit-code: '1'
    ignore-unfixed: true
    trivyignores: ${{ inputs.trivy_ignore_path }}
    severity: 'CRITICAL,HIGH'
    timeout: 15m
which executes the following command
/usr/bin/docker run --name b1cbc5785fc65dd52a4c82a2774efc8b669fef_1185fa --label b1cbc5 --workdir /github/workspace --rm -e "GOOS" -e "GOARCH" -e "GOPRIVATE" -e "CGO_ENABLED" -e "INPUT_IMAGE-REF" -e "INPUT_FORMAT" -e "INPUT_EXIT-CODE" -e "INPUT_IGNORE-UNFIXED" -e "INPUT_SEVERITY" -e "INPUT_TIMEOUT" -e "INPUT_SCAN-TYPE" -e "INPUT_INPUT" -e "INPUT_SCAN-REF" -e "INPUT_VULN-TYPE" -e "INPUT_TEMPLATE" -e "INPUT_OUTPUT" -e "INPUT_SKIP-DIRS" -e "INPUT_SKIP-FILES" -e "INPUT_CACHE-DIR" -e "INPUT_IGNORE-POLICY" -e "INPUT_HIDE-PROGRESS" -e "INPUT_LIST-ALL-PKGS" -e "INPUT_SECURITY-CHECKS" -e "INPUT_TRIVYIGNORES" -e "INPUT_ARTIFACT-TYPE" -e "INPUT_GITHUB-PAT" -e "INPUT_TRIVY-CONFIG" -e "HOME" -e "GITHUB_JOB" -e "GITHUB_REF" -e "GITHUB_SHA" -e "GITHUB_REPOSITORY" -e "GITHUB_REPOSITORY_OWNER" -e "GITHUB_REPOSITORY_OWNER_ID" -e "GITHUB_RUN_ID" -e "GITHUB_RUN_NUMBER" -e "GITHUB_RETENTION_DAYS" -e "GITHUB_RUN_ATTEMPT" -e "GITHUB_REPOSITORY_ID" -e "GITHUB_ACTOR_ID" -e "GITHUB_ACTOR" -e "GITHUB_TRIGGERING_ACTOR" -e "GITHUB_WORKFLOW" -e "GITHUB_HEAD_REF" -e "GITHUB_BASE_REF" -e "GITHUB_EVENT_NAME" -e "GITHUB_SERVER_URL" -e "GITHUB_API_URL" -e "GITHUB_GRAPHQL_URL" -e "GITHUB_REF_NAME" -e "GITHUB_REF_PROTECTED" -e "GITHUB_REF_TYPE" -e "GITHUB_WORKFLOW_REF" -e "GITHUB_WORKFLOW_SHA" -e "GITHUB_WORKSPACE" -e "GITHUB_ACTION" -e "GITHUB_EVENT_PATH" -e "GITHUB_ACTION_REPOSITORY" -e "GITHUB_ACTION_REF" -e "GITHUB_PATH" -e "GITHUB_ENV" -e "GITHUB_STEP_SUMMARY" -e "GITHUB_STATE" -e "GITHUB_OUTPUT" -e "GITHUB_ACTION_PATH" -e "RUNNER_OS" -e "RUNNER_ARCH" -e "RUNNER_NAME" -e "RUNNER_ENVIRONMENT" -e "RUNNER_TOOL_CACHE" -e "RUNNER_TEMP" -e "RUNNER_WORKSPACE" -e "ACTIONS_RUNTIME_URL" -e "ACTIONS_RUNTIME_TOKEN" -e "ACTIONS_CACHE_URL" -e GITHUB_ACTIONS=true -e CI=true -v "/var/run/docker.sock":"/var/run/docker.sock" -v "/home/runner/_work/_temp/_github_home":"/github/home" -v "/home/runner/_work/_temp/_github_workflow":"/github/workflow" -v "/home/runner/_work/_temp/_runner_file_commands":"/github/file_commands" -v "/home/runner/_work/example/example":"/github/workspace" b1cbc5:785fc65dd52a4c82a2774efc8b669fef "-a image" "-b table" "-c " "-d 1" "-e true" "-f os,library" "-g CRITICAL,HIGH" "-h " "-i example/example/example:e67777a6d" "-j ." "-k " "-l " "-m " "-n 15m" "-o " "-p " "-q " "-r false" "-s " "-t " "-u " "-v "
and fails with
Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
Apologies if it's not related; I'm still trying to understand the flow of Actions.
dockerd can be started using hooks:
AutoscalingRunnerSet envs
env:
  - name: ACTIONS_RUNNER_HOOK_JOB_STARTED
    value: /home/runner/hooks/common-start.sh
common-start.sh:
#!/usr/bin/env bash
set -u
source /home/runner/hooks/logger.sh
source /home/runner/hooks/wait.sh
log.debug 'Starting Docker daemon'
sudo /usr/bin/dockerd &
log.debug 'Waiting for processes to be running...'
processes=(dockerd)
for process in "${processes[@]}"; do
  if ! wait_for_process "$process"; then
    log.error "$process is not running after max time"
    exit 1
  else
    log.debug "$process is running"
  fi
done
The logger and wait scripts can be seen in https://github.com/actions/actions-runner-controller/blob/master/runner/logger.sh and https://github.com/actions/actions-runner-controller/blob/master/runner/wait.sh.
So: one env variable and three files need to be injected. It's up to the user how that is done: a ConfigMap, a new image(?). We are building our own image which contains, for instance, cached actions and settings like this.
P.S. The default GitHub runner image is missing quite a few dependencies needed to get Docker started; at least fuse-overlayfs and iptables are needed for that.
Oh, forgot one thing:
securityContext:
  privileged: true # needed for dockerd
That is needed as well.
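For the ConfigMap route mentioned above, a minimal sketch of the injection might look like this (the ConfigMap name is made up, and it is assumed to already contain common-start.sh, logger.sh, and wait.sh with the contents shown earlier; the base image may also need fuse-overlayfs and iptables installed):

template:
  spec:
    containers:
      - name: runner
        image: ghcr.io/actions/actions-runner:latest
        command: ["/home/runner/run.sh"]
        env:
          - name: ACTIONS_RUNNER_HOOK_JOB_STARTED
            value: /home/runner/hooks/common-start.sh
        securityContext:
          privileged: true          # needed for dockerd, as noted above
        volumeMounts:
          - name: hooks
            mountPath: /home/runner/hooks
    volumes:
      - name: hooks
        configMap:
          name: runner-hooks        # hypothetical ConfigMap holding the 3 scripts
          defaultMode: 0755         # scripts must be executable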
Making a hook is exactly what I needed to get around this, using a modified version of GitHub's published example, though I am still a bit curious why the upstream runner project has the Docker socket file path hardcoded. The hook works but is somewhat undesirable: when the workflow fails for any reason, it spits out a "Please contact your self-hosted administrator" message, and I haven't found a way to get rid of that error.
I'd rather not use a hook if all that's needed is an upstream change to be able to override that hardcoded string with an environment variable.
Yeah, I do not understand why Docker is not made available automatically. It is available if you use something like ubuntu-latest, so why is it not included in the runner image?
Hey everyone,
Until this is resolved, may I suggest a workaround for this particular use case. To have it working, please comment out the entire containerMode object (or leave containerMode.type empty), and provide a dind spec as a side-car container as described here.
The containerMode object does not influence the controller in any way. It just expands the Helm template, so that specifying a dind configuration is more convenient in most cases.
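As a rough sketch of that suggestion (a values-file skeleton only; the complete, working examples appear later in this thread), the idea is:

# containerMode:
#   type: dind          # left commented out / empty
template:
  spec:
    containers:
      - name: runner
        image: ghcr.io/actions/actions-runner:latest
        command: ["/home/runner/run.sh"]
      - name: dind        # hand-written dind side-car, expanded by hand
        image: docker:dind
        securityContext:
          privileged: true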
@nikola-jokic I think your link for the spec should not refer to the SHA abc0b678d323b but to a newer one (or master).
The SHA referenced might be misleading, since it says there:
## - name: DOCKER_HOST
## value: tcp://localhost:2376
I would think that the current master might be better here. Link with the SHA at the time of writing this comment here.
We just did the migration to the new runner scale sets.
Basically, what is important is to have the Docker socket available at /var/run/docker.sock; there are implicit assumptions which expect it to be there. DOCKER_HOST is not always adhered to. One example that I can give is a step/workflow which does the following:
"uses": "docker://example.com/dockerrepo/image:tag"
This is one of our deployments; it's an in-memory configuration. You can take inspiration from it, I think.
The important part is to share /var/run/ between dind and the runner:
template:
  spec:
    nodeSelector:
      type: shared-16core
    containers:
      - command:
          - /home/runner/run.sh
        env:
          - name: DOCKER_HOST
            value: unix:///var/run/docker.sock
          - name: RUNNER_WAIT_FOR_DOCKER_IN_SECONDS
            value: "120"
        image: eu.gcr.io/unicorn-985/docker-images_actions-runner:v1
        name: runner
        resources:
          limits:
            cpu: "4"
            memory: 6Gi
          requests:
            cpu: 1700m
            memory: 5Gi
        volumeMounts:
          - mountPath: /home/runner/_work
            name: work
          - mountPath: /var/run
            name: dind-sock
      - args:
          - dockerd
          - --host=unix:///var/run/docker.sock
          - --group=$(DOCKER_GROUP_GID)
        env:
          - name: DOCKER_GROUP_GID
            value: "123"
        image: docker:dind
        name: dind
        securityContext:
          privileged: true
        volumeMounts:
          - mountPath: /home/runner/_work
            name: work
          - mountPath: /var/run
            name: dind-sock
          - mountPath: /home/runner/externals
            name: dind-externals
          - mountPath: /var/lib/docker
            name: dind-scratch
    initContainers:
      - args:
          - -r
          - -v
          - /home/runner/externals/.
          - /home/runner/tmpDir/
        command:
          - cp
        image: eu.gcr.io/unicorn-985/docker-images_actions-runner:v1
        name: init-dind-externals
        resources: {}
        volumeMounts:
          - mountPath: /home/runner/tmpDir
            name: dind-externals
    restartPolicy: Never
    volumes:
      - name: dind-sock
        emptyDir: {}
      - name: dind-externals
        emptyDir:
          medium: Memory
      - name: dind-scratch
        emptyDir:
          medium: Memory
      - name: work
        emptyDir:
          medium: Memory
      - name: tmp
        emptyDir:
          medium: Memory
Hey everyone, yes, the out-of-the-box socket is not positioned in the same place where the runner expects it.
Unfortunately, for now, please expand the dind spec by hand. :disappointed:
Should we update the README/docs to add this improvement?
I think the current dind spec expansion example is not what people expect (this was at least the case when I implemented our migration).
I.e., the reporter here expected his Docker plugin to work.
the important part is to share /var/run/ between dind and the runner
Hello everyone. I changed "/run/docker" to "/var/run" in the dind template everywhere, but that caused the runner pods to fail immediately with a "StartError". I could not find any useful logs on why this happened from any ARC related resource. The EphemeralRunner just showed the reason as "Pod has failed to start more than 5 times". Any idea what I could be missing here? (I'm using a custom runner that just adds a few libraries on top of the default runner image)
@ananthu1834 Sounds like the init container gave a non-zero exit code. You should be able to get hints as to why in the logs of that startup process.
Thank you @genisd. But I did check the init container; it threw a lot of logs, which are only related to the files being copied over between directories. Besides, the dind main container did run and threw the following logs at the end before terminating, suggesting that 1) the init container passed, and 2) the dind container successfully started the Docker daemon but got a termination signal from outside?
Also, one more observation: this problem happens only when I use the path /var/run/docker.sock. Any other random path there seems to work, which suggests something very specific to this default path is causing the termination. Sort of stuck here at the moment; will try digging deeper.
Just to be sure, you're sharing the /var/run/ directory, not /var/run/docker.sock (like in my example above)?
I tried sharing only the socket myself but that didn't work for me either, though if I recall correctly I did get conclusive errors when trying that.
You could test my example code. I think only the nodeSelector and the runner image are specific to our environment (we simply mirror the official image so that we don't have to pull it from GitHub; we do that because when GitHub Actions has hiccups, the official image repo gets overwhelmed).
Ah, the example code you provided worked! (Of course, without the nodeSelector, and with our own custom image and secrets.) Digging deeper and isolating what the difference was, I found that I had added readOnly: true for the dind-sock volume mount on the runner container, as mentioned here. I had just re-used that template with the changes I needed, like a custom image and some secrets.
Still not sure why the above change worked, but now the runner container is starting up and running as expected. Thank you @genisd for your help :)
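For clarity, the difference boils down to this mount on the runner container (a sketch; per the discussion above, adding readOnly: true here caused the pods to fail with a StartError, while leaving the mount writable let them start):

volumeMounts:
  - name: dind-sock
    mountPath: /var/run
    # readOnly: true   # <- do not add this on the shared /var/run mount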
We should update that example template, I think. It's not yielding the behavior that people expect and therefore is not a good baseline for customizing one's own environment.
@genisd Thanks for sharing your setup, it helped me fix the same issue.
Wouldn't it be better to use a host volume for your dind-scratch volume instead of a memory volume, so the layer cache can be re-used?
My final configuration (the securityContext settings are optional):
githubConfigUrl: https://github.com/<my org>
githubConfigSecret: <secret name with app id & private key>
maxRunners: 28
controllerServiceAccount:
  namespace: gha-runner
  name: gha-runner-gha-rs-controller
listenerTemplate:
  spec:
    securityContext:
      runAsNonRoot: true
      runAsUser: 1001
      runAsGroup: 123
      seccompProfile:
        type: RuntimeDefault
    containers:
      - name: listener
        securityContext:
          readOnlyRootFilesystem: true
          allowPrivilegeEscalation: false
          seccompProfile:
            type: RuntimeDefault
          capabilities:
            drop:
              - ALL
template:
  spec:
    securityContext:
      fsGroup: 123
      seccompProfile:
        type: RuntimeDefault
    restartPolicy: Never
    volumes:
      - name: work
        emptyDir:
          medium: Memory
          sizeLimit: "4Gi"
      - name: dind-sock
        emptyDir: {}
      - name: dind-externals
        emptyDir:
          medium: Memory
    initContainers:
      - name: init-dind-externals
        image: ghcr.io/actions/actions-runner:2.311.0
        command: ["cp", "-r", "-v", "/home/runner/externals/.", "/home/runner/tmpDir/"]
        volumeMounts:
          - name: dind-externals
            mountPath: /home/runner/tmpDir
        securityContext:
          runAsUser: 1001
          runAsGroup: 123
          allowPrivilegeEscalation: false
          seccompProfile:
            type: RuntimeDefault
          capabilities:
            drop:
              - ALL
    containers:
      - name: runner
        image: ghcr.io/actions/actions-runner:2.311.0
        command: ["/home/runner/run.sh"]
        env:
          - name: DOCKER_HOST
            value: unix:///var/run/docker.sock
          - name: RUNNER_WAIT_FOR_DOCKER_IN_SECONDS
            value: "120"
        volumeMounts:
          - name: work
            mountPath: /home/runner/_work
          - name: dind-sock
            mountPath: /var/run
        securityContext:
          runAsUser: 1001
          runAsGroup: 123
          allowPrivilegeEscalation: false
          seccompProfile:
            type: RuntimeDefault
          capabilities:
            drop:
              - ALL
      - name: dind
        image: docker:dind
        args:
          - dockerd
          - --host=unix:///var/run/docker.sock
          - --group=$(DOCKER_GROUP_GID)
        env:
          - name: DOCKER_GROUP_GID
            value: "123"
        securityContext:
          privileged: true
        volumeMounts:
          - name: work
            mountPath: /home/runner/_work
          - name: dind-sock
            mountPath: /var/run
          - name: dind-externals
            mountPath: /home/runner/externals
I agree that the in-memory configuration is not for everyone and should not be what ends up in the README.
I would say that your example, stripped to the bare minimum, is what should be in the documentation as the go-to baseline template.
Our organization also ran into this issue on 0.8.2. We have a workflow with:
container: azul/zulu-openjdk:17-latest
The job that runs in this container then spins up other containers with test-containers. The initialization of these containers would fail because docker couldn't be found. We then used:
- name: DOCKER_HOST
  value: unix:///run/docker/docker.sock
instead of:
- name: DOCKER_HOST
  value: unix:///var/run/docker.sock
We also removed readOnly: true. After these changes, Docker inside the container can now be accessed properly. Thank you @YvesZelros!
+1 to this being updated in the README as well as in the default values.yaml file.