
Support Runner inside of Docker Container

jpb opened this issue 4 years ago • 46 comments

Describe the enhancement

Fully support all features when runner is within a Docker container.

Not all features are currently supported when the runner is within a Docker container, specifically those features that use Docker like Docker-based Actions and services. Running self-hosted runners using Docker is an easy way to scale out runners on some sort of Docker-based cluster and an easy way to provide clean workspaces for each run (with ./run.sh --once).
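
As a minimal sketch of that clean-workspace pattern (the image name and the socket mount here are illustrative assumptions, not part of the proposal):

# Each iteration starts a fresh container that takes exactly one job and exits,
# so every job gets a clean workspace. "my-runner-image" is hypothetical.
while true; do
  docker run --rm \
    -v /var/run/docker.sock:/var/run/docker.sock \
    my-runner-image ./run.sh --once
done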

Code Snippet

A possible implementation that I am using now.

Additional information

There are a few areas of concern when the runner executes in a Docker container:

  1. Filesystem access for other containers needed as part of the job. This can be resolved by using a volume mount from the host with matching host and container paths (for example docker run -v /home/github:/home/github, although it doesn't have to be this exact directory) and telling the runner to use a directory within that mount as its work directory (./config.sh --work /home/github/work). This works with the current volume-mounting behaviour for containers created by the runner, and would need to be documented as part of the setup process for a Docker-based runner.
  2. Networking between the runner and the other containers needed as part of the job. This could be resolved by not creating a network as part of the run and instead optionally accepting an existing network to use. I have found that --network container:<container ID of the runner> works well to reuse the network of the runner container without having to orchestrate a network created via docker network create. There is no straightforward way to discover the network or ID of a container from within it, so it would likely need to be the user's responsibility to pass this information to the runner. I currently do something like "container:$(cat /proc/self/cgroup | grep "cpu" | head -n 1 | rev | cut -d/ -f 1 | rev)" from within the runner container to find the ID, although this isn't guaranteed to work in all cases. A sketch combining both points follows this list.
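
A sketch combining both points (the image name, URL, and token are illustrative placeholders):

# Point 1: matching host/container path for the work directory, plus the
# host Docker socket so the runner can start sibling containers.
docker run -d --name gh-runner \
  -v /var/run/docker.sock:/var/run/docker.sock \
  -v /home/github:/home/github \
  my-runner-image

# Inside the container: configure the work directory under the shared mount,
# then discover our own container ID (point 2) so job containers could reuse
# our network via --network container:<id>.
./config.sh --url https://github.com/<org>/<repo> --token <token> --work /home/github/work
RUNNER_CONTAINER_ID=$(cat /proc/self/cgroup | grep "cpu" | head -n 1 | rev | cut -d/ -f 1 | rev)
./run.sh --once   # the runner would pass "container:${RUNNER_CONTAINER_ID}" as the network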

jpb avatar Apr 03 '20 21:04 jpb

There appear to be a couple more things that need to be done to account for multiple runners on the same host concurrently:

  1. docker network prune cannot run concurrently and should likely be retried if such an error is received (see the retry sketch after this list):
    /usr/local/bin/docker network prune --force --filter "label=898d1dec6adc"
    Error response from daemon: a prune operation is already running
    ##[warning]Delete stale container networks failed, docker network prune fail with exit code 1
    
  2. The Docker label is not sufficient for isolating separate runners on the same host: the current hash of the root directory results in the same label for all runners at the exact same version. In my testing I've switched this to use the hostname, but perhaps something like the runner name or run ID could be used.
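
A minimal sketch of the retry idea from point 1; the hostname-based label reflects the point 2 suggestion and is an assumption, not current runner behaviour:

# Retry the prune with backoff when another prune is already in flight.
for attempt in 1 2 3 4 5; do
  docker network prune --force --filter "label=$(hostname)" && break
  echo "prune busy, retrying (attempt ${attempt})..."
  sleep $((attempt * 2))
done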

jpb avatar Apr 07 '20 05:04 jpb

@TingluoHuang @bryanmacfarlane I'm hoping to get your feedback on this – getting official support for this would be a huge help for me. I'm happy to work on an implementation if that is helpful.

jpb avatar May 13 '20 18:05 jpb

This is a big problem for us. We want to run the GH runner in Docker for easier scaling and isolation, but we also need to run services for tests. Our workaround for now is to run multiple runners on the host, but scaling containers with docker-compose is so much easier and more convenient.
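
For illustration, the one-command scaling meant here, assuming a hypothetical compose service named runner:

# Bring up five identical runner containers from one compose service definition.
docker-compose up -d --scale runner=5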

SonicGD avatar Jun 08 '20 07:06 SonicGD


Fully agree. For the time being we have built a scalable solution on AWS Spot to serve our Docker builds. A detailed blog post and reference to the code: https://040code.github.io/2020/05/25/scaling-selfhosted-action-runners

npalm avatar Jun 08 '20 11:06 npalm


Currently we have to create workarounds using suboptimal solutions to deploy tens of runners, or rather ugly workarounds for container usage in jobs. How can we raise the priority of this?

Just curious, how do others manage scaling the runners? This is probably one of the most interesting approaches I've seen so far. I guess many of us have faced this same challenge when scaling GH runners. "Official" scaling proposals from GitHub would be more than welcome :)

jupe avatar Feb 16 '21 12:02 jupe


Big kudos to @npalm and his solution on AWS. We also built a similar solution for GCP, allowing us to scale our self-hosted runners for a whole GitHub organization: https://github.com/faberNovel/terraform-gcp-github-runner

vincentbrison avatar Feb 19 '21 09:02 vincentbrison

@jupe we use https://github.com/summerwind/actions-runner-controller which has worked really well for us so far

callum-tait-pbx avatar Apr 17 '21 11:04 callum-tait-pbx

waiting for this one so bad

pratikbin avatar Jun 27 '21 07:06 pratikbin

@jpb Is there any possibility to get a higher priority on this?

uwehdaub avatar Jul 02 '21 07:07 uwehdaub

For now we will use a workaround based on docker-compose. We have the following docker-compose.yaml file in the repo to set up the services:

version: "3.3"
services:
  nginx:
    image: nginx
  redis:
    image: redis

We then connect the self-hosted runner, which is also running inside Docker, to the created network. This is one example workflow:

name: Start docker compose
on:
  workflow_dispatch:
jobs:
  start-docker-compose:
    # should run as docker container with connection to the dockerd of the host
    runs-on: [self-hosted] 
    steps:
      - name: Checkout code
        uses: actions/checkout@v2
        with:
          fetch-depth: 1
      - name: Start docker compose
        id: start-docker-compose
        run: |
          project_prefix=my-project
          project_name="${project_prefix}-$(head /dev/urandom | tr -dc A-Za-z0-9 | head -c 13 | tr '[:upper:]' '[:lower:]')"
          # Find this runner container's own ID from its cgroup entry
          my_container_id=$(grep docker /proc/self/cgroup | head -n 1 | sed "s|^.*/docker/\(.*\)|\\1|")

          docker-compose -p "${project_name}" up -d
          while ! docker network inspect "${project_name}_default" > /dev/null ; do
            sleep 1
          done

          docker network connect "${project_name}_default" "${my_container_id}"

          echo "::set-output name=my_container_id::$my_container_id"
          echo "::set-output name=project_name::$project_name"

      - name: Check output
        run: |
          echo "Project name: ${{ steps.start-docker-compose.outputs.project_name}}"
          echo "Container ID: ${{ steps.start-docker-compose.outputs.remote_container_id}}"

      - name: Use the started docker compose services
        run: |
          # Install netcat to check redis
          apt-get update
          apt-get install -y netcat
          echo '--------------------'
          ping -c 1 nginx
          curl nginx
          echo '--------------------'
          ping -c 1 redis
          echo ping | netcat -w 2 redis 6379
      - name: Cleanup started docker compose services
        if: always()
        run: |
          docker network disconnect ${{ steps.start-docker-compose.outputs.project_name}}_default ${{ steps.start-docker-compose.outputs.my_container_id}}
          docker-compose -p ${{ steps.start-docker-compose.outputs.project_name}} down

uwehdaub avatar Jul 08 '21 09:07 uwehdaub

@jpb , since you asked me, I'm ➕ on this, But adding @hross to weigh in since he's driving the runner area now. 🚀

bryanmacfarlane avatar Jul 09 '21 01:07 bryanmacfarlane

We still want to do this and it's on our list but we don't have a date or schedule for shipping this type of feature right now.

hross avatar Jul 12 '21 11:07 hross

Would love to see this prioritized. Can't really run docker-in-docker on Kubernetes self-hosted runners without this.

brandonschabell avatar Jul 13 '21 23:07 brandonschabell

Any update on this issue?

nehagargSeequent avatar Aug 18 '21 20:08 nehagargSeequent

Ping @bryanmacfarlane =)

myoung34 avatar Oct 07 '21 12:10 myoung34

Plus one here; any update or ETA? @hross

na-jakobs avatar Nov 17 '21 22:11 na-jakobs

Another +1 here, for me this is blocking some 3rd-party deployment workflows with private AKS clusters

pl4nty avatar Dec 18 '21 22:12 pl4nty

another +1 again, waiting for this feature so badly!

ixxeL2097 avatar Dec 19 '21 14:12 ixxeL2097

+1

salim97 avatar Dec 28 '21 22:12 salim97

@hross at least according to this thread, you are the most recent driver on this topic. Has anything changed since your July 2021 post regarding prioritisation? Thanks for your assistance!

kelseymok avatar Jan 10 '22 17:01 kelseymok

Just started to look into Runners within Containers. Seems that full support would be a great help.

alebeauthermo avatar Jan 27 '22 20:01 alebeauthermo

This issue has been open for 2 years now, is there any plan to support it? It seems like a lot of people are requesting it.

BlueskyFR avatar May 03 '22 23:05 BlueskyFR

@nikola-jokic @fhammerl @TingluoHuang @ruvceskistefan @thboop could we have someone assigned to this issue, to at least make a decision about what is going to happen? Sorry for the ping, but something must be done here IMO.

BlueskyFR avatar May 04 '22 12:05 BlueskyFR

A workaround is to put everything into a docker-compose file and then use the --exit-code-from option of the docker-compose command:

--exit-code-from: Return the exit code of the selected service container. Implies --abort-on-container-exit

For example (needs modifications):

docker-compose:

version: "3"
services:
  postgresql:
    image: postgres:latest
    environment:
      - POSTGRES_DB=...
      - POSTGRES_USER=...
      - POSTGRES_PASSWORD=...
  server:
    build: .
    volumes:
      - ./src:/src
      - ./cache/gradle:/root/.gradle
    environment:
      - DB_HOST=postgresql
      - DB_NAME=...
      - DB_USER=...
      - DB_PASSWORD=...
    command: ["bash", "-c", "./wait-for-it.sh -t 0 postgresql:5432 -- ./gradlew test --info"]

Workflow:

name: Server QA
on:
  push:

jobs:
  test:
    runs-on: self-hosted
    steps:
      - name: Run tests
        run: docker-compose up --exit-code-from server --force-recreate --build

brandonfl avatar Jun 22 '22 15:06 brandonfl


Makes sense; re @npalm's AWS solution above, I did something similar for DNS Resolver ENIs with CloudWatch as inputs.

ecout avatar Sep 07 '22 23:09 ecout

The main issue I see with all this is access to docker.sock, i.e. the whole Docker-in-Docker-with-root-access scenario: https://github.com/myoung34/docker-github-actions-runner/issues/61. Among the examples mentioned here (https://github.com/actions/runner/issues/367#issuecomment-597742895), you can try rootless Docker: https://docs.docker.com/engine/security/rootless/#rootless-docker-in-docker
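
A minimal sketch of that rootless route, using the official docker:dind-rootless image per the linked docs (wiring a runner to it is out of scope here):

# Rootless Docker-in-Docker: the daemon inside runs as a non-root user.
# The outer container still needs --privileged for user namespaces etc.
docker run -d --name dind-rootless --privileged docker:dind-rootless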

But then you run into limitations. With a Docker container "runner" running inside another rootless Docker container inside your typical rooted Docker, can you even do docker build then? On BuildKit and other alternatives: https://www.containiq.com/post/docker-alternatives

And apparently some things have stopped working: https://github.com/actions/runner/issues/2103

And then again, at the end of the day, you'll want container orchestration to bring your runners up and down.

Can your actions use Docker alternatives to build images with a container runner?

https://snyk.io/blog/building-docker-images-kubernetes/
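
For instance, a hedged sketch of a daemonless build with kaniko, one of the alternatives such posts describe (flags per the kaniko docs; the paths are illustrative):

# Build an image from the current directory without talking to a Docker daemon.
# --no-push keeps the sketch self-contained; a real run would use --destination.
docker run --rm \
  -v "$PWD":/workspace \
  gcr.io/kaniko-project/executor:latest \
  --context dir:///workspace \
  --dockerfile /workspace/Dockerfile \
  --no-push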

For our team specifically we do want container runners that are also able to run containers.

ecout avatar Sep 07 '22 23:09 ecout

Any update on this, or any news?

alexjoeyyong avatar Jan 18 '23 17:01 alexjoeyyong

Also need to plus-one this issue. I've tried every workaround, including the latest changes to https://github.com/actions/runner/blob/main/images/Dockerfile by @TingluoHuang and the team, but having the CLI isn't really much use unless we can run docker pull xxx. More specifically, when anyone is developing actions they have to be hyper-aware of what the action is written in.

AJMcKane avatar Feb 25 '23 15:02 AJMcKane

I really want to run my jobs using that container feature :(

...
jobs:
  job:
    runs-on:
      labels:
        - self-hosted
        - linux
        - ${{ inputs.RUNNER_LABEL }}
    container:
      image: ${{ inputs.DOCKER_IMAGE }}
      credentials:
        username: ${{ github.actor }}
        password: ${{ secrets.GITHUB_TOKEN }}

    steps:
      - run: |
          sfdx version --json

Since I can't execute jobs that run inside containers, and my workflows don't need any other dockerized services, I can get around this limitation by just creating pods from a Docker image that has the runner plus everything else my Docker image needs to run the job. The downside is that I need to build and maintain a new image. The following image shows exactly what I'm thinking about:

[image: diagram of the proposed workaround architecture]

Note: I don't need more than one node. If an entire node goes down, I can just wait for EKS to recreate it, as well as its pods.

With the workaround architecture in place, I can then remove the container configuration from the job manifest.

...
jobs:
  job:
    runs-on:
      labels:
        - self-hosted
        - linux
        - ${{ inputs.RUNNER_LABEL }}

    steps:
      - run: |
          sfdx version --json

Note: I'm just not sure whether the storage is somehow going to be shared by the pods or whether each pod's storage is unique, even when using the same name. If the storage is shared between pods, then one job could impact another if both pods run on the same node.

AllanOricil avatar May 03 '23 01:05 AllanOricil

I have a pod running a github-runner container and a dind container. When a job that runs in a container is picked up by the github-runner container, the job can't execute a simple inline script such as echo hello. Am I doing something wrong, or is it a problem caused by this issue, as stated in this other issue?

In the following image you can see that the container is created by the dind container without a problem, but the inline script can't be executed inside it:

[image: job log showing the container created by dind, with the inline script step failing]

This is my kubernetes deployment manifest:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: github-runner-public
  namespace: github-runners
  labels:
    app: github-runner-public
spec:
  replicas: 1
  selector:
    matchLabels:
      app: github-runner-public
  template:
    metadata:
      labels:
        app: github-runner-public
    spec:
      nodeSelector:
        eks.amazonaws.com/nodegroup: public-nodegroup
      containers:
        - name: github-runner
          image: 225537886698.dkr.ecr.eu-west-1.amazonaws.com/github-runner-test:v1.1.3
          env:
            - name: DOCKER_HOST
              value: tcp://localhost:2375
            - name: DOCKER_API_VERSION
              value: "1.42"
          volumeMounts:
            - name: runner-workspace
              mountPath: /actions-runner/_work
        - name: dind-daemon
          image: docker:23.0.5-dind
          command: ["dockerd", "--host", "tcp://127.0.0.1:2375"]
          securityContext:
            privileged: true
          volumeMounts:
            - name: docker-graph-storage
              mountPath: /var/lib/docker
      volumes:
        - name: docker-graph-storage
          emptyDir: {}
        - name: runner-workspace
          emptyDir: {}

Why/how can the container that runs inside dind not have access to /actions-runner/_work/_temp from the github-runner container? I don't get it. After reading this post I understood that directories from the container inside dind would be mapped to directories inside the github-runner container. So, if the container created by dind maps /actions-runner/_work from the github-runner container to /__w inside the job container, as shown below, why isn't /__w/_temp/<bla>.sh available?

/actions-runner/_work (volume on the node) <- github-runner [/actions-runner/_work] -> (dind) -> my-container [/__w]

Shouldn't /actions-runner/_work/_temp/<bla>.sh be inside the volume?
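
One assumption worth testing (a sketch, not a confirmed diagnosis): the dind daemon resolves bind-mount source paths against its own filesystem, so the workspace volume may need to be mounted into the dind container at the same path as well. Expressed with plain docker commands mirroring the manifest above (my-runner-image is hypothetical):

# Shared workspace volume mounted at the SAME path in both containers, so
# bind mounts requested through the dind daemon resolve to the same files.
docker network create runner-net
docker volume create runner-workspace
docker run -d --name dind --privileged --network runner-net \
  -e DOCKER_TLS_CERTDIR="" \
  -v runner-workspace:/actions-runner/_work \
  docker:23.0.5-dind dockerd --host tcp://0.0.0.0:2375
docker run -d --name github-runner --network runner-net \
  -e DOCKER_HOST=tcp://dind:2375 \
  -v runner-workspace:/actions-runner/_work \
  my-runner-image   # hypothetical runner image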

[image: container mount configuration showing /actions-runner/_work mapped to /__w]

This is the content of the /actions-runner/_work/_temp directory inside the github-runner container. For some reason it is empty. Does this mean the controller can't create the inline script inside the runner when it is running in a container?

[image: listing showing an empty _temp directory]

AllanOricil avatar May 03 '23 09:05 AllanOricil