Anonymous volumes not cleaned up after containers are removed
When using a service container whose Dockerfile contains a VOLUME declaration
(both the postgres and redis images used in the examples at
https://docs.github.com/en/actions/using-containerized-services do this), an
anonymous volume is automatically created together with the container.
At the end of the workflow, the container is stopped and removed, but the anonymous volumes it created stay around.
In our case, we use a postgres service as a throwaway database for some tests,
which results in many smallish anonymous volumes accumulating on the systems
that host our GH runners.
At the moment we work around this with manual cleanup or cronjobs, but ideally these volumes would be cleaned up automatically at the end of the workflow.
I tried passing --rm to jobs.<job_id>.services.<service_id>.options, but
that ends up as an argument to docker create. The runner creates, starts,
stops, and removes containers with separate commands, so passing --rm to
docker create has no effect on the final docker rm.
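For illustration, the sequence looks roughly like this (a simplified sketch, not the runner's exact invocations):
docker create --name svc <options from the workflow> postgres   # --rm from options lands here
docker start svc
# ... job runs ...
docker stop svc
docker rm svc    # separate command, no -v, so anonymous volumes survive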
Would it be an option to change DockerRemove to use the -v option?
https://github.com/actions/runner/blob/628f462ab709492bf03b149468ef18415f9bc1bb/src/Runner.Worker/Container/DockerCommandManager.cs#L266-L269
From the docker rm docs:
--volumes, -v    Remove anonymous volumes associated with the container
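The difference is easy to check locally (a hedged demo; container names are arbitrary):
docker create --name demo postgres     # image declares VOLUME /var/lib/postgresql/data
docker rm demo                         # container is gone, its anonymous volume is not
docker create --name demo2 postgres
docker rm -v demo2                     # -v removes the anonymous volume as well
docker volume ls -qf dangling=true     # only demo's leftover volume remains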
Any idea for a workaround would also be appreciated; I don't see any other way to have these anonymous volumes removed.
Thank you!
To Reproduce
Run a workflow with a service container whose image uses VOLUME, for example (from https://docs.github.com/en/actions/using-containerized-services/creating-postgresql-service-containers):
container: node:10.18-jessie
# Service containers to run with `container-job`
services:
  # Label used to access the service container
  postgres:
    # Docker Hub image
    image: postgres
    # Provide the password for postgres
    env:
      POSTGRES_PASSWORD: postgres
    # Set health checks to wait until postgres has started
    options: >-
      --health-cmd pg_isready
      --health-interval 10s
      --health-timeout 5s
      --health-retries 5
After the workflow has finished, run docker system df -v on the machine that ran the job; the anonymous volume created by the service container is still around:
VOLUME NAME                                                        LINKS   SIZE
b9837a9103b4f4a863a93dd952e7b8777c10274fc428cac4760ae79f13ac5826   0       43.2MB
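Anonymous volumes are the ones with 64-character hex names, so a quick way to count the leftovers on a runner host (a hedged one-liner, not from the runner docs):
docker volume ls -qf dangling=true | grep -cE '^[0-9a-f]{64}$'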
Expected behavior
Anonymous volumes created automatically by the service container should be cleaned up.
Runner Version and Platform
2.291.1, Linux
Hi @brainlock,
Thanks for reporting this issue! I added it to the board, so we're going to work on it in the near future. I'll let you know when we have more information or when we create a PR that fixes this.
I have the same issue...
Does anyone have a workaround for this in the meantime? Seems like this issue stalled out a bit.
@jromero-pg I've got a workaround, but it's purely a bandaid; there may be a better way, or issues with it I haven't run into yet.
My current workaround is to add a cleanup script via ACTIONS_RUNNER_HOOK_JOB_COMPLETED and either prune all unused anonymous volumes, or give the volumes calculable names and delete them by name.
A cronjob might honestly be way easier depending on what your environment looks like.
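For instance, a crontab entry along these lines would do it (just a sketch; adjust the schedule to your environment):
# prune unused volumes every night at 03:00
0 3 * * * docker volume prune -f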
Workarounds
As an example, assume we're in the home directory with an actions-runner folder, on a VM dedicated to GitHub Actions.
Make a runner_cleanup.sh script in your home directory.
Edit the .env file in your runner directory to add the completed hook:
echo "ACTIONS_RUNNER_HOOK_JOB_COMPLETED=/home/runner/runner_cleanup.sh" >> /home/runner/actions-runner/.env
Quick and Dirty method - Prune all unused anonymous volumes
This method prunes every unused anonymous volume on the system. This could be dangerous if your VM is used for things other than GitHub Actions!
Because this method should leave actively used volumes alone, it should be fine to run after every run.
Add this to runner_cleanup.sh
#!/usr/bin/env bash
# Take a lock so hooks from concurrent jobs don't trigger overlapping prunes.
exec 100>/tmp/docker-prune.lock || exit 1
flock -w 100 100 || exit 1
# Prune unused volumes.
docker volume prune -f
The lock is required as otherwise you'll get this error:
Error response from daemon: a prune operation is already running
A More "Proper" Workaround
This method is "cleaner" because you're explicitly naming the volumes and then deleting them by name. I'm not sure it provides that much benefit, though, and it doesn't cover the case where the runner itself crashes. It also requires you to know the volume paths declared in your images.
Add this to runner_cleanup.sh
#!/usr/bin/env bash
# Make the slug unique per service container type.
TARGET_VOLUME="postgres-$GITHUB_SHA-$GITHUB_RUN_ID-$GITHUB_RUN_NUMBER-$GITHUB_RUN_ATTEMPT"
echo "Deleting $TARGET_VOLUME"
docker volume rm -f "$TARGET_VOLUME"
Check the Dockerfile of the image you're using to find which directory it declares as a volume; for postgres it looks like:
VOLUME /var/lib/postgresql/data
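If you'd rather not dig through the Dockerfile, the same information can be read from the image metadata (a one-liner sketch; the tag matches the workflow below):
docker image inspect postgres:16-alpine --format '{{json .Config.Volumes}}'
# prints something like: {"/var/lib/postgresql/data":{}}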
Set the corresponding name in your workflow:
services:
  postgres:
    image: postgres:16-alpine
    volumes:
      - "postgres-${{github.sha}}-${{github.run_id}}-${{github.run_number}}-${{github.run_attempt}}:/var/lib/postgresql/data"
https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners/running-scripts-before-or-after-a-job
EDIT: Changed the flock arguments to wait 100 seconds instead of dying instantly.
We were in the process of setting up our self-hosted runner, but the server ran out of disk space after only a few runs. Checking /var/lib/docker, it held 12GB of unused volumes (201MB each) and 6GB of images.
An example job:
jobs:
  js-lint:
    if: github.event.pull_request.draft == false
    runs-on: self-hosted
    container:
      image: ubuntu:22.04
      options: --cpus 2
    timeout-minutes: 10
Some jobs also have a service defined:
services:
  mysql:
    image: mysql:8
    env:
      MYSQL_ROOT_PASSWORD: user
      MYSQL_DATABASE: user_test
    options: >-
      --health-cmd="mysqladmin ping -h 127.0.0.1 -uroot -ppassword --protocol=tcp"
      --health-interval=10s
      --health-timeout=5s
      --health-retries=20
@ruvceskistefan Any news on this issue?
Any news on the subject @ruvceskistefan? Besides the anonymous volumes, named volumes we create explicitly aren't destroyed either, for instance:
mongo:
  image: mongo:7.0.6
  volumes:
    - mongo_volume:/data/db
  ports:
    - 27017:27017
That means we always have to kill the container to remove mongo_volume before the automatic Stop containers step, which generates warnings on the job. We have MongoDB and Elasticsearch running, and cleaning up means seeing those warnings on every run.
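For reference, a minimal sketch of that kind of cleanup step (assuming the mongo_volume name from the config above; docker ps can filter by volume):
# find whichever container currently holds the named volume, kill it, then remove the volume
cid=$(docker ps -q --filter volume=mongo_volume)
[ -n "$cid" ] && docker kill "$cid"
docker volume rm -f mongo_volume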
Hello @ruvceskistefan, any updates?