Anonymous volumes not cleaned up after containers are removed
When using a service container whose Dockerfile contains a VOLUME declaration
(both the postgres and redis images used in the examples at
https://docs.github.com/en/actions/using-containerized-services do this), an
anonymous volume is automatically created together with the container.
At the end of the workflow, the container is stopped and removed, but the anonymous volumes it created stay around.
In our case, we use a postgres service as a throwaway database for some tests,
which results in many smallish anonymous volumes accumulating on the systems
that host our GH runners.
At the moment we work around this with manual cleanup or cronjobs, but ideally these volumes would be cleaned up automatically at the end of the workflow.
I tried passing --rm to jobs.<job_id>.services.<service_id>.options, but
that ends up as an argument to docker create. The runner creates, starts,
stops, and removes containers with separate commands, so passing --rm to
docker create has no effect on the final docker rm.
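For illustration, the sequence looks roughly like this (a simplified sketch, not the runner's exact invocations):
docker create --name svc <options from the workflow> postgres   # --rm from options lands here
docker start svc
# ... job runs ...
docker stop svc
docker rm svc    # separate command, no -v, so anonymous volumes survive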
Would it be an option to change DockerRemove to use the -v option?
https://github.com/actions/runner/blob/628f462ab709492bf03b149468ef18415f9bc1bb/src/Runner.Worker/Container/DockerCommandManager.cs#L266-L269
From the docker rm docs:
--volumes, -v    Remove anonymous volumes associated with the container
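The difference is easy to check locally (a hedged demo; container names are arbitrary):
docker create --name demo postgres     # image declares VOLUME /var/lib/postgresql/data
docker rm demo                         # container is gone, its anonymous volume is not
docker create --name demo2 postgres
docker rm -v demo2                     # -v removes the anonymous volume as well
docker volume ls -qf dangling=true     # only demo's leftover volume remains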
Any idea for a workaround would also be appreciated; I don't see any other way to have these anonymous volumes removed.
Thank you!
To Reproduce
Run a workflow with a service container whose image uses VOLUME, for example (from https://docs.github.com/en/actions/using-containerized-services/creating-postgresql-service-containers):
container: node:10.18-jessie
# Service containers to run with `container-job`
services:
  # Label used to access the service container
  postgres:
    # Docker Hub image
    image: postgres
    # Provide the password for postgres
    env:
      POSTGRES_PASSWORD: postgres
    # Set health checks to wait until postgres has started
    options: >-
      --health-cmd pg_isready
      --health-interval 10s
      --health-timeout 5s
      --health-retries 5
After the workflow has finished, run docker system df -v on the machine that ran the job; the anonymous volume created by the service container is still around:
VOLUME NAME                                                        LINKS   SIZE
b9837a9103b4f4a863a93dd952e7b8777c10274fc428cac4760ae79f13ac5826   0       43.2MB
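Anonymous volumes are the ones with 64-character hex names, so a quick way to count the leftovers on a runner host (a hedged one-liner, not from the runner docs):
docker volume ls -qf dangling=true | grep -cE '^[0-9a-f]{64}$'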
Expected behavior
Anonymous volumes created automatically by the service container should be cleaned up.
Runner Version and Platform
2.291.1, Linux
Hi @brainlock,
Thanks for reporting this issue! I added it to the board, so we're going to work on it in the near future. I'll let you know when we have more information or when we create a PR that fixes this.
I have the same issue...
Does anyone have a workaround for this in the meantime? Seems like this issue stalled out a bit.
@jromero-pg I've got a workaround, but it's purely a bandaid; there may be a better way, or issues with it I haven't run into yet.
My current workaround is to add a cleanup script via ACTIONS_RUNNER_HOOK_JOB_COMPLETED and either prune all unused anonymous volumes, or give the volumes calculable names and delete them by name.
A cronjob might honestly be way easier depending on what your environment looks like.
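For instance, a crontab entry along these lines would do it (just a sketch; adjust the schedule to your environment):
# prune unused volumes every night at 03:00
0 3 * * * docker volume prune -f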
Workarounds
As an example, assume we're in the home directory with an actions-runner folder, on a VM dedicated to GitHub Actions.
Make a runner_cleanup.sh script in your home directory.
Edit the .env file in your runner directory to add the completed hook:
echo "ACTIONS_RUNNER_HOOK_JOB_COMPLETED=/home/runner/runner_cleanup.sh" >> /home/runner/actions-runner/.env
Quick and Dirty method - Prune all unused anonymous volumes
This method prunes every unused anonymous volume on the system. This could be dangerous if your VM is used for things other than GitHub Actions!
Because this method should leave actively used volumes alone, it should be fine to run after every run.
Add this to runner_cleanup.sh
#!/usr/bin/env bash
# Take a lock so hooks from concurrent jobs don't trigger overlapping prunes.
exec 100>/tmp/docker-prune.lock || exit 1
flock -w 100 100 || exit 1
# Prune unused volumes.
docker volume prune -f
The lock is required as otherwise you'll get this error:
Error response from daemon: a prune operation is already running
A More "Proper" Workaround
This method is "cleaner" because you're explicitly naming the volumes and then deleting them by name. I'm not sure it provides that much benefit, though, and it doesn't cover the case where the runner itself crashes. It also requires you to know the volume paths declared in your images.
Add this to runner_cleanup.sh
#!/usr/bin/env bash
# Make the slug unique per service container type.
TARGET_VOLUME="postgres-$GITHUB_SHA-$GITHUB_RUN_ID-$GITHUB_RUN_NUMBER-$GITHUB_RUN_ATTEMPT"
echo "Deleting $TARGET_VOLUME"
docker volume rm -f "$TARGET_VOLUME"
Check the Dockerfile of the image you're using to find which directory it declares as a volume; for postgres it looks like:
VOLUME /var/lib/postgresql/data
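If you'd rather not dig through the Dockerfile, the same information can be read from the image metadata (a one-liner sketch; the tag matches the workflow below):
docker image inspect postgres:16-alpine --format '{{json .Config.Volumes}}'
# prints something like: {"/var/lib/postgresql/data":{}}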
Set the corresponding name in your workflow:
services:
  postgres:
    image: postgres:16-alpine
    volumes:
      - "postgres-${{github.sha}}-${{github.run_id}}-${{github.run_number}}-${{github.run_attempt}}:/var/lib/postgresql/data"
https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners/running-scripts-before-or-after-a-job
EDIT: Changed the flock arguments to wait 100 seconds instead of dying instantly.
We were in the process of setting up our self-hosted runner, but the server ran out of disk space after only a few runs. Checking /var/lib/docker, it held 12GB of unused volumes (201MB each) and 6GB of images.
An example job:
jobs:
  js-lint:
    if: github.event.pull_request.draft == false
    runs-on: self-hosted
    container:
      image: ubuntu:22.04
      options: --cpus 2
    timeout-minutes: 10
Some jobs also have a service defined:
services:
  mysql:
    image: mysql:8
    env:
      MYSQL_ROOT_PASSWORD: user
      MYSQL_DATABASE: user_test
    options: >-
      --health-cmd="mysqladmin ping -h 127.0.0.1 -uroot -ppassword --protocol=tcp"
      --health-interval=10s
      --health-timeout=5s
      --health-retries=20
@ruvceskistefan Any news on this issue?
Any news on the subject @ruvceskistefan? Besides the anonymous volumes, named volumes we create explicitly aren't destroyed either, for instance:
mongo:
  image: mongo:7.0.6
  volumes:
    - mongo_volume:/data/db
  ports:
    - 27017:27017
That means we always have to kill the container to remove mongo_volume before the automatic Stop containers step, which generates warnings on the job. We have MongoDB and Elasticsearch running, and cleaning up means seeing those warnings on every run.
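For reference, a minimal sketch of that kind of cleanup step (assuming the mongo_volume name from the config above; docker ps can filter by volume):
# find whichever container currently holds the named volume, kill it, then remove the volume
cid=$(docker ps -q --filter volume=mongo_volume)
[ -n "$cid" ] && docker kill "$cid"
docker volume rm -f mongo_volume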
Hello @ruvceskistefan, any updates?