Plugin pre-exit hooks should be called when a task is cancelled
If a task is cancelled, plugin pre-exit hooks are sometimes not called. If a pre-exit hook cleans up Docker state or spins down containers, that cleanup might not happen, causing bad state to persist.
Example log output:

```
# Received cancellation signal, interrupting
--
| 🚨 Error: Error tearing down bootstrap: The plugin docker-control pre-exit hook exited with status -1
| 2020-03-25 22:42:27 DEBUG Terminating bootstrap after cancellation with terminated
```
In this case the task exited with a status code of 255, but the agent was still available. Note that when the task exits with a status code of 1, the pre-exit hook appears to be called successfully.
I believe this is also the case for steps that fail because of a timeout (`timeout_in_minutes`), which seems like it should run the pre-exit hook.
EDIT: hook execution on timeout is covered by https://github.com/buildkite/agent/issues/1202.
Any timeline for fixing this issue? Invoking pre-exit after `cancel-grace-period` is necessary for the correctness of the system.
@KevinGrandon Is that a timing issue? Do you need a longer `cancel-grace-period`?
As far as I can tell, when a job is cancelled, the running command receives SIGINT, and when it exits the `pre-exit` hooks are run; it's tested for in https://github.com/buildkite/agent/pull/789

However, if the running command takes more than `cancel-grace-period` (default 10 seconds) to exit after the SIGINT, the entire bootstrap process is torn down, so `pre-exit` can't run.

It looks to me like your docker-control `pre-exit` hook was running, but maybe it was killed by `cancel-grace-period`? And maybe that's why it's "sometimes"?
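To illustrate that failure mode concretely, here's a small sketch (not Buildkite agent code): a command that ignores SIGINT never exits on cancellation, so after `cancel-grace-period` the whole bootstrap, `pre-exit` hooks included, is hard-killed. The self-`kill` below simulates the agent's cancellation signal.

```bash
#!/bin/bash
# Sketch of the failure mode: a command that ignores SIGINT never
# exits on cancellation, so after cancel-grace-period the whole
# bootstrap (pre-exit hooks included) is hard-killed.
trap '' INT           # ignore the cancellation signal
kill -INT "$$"        # simulate the agent sending SIGINT on cancel
echo "still running"  # reached, because the SIGINT was ignored
```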
@leakingtapan We don't have any current plans to invoke `pre-exit` when `cancel-grace-period` has already elapsed. The grace period is for the entire job, and hooks like `pre-exit` are part of that job. Can you instead tune your `cancel-grace-period` and ensure you're not running anything that ignores SIGINT?
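For jobs you control, one way to stay inside the grace period is for the command to trap the agent's SIGINT, stop its work, and clean up promptly. A rough sketch: the background `sleep 300` stands in for the real work, and the self-`kill` subshell (demo only) simulates the agent cancelling the job.

```bash
#!/bin/bash
# Trap the agent's SIGINT, stop the work, and clean up before the
# grace period expires. "sleep 300" stands in for the real command.
cleanup() {
  kill "$child" 2>/dev/null    # stop the background work
  wait "$child" 2>/dev/null
  echo "cleaned up"            # tear down containers, temp files, etc.
  exit 0
}
trap cleanup INT TERM

sleep 300 &                    # the long-running work
child=$!
( sleep 1; kill -INT "$$" ) &  # demo only: simulate cancellation
wait "$child"
```

Because `wait` is interruptible, the trap fires as soon as the signal arrives rather than after the work finishes.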
Extending the grace period is what we do now. However, we cannot ensure that all jobs handle signals correctly, as we are running a multi-tenant environment.
Hi, just chiming in at the request of our TAM. We're suffering from this issue as well, and we can't reliably use redundant build cancellation because skipping the `pre-exit` hook of the `docker-compose` plugin leaves orphan containers on the host, which degrades performance of the Docker daemon over time.
This behaviour was surprising to us, and raising the `cancel-grace-period` isn't a reliable fix, since there's always the risk of a job uploading a large number of artifacts, having a long `post-command` hook, etc.
It would make the most sense if all of the hooks after the `command` were guaranteed to run, especially `pre-exit`, since it's the one that's used most often for system-wide cleanup.
Here's a minimal repro case for a 60-second `cancel-grace-period`:
`.buildkite/pipeline.yml`:

```yaml
steps:
  - label: 'Test'
    command:
      - echo 'Running command, it will take 5 minutes to complete'
      - sleep 300
    plugins:
      - docker-compose:
          run: main
          config: docker-compose.yml
```
`docker-compose.yml`:

```yaml
version: "3.7"
services:
  main:
    image: "python:3.10.11"
    environment:
      DATABASE_URL: 'postgres://test:@postgres/test'
    working_dir: /workdir
    depends_on:
      - postgres
  postgres:
    image: "postgres:12.3"
    environment:
      POSTGRES_DB: test
      POSTGRES_USER: test
      POSTGRES_HOST_AUTH_METHOD: trust
```
`.buildkite/hooks/post-command`:

```bash
#!/bin/bash
set -euo pipefail

BUILDKITE_CANCEL_GRACE_PERIOD=60 # This isn't passed in, so hardcode it here
SLEEP_FOR="$((BUILDKITE_CANCEL_GRACE_PERIOD * 2))"
echo "Simulating post-command longer than grace period, perhaps from long artifact upload (${BUILDKITE_CANCEL_GRACE_PERIOD}s -> ${SLEEP_FOR}s)"
sleep "$SLEEP_FOR"
```
The job is cancelled before completion and killed 60 seconds afterwards. `pre-exit` hooks are not run.
The orphan postgres container:

```
$ sudo docker ps --filter label=com.docker.compose.project=buildkite018aa8fb1715436db26c9aec7fc065a8
CONTAINER ID   IMAGE           COMMAND                  CREATED         STATUS         PORTS      NAMES
8cd5e7a91c79   postgres:12.3   "docker-entrypoint.s…"   3 minutes ago   Up 3 minutes   5432/tcp   buildkite018aa8fb1715436db26c9aec7fc065a8_postgres_1
```
hey all! thanks for the discussion on this, it's been really helpful in shaping our discussion internally.
i can see where you're coming from on this - having old docker containers hang around after a cancelled build because the grace period wasn't long enough is a weird and bad user experience.
we've chatted a bit about this internally, and one of the key issues with changing this is doing so in a way that's backward compatible, and still makes sense for existing users. there's additional complexity internal to the agent and the way it handles signals, but i won't get further into it here other than to call out that it exists.
a potential solution we discussed was applying the cancel-grace-period to each hook in turn, rather than the entire job as a whole - this would mean that we TERM the currently running hook (including the `command` hook), then wait the grace period, then KILL it. then we'd start the next hook, wait the grace period, and KILL it if it hasn't completed. the process would repeat until there were no more hooks left to run. we'd consider making this an experiment, or maybe we wouldn't and we'd just make it default behaviour.
this would be a patch fix to guarantee that each hook has the opportunity to run, but we'd likely try to rethink the hook lifecycle as a whole, and how they respond to cancellation, in a further (major) release.
@francoiscampbell et al, does that sound like a workable solution on your end? would this help?
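To make the per-hook proposal concrete, here's a rough shell sketch (not actual agent code; GNU coreutils `timeout` stands in for the agent's process handling, and the hook scripts below are demo stubs): after cancellation, each remaining hook is started, given its own grace period, then SIGKILLed if it hasn't finished.

```bash
#!/bin/bash
# Rough sketch of the proposed per-hook grace period (not agent code).
# Each remaining hook gets GRACE seconds, then is SIGKILLed, instead
# of the whole job sharing a single grace-period budget.
GRACE=10

# Demo stand-ins for the job's remaining hooks:
hooks_dir="$(mktemp -d)"
printf '#!/bin/bash\necho "pre-exit ran"\n' > "$hooks_dir/pre-exit"
chmod +x "$hooks_dir/pre-exit"

for hook in "$hooks_dir/post-command" "$hooks_dir/pre-exit"; do
  [ -x "$hook" ] || continue                 # hook not defined, skip
  # give the hook at most GRACE seconds, then SIGKILL it
  timeout --signal=KILL "${GRACE}s" "$hook" \
    || echo "$hook exceeded the grace period"
done
```

Under this scheme a slow `post-command` can no longer starve `pre-exit` of its chance to run, though any single hook can still be killed mid-cleanup.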
That does make sense for the `command` hook, but I'm not sure we should apply the grace period to any hooks run afterwards. In my situation, if the Docker Compose stack takes more than `cancel-grace-period` seconds to shut down in the `pre-exit` hook, it would still leave orphan containers.
If a timeout is desirable for hooks that are run after a cancellation, I think it should be a separate option with the ability to disable it.
A potentially hacky workaround is to add a `pre-command` hook that removes all orphaned containers:

```bash
docker ps -aq | xargs docker rm -f
```
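If removing every container on the host is too aggressive (e.g. on agents that run other Docker workloads), a slightly safer variant is to target only Compose-managed containers via the `com.docker.compose.project` label they carry, as seen in the `docker ps` output above. This assumes GNU `xargs` for the `-r` flag (don't run `docker rm` when there's nothing to remove):

```bash
#!/bin/bash
# Remove only containers created by docker-compose (they carry the
# com.docker.compose.project label), leaving other containers alone.
# Assumes GNU xargs for -r (skip the command entirely on empty input).
docker ps -aq --filter "label=com.docker.compose.project" \
  | xargs -r docker rm -f
```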