agent Plugin pre-exit hooks should be called when a task is cancelled

Plugin pre-exit hooks should be called when a task is cancelled

Open KevinGrandon opened this issue 4 years ago • 9 comments

If a task is cancelled, sometimes plugin pre-exit hooks are not called. If a pre-exit hook cleans up docker state, or spins down containers, it might not happen causing bad state to persist.

Example log output:

# Received cancellation signal, interrupting
--
  | 🚨 Error: Error tearing down bootstrap: The plugin docker-control pre-exit hook exited with status -1
  | 2020-03-25 22:42:27 DEBUG  Terminating bootstrap after cancellation with terminated

In this case the task exited with status of 255, but the agent was still available. Note that when the task exits with at status code of 1 it looks like the pre-exit hook is successfully called.

Mar 25 '20 23:03 KevinGrandon

I believe this is also the case for steps that fail because of timeout (timeout_in_minutes) which seems like it should run the pre-exit hook.

EDIT: hook execution on timeout is covered by https://github.com/buildkite/agent/issues/1202.

May 23 '20 07:05 jl-applied

Any timeline for fix this issue? invoking pre-exit after cancel-grace-period is necessary for the correctness of the system

Sep 30 '20 01:09 leakingtapan

@KevinGrandon Is that a timing issue? Do you need a longer cancel-grace-period? As far as I can tell, when a job is cancelled, the running command receives SIGINT, and when it exits the pre-exit hooks are run; it's tested for in https://github.com/buildkite/agent/pull/789 However if the running command takes more than cancel-grace-period (default 10 seconds) to exit after the SIGINT, the entire bootstrap process is torn down, so pre-exit can't run. It looks to me like your docker-control pre-exit hook was running, but maybe it was killed by cancel-grace-period? And maybe that's why it's “sometimes”?

Oct 02 '20 06:10 pda

@leakingtapan We don't have any current plans to invoke pre-exit when cancel-grace-period has already elapsed. The grace period is for the entire job, and the hooks like pre-exit are part of that job. Can you instead tune your cancel-grace-period and ensure you're not running anything that ignores SIGINT?

Oct 02 '20 06:10 pda

Extending grace period is what we do now. However, we cannot ensure all the job are handling signals correctly, as we are running a multi-tenant environment

Oct 03 '20 00:10 leakingtapan

Hi, just chiming in at the request of our TAM. We're suffering from this issue as well, and we can't reliably use redundant build cancellation because skipping the pre-exit hook of the docker-compose plugin leaves orphan containers on the host, which degrades performance of the Docker daemon over time.

This behaviour was surprising to us, it's not reliable enough for us to up the cancel-grace-period since there's always the risk of a job uploading a large number of artifacts, having a long post-command hook, etc.

It would make the most sense if all of the hooks after the command were guaranteed to run, especially pre-exit, since it's the one that's used most often for system-wide cleanup.

Here's a minimal repro case for a 60-second cancel-grace-period:

.buildkite/pipeline.yml

steps:
  - label: 'Test'
    command:
      - echo 'Running command, it will take 5 minutes to complete'
      - sleep 300
    plugins:
      - docker-compose:
          run: main
          config: docker-compose.yml

docker-compose.yml

version: "3.7"
services:
  main:
    image: "python:3.10.11"
    environment:
      DATABASE_URL: 'postgres://test:@postgres/test'
    working_dir: /workdir
    depends_on:
      - postgres
  postgres:
    image: "postgres:12.3"
    environment:
      POSTGRES_DB: test
      POSTGRES_USER: test
      POSTGRES_HOST_AUTH_METHOD: trust

.buildkite/hooks/post-command

#!/bin/bash

set -euo pipefail

BUILDKITE_CANCEL_GRACE_PERIOD=60  # This isn't passed in, so hardcode it here
SLEEP_FOR="$((BUILDKITE_CANCEL_GRACE_PERIOD * 2))"
echo "Simulating post-command longer than grace period, perhaps from long artifact upload (${BUILDKITE_CANCEL_GRACE_PERIOD}s -> ${SLEEP_FOR}s)"
sleep "$SLEEP_FOR"

The job is cancelled before completion and killed 60 seconds afterwards. pre-exit hooks are not run

The orphan postgres container

$ sudo docker ps --filter label=com.docker.compose.project=buildkite018aa8fb1715436db26c9aec7fc065a8
CONTAINER ID   IMAGE           COMMAND                  CREATED         STATUS         PORTS      NAMES
8cd5e7a91c79   postgres:12.3   "docker-entrypoint.s…"   3 minutes ago   Up 3 minutes   5432/tcp   buildkite018aa8fb1715436db26c9aec7fc065a8_postgres_1

Sep 18 '23 15:09 francoiscampbell

hey all! thanks for the discussion on this, it's been really helpful in shaping out discussion internally.

i can see where you're coming from on this - having old docker containers hang around after a cancelled build because the grace period wasn't long enough is a weird and bad user experience.

we've chatted a bit about this internally, and one of the key issues with changing this is doing so in a way that's backward compatible, and still makes sense for existing users. there's additional complexity internal to the agent and the way it handles signals, but i won't get further into it here other than to call out that it exists.

a potential solution we discussed was applying the cancel-grace-period to each hook in turn, rather than the entire job as a whole - this would mean that we TERM the currently running hook (including the command hook), then wait the grace period, then KILL it. then we'd start the next hook, wait the grace period, and KILL it if it hasn't completed. the process would repeat until there were no more hooks left to run. we'd consider making this an experiment, or maybe we wouldn't and we'd just make it default behaviour.

this would be a patch fix to guarantee that each hook has the opportunity to run, but we'd likely try to rethink the hook lifecycle as a whole, and how they respond to cancellation, in a further (major) release.

@francoiscampbell et al, does that sound like a workable solution on your end? would this help?

Sep 20 '23 04:09 moskyb

That does make sense for the command hook, but I'm not sure we should apply the grace period to any hooks run afterwards. In my situation, if the Docker Compose stack takes more than cancel-grace-period seconds to shut down in the pre-exit hook, it would still leave orphan containers.

If a timeout is desirable for hooks that are run after a cancellation, I think it should be a separate option with the ability to disable it.

Sep 26 '23 19:09 francoiscampbell

A potentially hacky workaround is to add a pre-command hook that will remove all orphaned containers:

docker ps -aq | xargs docker rm -f

Sep 27 '23 16:09 KaoruDev

agent agent copied to clipboard

Plugin pre-exit hooks should be called when a task is cancelled

agent
agent copied to clipboard