Hot-reloadable config, or graceful agent restart?
Hi BK team! We're currently evaluating Buildkite to see if it's a good fit for our build-deploy process long-term. There's one situation we need to account for:
- We have an error or potential security issue with the buildkite-agent.cfg config file / environment config
- We want to reload the agents with the rolled-back or updated config
- We don't want to interrupt any builds that are in-flight
Right now our only option is to force-stop / start agents en masse, but this runs the risk of killing in-flight jobs.
Is there a mechanism to either:
- Hot-reload the configuration (buildkite-agent.cfg in our case)
- Issue a command to live running agents to restart and pick up the latest settings/environment/etc. once the build they're currently running has completed?
Thank you!
Nice fit for blue-green deploys if you're running your builds in a cluster of containers or VMs. Roll out the cfg change by building & deploying new containers and gracefully shutting down the running instances (allowing agents to complete their current jobs).
I'm not sure that we document it very well, but you can gracefully stop an agent and it will wait for builds to finish. That said, I think either live config reload or handling a USR1 signal would make a lot of sense!
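For anyone scripting this, here is a minimal sketch of a "stop gracefully, but only wait so long" helper. It assumes the agent treats a single SIGTERM as a request to finish its current job and then exit; that behaviour is worth checking against the agent docs for your version. The `graceful_stop` name and the timeout values are hypothetical, not part of the agent.

```shell
#!/usr/bin/env bash
# graceful_stop PID TIMEOUT_SECONDS
# Hypothetical helper: send SIGTERM, poll until the process exits,
# and fall back to SIGKILL once TIMEOUT_SECONDS have elapsed.
# Returns 0 if the process exited on its own, 1 if it had to be killed.
graceful_stop() {
  local pid="$1" timeout="$2" waited=0

  # If the signal cannot be delivered, the process is already gone.
  kill -TERM "$pid" 2>/dev/null || return 0

  # `kill -0` sends no signal; it only checks that the process still exists.
  while kill -0 "$pid" 2>/dev/null; do
    if [ "$waited" -ge "$timeout" ]; then
      kill -KILL "$pid" 2>/dev/null
      return 1
    fi
    sleep 1
    waited=$((waited + 1))
  done
  return 0
}
```

Called as something like `graceful_stop "$agent_pid" 300`, where `agent_pid` is however you track the agent's PID, this gives an in-flight job up to five minutes before the process is forcibly killed.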
Note that while the agent itself supports graceful stops, the init scripts that come with it do not. The timeout before giving up and sending SIGKILL to the agent is very low, and the systemd unit defaults to sending SIGTERM to every process in the service's control group, not just the agent, which means all your build scripts would have to handle SIGTERM as well.
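To illustrate what "handle SIGTERM" means for a build script, here is a rough bash sketch. The names `run_build_step`, `on_term`, and the marker file are made up for the example, and the `sleep` stands in for real build work.

```shell
#!/usr/bin/env bash
# Sketch of a build step that handles SIGTERM itself.
# Assumption: run_build_step, on_term and the marker file are illustrative names.

on_term() {
  # Stop the stand-in workload, record that cleanup ran, then exit
  # with 143 (128 + 15, the conventional status for SIGTERM).
  kill "$work_pid" 2>/dev/null
  echo "cleaned up" > "$marker_file"
  exit 143
}

run_build_step() {
  marker_file="$1"
  trap on_term TERM

  # Stand-in for real build work. Running it in the background and
  # blocking on `wait` lets bash run the trap as soon as the signal
  # arrives, instead of after the foreground command finishes.
  sleep 30 &
  work_pid=$!
  wait "$work_pid"
}
```

Without a trap like this, a SIGTERM delivered to the script simply terminates it, which is how a stop aimed at the agent can end up killing the job it was supposed to wait for.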
Hmmm, we should definitely look at that @uri-canva!
Just had a sneaky chat with @lox about this, and I think the answer for us is to support a graceful restart of the agent, which would reload any config changes if you're using a config file. We've slated this for a potential 4.0 release.
The following systemd service drop-in under /etc/systemd/system/buildkite-agent.service.d/timeout.conf works fine for me to get a graceful (but guaranteed) job cancellation.
[Service]
TimeoutStopSec=5min
KillMode=mixed