Hot-reloadable config, or graceful agent restart?

Open mallyvai opened this issue 8 years ago • 6 comments

Hi BK team! We're currently evaluating Buildkite to see if it's a good fit for our build-deploy process long-term. We need to account for the following situation:

  • We have an error or potential security issue with the buildkite-agent.cfg config file / environment config
  • We want to reload the agents with the rolled-back or updated config
  • We don't want to interrupt any builds that are in-flight

Right now our only option is to force-stop / start agents en masse, but this runs the risk of killing in-flight jobs.

Is there a mechanism to either

  • Hot-reload the configuration (buildkite-agent.cfg in our case), or
  • Issue a command to live running agents to restart and pick up the latest settings/environment/etc. once the current build they're running has completed?

Thank you!

mallyvai avatar May 19 '17 21:05 mallyvai

Nice fit for blue-green deploys if you're running your builds in a cluster of containers or VMs. Roll out the cfg change by building & deploying new containers and gracefully shutting down the running instances (allowing agents to complete their current jobs).

jeremy avatar Jun 08 '17 02:06 jeremy

I'm not sure that we document it very well, but you can gracefully stop an agent and it will wait for builds to finish. That said, I think either live config reload or a SIGUSR1 handler would make a lot of sense!

lox avatar Nov 04 '17 00:11 lox
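
For readers unfamiliar with the pattern, here is a minimal sketch of what "gracefully stop and wait for builds to finish" usually means inside a worker process. It is not the agent's actual implementation: DrainingWorker, request_stop and demo are hypothetical names, and the assumption is simply that a stop request sets a flag which is only checked between jobs.

```python
import os
import signal


class DrainingWorker:
    """Hypothetical worker: a stop request never interrupts the current job."""

    def __init__(self):
        self.stop_requested = False
        self.completed = []

    def request_stop(self, signum=None, frame=None):
        # Usable directly as a signal handler: only record the request.
        self.stop_requested = True

    def run(self, jobs):
        for name, job in jobs:
            # The flag is checked between jobs, never in the middle of one.
            if self.stop_requested:
                break
            job()
            self.completed.append(name)
        return self.completed


def demo(trigger_stop):
    """Run two jobs; the stop request arrives while the first is in flight."""
    worker = DrainingWorker()
    jobs = [
        ("build-1", lambda: trigger_stop(worker)),
        ("build-2", lambda: None),
    ]
    return worker.run(jobs)


if __name__ == "__main__":
    def send_real_sigterm(worker):
        # Assumption for the demo: SIGTERM is the "please stop" signal.
        signal.signal(signal.SIGTERM, worker.request_stop)
        os.kill(os.getpid(), signal.SIGTERM)

    print(demo(send_real_sigterm))
```

The in-flight job is recorded as completed and the next one is never started, which is the behaviour the original question asks for. A restart that picks up new config would then just be "drain, exit, start again".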

Note that while the agent itself supports graceful stops, the init scripts that come with it do not. The timeout before giving up and sending SIGKILL to the agent is very low, and the systemd one defaults to sending SIGTERM to the whole process group, not just the agent, which means all your build scripts would have to handle SIGTERM as well.

uri-canva avatar Dec 19 '17 04:12 uri-canva

Hmmm, we should definitely look at that @uri-canva!

lox avatar Dec 21 '17 00:12 lox

Just had a sneaky chat with @lox about this, and I think the answer for us is to support a graceful restart of the agent, which would reload any config changes if you're using a config file. We've slated this for a potential 4.0 release.

keithpitt avatar Mar 13 '18 01:03 keithpitt

The following systemd service drop-in under /etc/systemd/system/buildkite-agent.service.d/timeout.conf works fine for me to get a graceful (but guaranteed) job cancellation.

[Service]
TimeoutStopSec=5min
KillMode=mixed

MartinNowak avatar Sep 02 '18 15:09 MartinNowak
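
To make the effect of the two settings in the drop-in above more concrete, here is a rough Python sketch of the stop sequence they request: SIGTERM goes to the main process only, the supervisor waits up to a timeout, and whatever is left is then killed with SIGKILL. This is only an illustration under assumptions: systemd tracks processes by control group, while the sketch uses a POSIX process group as a stand-in, and CHILD, start_child and graceful_stop are hypothetical names rather than anything from the agent or systemd.

```python
import os
import signal
import subprocess
import sys

# Stand-in for an agent: on SIGTERM it "finishes its job" and exits cleanly.
CHILD = r"""
import signal, sys, time

def on_term(signum, frame):
    time.sleep(0.5)  # pretend to finish the in-flight job
    sys.exit(0)

signal.signal(signal.SIGTERM, on_term)
print("ready", flush=True)
time.sleep(60)
"""


def start_child():
    """Start the stand-in agent in its own session (and process group)."""
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD],
        stdout=subprocess.PIPE,
        start_new_session=True,
    )
    proc.stdout.readline()  # block until the SIGTERM handler is installed
    return proc


def graceful_stop(proc, timeout):
    """SIGTERM the main process, wait, then SIGKILL the whole group."""
    proc.send_signal(signal.SIGTERM)  # main process only, like KillMode=mixed
    try:
        proc.wait(timeout=timeout)  # plays the role of TimeoutStopSec
        return "graceful"
    except subprocess.TimeoutExpired:
        try:
            os.killpg(proc.pid, signal.SIGKILL)  # everything still running
        except ProcessLookupError:
            pass  # the group exited on its own in the meantime
        proc.wait()
        return "killed"


if __name__ == "__main__":
    print(graceful_stop(start_child(), timeout=5))
```

With a generous timeout the child gets to finish and exit on its own; with a timeout shorter than the job, it is killed, which is the "graceful but guaranteed" trade-off described above.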