elastic-ci-stack-for-aws

On Windows, spot termination leaves hanging agents if "stop-agent-gracefully" blocks for more than the allotted 2 minutes

Open yob opened this issue 3 years ago • 0 comments

#733 describes a bug on Linux that has since been fixed, but the same bug remains on Windows.

  1. The stack is using spot instances
  2. A spot termination is pending, and lifecycled sends a stop signal to the agent, which either quits immediately if no job is running or schedules a quit for when the current job finishes
  3. If the current job is still running after two minutes, the instance is hard terminated and the agent doesn't deregister with the Buildkite servers. The job then looks like it's still running for ~5 minutes, until the Buildkite servers mark the agent as lost and cancel the job.
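
To make the timing in the steps above concrete, here is a minimal Python sketch of one possible mitigation: ask for a graceful stop, wait for most of the two-minute window, then escalate to a forced stop while there is still time for the agent to deregister. This is not the fix from #737 and not how the stack or lifecycled is actually implemented. The function name, the three callables, and the 20-second reserve are all hypothetical; only the two-minute window comes from this issue.

```python
import time


def stop_before_deadline(
    request_graceful_stop,   # hypothetical: asks the agent to quit after its current job
    request_forced_stop,     # hypothetical: makes the agent cancel its job and exit now
    is_running,              # hypothetical: returns True while the agent is still alive
    budget_seconds=120.0,    # the two-minute spot termination window from this issue
    reserve_seconds=20.0,    # assumed time kept back so the agent can deregister
    poll_seconds=1.0,
    clock=time.monotonic,
    sleep=time.sleep,
):
    """Return "graceful" if the agent exited on its own, "forced" if we escalated."""
    start = clock()
    request_graceful_stop()
    # Stop waiting for the graceful path before the whole budget is spent.
    graceful_deadline = start + budget_seconds - reserve_seconds
    while is_running():
        if clock() >= graceful_deadline:
            request_forced_stop()
            return "forced"
        sleep(poll_seconds)
    return "graceful"
```

The point of the reserve is that a job cancelled by the agent itself is reported right away, rather than appearing to run until the agent is marked as lost. The `clock` and `sleep` parameters exist only so the loop can be exercised without real waiting.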

#733 was fixed for Linux in #737, and we need to try a similar approach on Windows.

yob, Oct 26 '20 04:10