elastic-ci-stack-for-aws

EC2 instance terminates when buildkite-agent exits with a non-zero exit code

Open chloeruka opened this issue 5 years ago • 1 comments

What happens:

  1. The agent exits with status 1
  2. systemd runs the terminate-instance script (and then poweroff) to start taking the instance out of service
  3. systemd restarts the agent, as configured for on-failure restarts

What should happen:

  1. The agent exits with status 1
  2. The exit code is checked; if it is non-zero, the instance is kept in service so the agent can be restarted
  3. systemd restarts the agent
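
The expected behaviour above could be sketched roughly as follows. This is a minimal illustration in Python, not the stack's actual script: the names `Action` and `action_for_agent_exit` are hypothetical, and it assumes a zero exit status still means the instance should be taken out of service.

```python
from enum import Enum


class Action(Enum):
    """Hypothetical outcomes for the instance after the agent exits."""

    TERMINATE_INSTANCE = "terminate-instance"
    KEEP_FOR_RESTART = "keep-for-restart"


def action_for_agent_exit(exit_status: int) -> Action:
    """Decide what to do with the instance once the agent process has exited.

    Assumption: a clean exit (status 0) means the agent stopped on purpose,
    so the instance can be terminated. Any non-zero status is treated as a
    failure that systemd will retry, so the instance stays in service.
    """
    if exit_status != 0:
        return Action.KEEP_FOR_RESTART
    return Action.TERMINATE_INSTANCE


if __name__ == "__main__":
    # Show the decision for a clean exit and for the failing case in this issue.
    for status in (0, 1):
        print(status, action_for_agent_exit(status).value)
```

With this check in place, the status 1 exit described in this issue would leave the instance running for the restart rather than triggering terminate-instance.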

chloeruka Sep 29 '20 07:09

Unfortunately, we were unable to address this with the version of systemd on Amazon Linux 2 (see #720). However, we've also realised the issue is unlikely to affect users very often.

The most likely cause of a non-zero agent exit is a config problem (bad token, invalid git mirror path, etc.), and these happen on boot, before the agent has registered to receive work.

We expect users would notice this issue as a new stack booting and then terminating instances before any work is assigned, and there should be useful errors in CloudWatch Logs. It's unlikely to result in an agent being assigned work while the instance is about to shut down.

yob Oct 12 '20 03:10