elastic-ci-stack-for-aws
EC2 instance terminates when buildkite-agent exits with a non-zero exit code
What happens:
- The agent exits with status 1
- systemd runs the terminate-instance script (and then poweroff) to start taking the instance out of service
- systemd restarts the agent as configured (on-failure)
What should happen:
- The agent exits with status 1
- The exit code is evaluated, and because it is non-zero the instance is kept in service so the agent can be restarted
- systemd restarts the agent
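The desired behaviour above can be sketched as a small decision function. This is only an illustration, not code from the stack: the function name `decide_action` and the returned labels are hypothetical, and it assumes that a clean exit (status 0) should still lead to instance termination, as the current terminate-on-exit behaviour implies.

```python
def decide_action(exit_status: int) -> str:
    """Decide what to do with the instance after the agent exits.

    Hypothetical helper for illustration only; not part of the stack.
    """
    if exit_status != 0:
        # Non-zero exit: keep the instance in service and let systemd
        # restart the agent (on-failure).
        return "restart-agent"
    # Assumed: a clean exit means the instance should be taken out of service.
    return "terminate-instance"


if __name__ == "__main__":
    for status in (0, 1):
        print(status, decide_action(status))
```

The point of the sketch is that the termination step would be gated on the exit status rather than run unconditionally whenever the agent stops.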
Unfortunately, we were unable to address this with the version of systemd on Amazon Linux 2 (see #720). However, we've also realised the issue is unlikely to affect users very often.
The most likely cause of a non-zero agent exit is a config problem (bad token, invalid git mirror path, etc.), and these happen on boot, before the agent has registered to receive work.
We expect users would notice this issue as a new stack booting and then terminating instances before any work is assigned, and there should be useful errors in the CloudWatch logs. It's unlikely that an agent would be assigned work while its instance is about to shut down.