
Agent disconnect after idle exits non-zero

Open ohookins opened this issue 2 years ago • 10 comments

Agent version: v3.33.3

Sadly I don't have a whole lot of diagnostic information, but I'm happy to generate more and provide it. For the last few months we have been using the self-termination option as documented in several of the posts about more effective scaling policies. We scale down our Buildkite agent pool solely by configuring the agent to self-terminate after 25 minutes of idle time, with a systemd hook to remove the agent from the ASG.
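For context, the relevant pieces of our setup look roughly like this. This is only a sketch: the 1500-second value matches our 25-minute idle setting, and the drop-in path and terminate-instance script name are placeholders for our own scripts, not necessarily what the Elastic CI Stack ships.

# /etc/buildkite-agent/buildkite-agent.cfg
# ask the agent to disconnect itself after ~25 minutes with no jobs
disconnect-after-idle-timeout=1500

# systemd drop-in for buildkite-agent.service (illustrative only)
[Service]
# when the agent stops, detach the instance from the ASG and shut it down
ExecStopPost=/usr/local/bin/terminate-instance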

Not often, but enough to be annoying, an agent will terminate in the middle of a job. Currently the only records we have of this are in the Buildkite job log which simply reads Exited with status -1 (agent lost).

If there are some basic debug logs we can enable that would capture what's going on here, I'd be all for it.

EDIT: As you can see from my later comments, this is really about the agent exiting non-zero from a disconnection, which triggers systemd to restart the unit - thus causing our early termination when the agent receives another job.

ohookins avatar Nov 11 '21 02:11 ohookins

@ohookins are you using the Elastic CI Stack for AWS? If so, could you share the parameters for your stack ensuring not to include any agent token value?

Do you use on-demand or spot EC2 instances?

Are your agent logs being streamed to CloudWatch? If so, could you grab a copy of the agent logs on an instance where this has happened so we can reconstruct a timeline?

keithduncan avatar Nov 11 '21 03:11 keithduncan

We're not using the Elastic CI Stack, although the config files we use for systemd agent termination, and the agent settings to set that up, are taken directly from the Elastic CI Stack. Unfortunately it would be a large effort to migrate over, so we're continuing with our custom setup.

Currently we are using on-demand instances so we don't have any instances being taken away early - it should only be the agents themselves self-terminating.

I'll try to grab some logs from when this is happening and forward it on.

ohookins avatar Nov 11 '21 03:11 ohookins

2021-11-12 02:11:29 INFO   ip-10-10-128-133-1 All agents have been idle for 1543 seconds. Disconnecting...
2021-11-12 02:11:29 INFO   ip-10-10-128-133-1 Disconnecting...
sleeping for 10 seconds before terminating instance to allow agent logs to drain to cloudwatch...
[buildkite-agent ASCII art banner] https://buildkite.com/agent
2021-11-12 01:45:37 NOTICE Starting buildkite-agent v3.33.3 with PID: 1681

Actually, the logs suggest that the 10-second sleep is too long: the systemd unit for buildkite-agent restarts the service after its configured 5 seconds, so while the instance is scheduled for shutdown, buildkite-agent has technically already restarted and started accepting new jobs.
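To illustrate the race as I understand it (a sketch of the values we copied, not the exact Elastic CI Stack contents):

# buildkite-agent.service (simplified)
[Service]
Restart=on-failure
RestartSec=5
# systemd restarts the agent 5 seconds after a non-zero exit...

# instance termination hook (simplified, script name is a placeholder)
sleep 10
# ...but the instance is only shut down 10 seconds later, leaving a
# 5-second window in which the restarted agent can accept a new job
shutdown -h now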

I'm going to tune this in our environment (the values are based on the Elastic CI Stack scripts, which we simply copied into our deployment) and see if that fixes it. Looking at this, it seems unlikely to be a bug in the agent itself and more likely in the Elastic CI Stack scripts, so I'll close this issue.

ohookins avatar Nov 12 '21 02:11 ohookins

OK, have tracked down the issue a bit closer. I'm wondering whether this may actually be an agent fault. Here is the last bit of the log from the agent itself:

2021-11-15 00:46:16 INFO   ip-10-10-128-137-1 Process with PID: 4816 finished with Exit Status: 0, Signal: nil
2021-11-15 00:46:16 INFO   ip-10-10-128-137-1 Finished job 4ecac449-dddd-450d-90c5-0eacff494e19
2021-11-15 01:12:30 INFO   ip-10-10-128-137-1 All agents have been idle for 1567 seconds. Disconnecting...
2021-11-15 01:12:30 INFO   ip-10-10-128-137-1 Disconnecting...

Here's the log from systemd:

Nov 15 01:12:31 ip-10-10-128-137 systemd[1]: buildkite-agent.service: Control process exited, code=exited status=1
Nov 15 01:12:31 ip-10-10-128-137 systemd[1]: buildkite-agent.service: Failed with result 'exit-code'.

So it does seem like the agent sometimes exits with the wrong (non-zero) exit code on disconnect. Since the systemd unit is configured to restart on error, it restarts the agent, but then the instance terminates, which causes the problem of dropping jobs that have just been started.
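As a stopgap on our side (not a fix for the agent itself), the unit could be told not to treat this exit status as a failure worth restarting for. A sketch only, assuming your systemd version supports RestartPreventExitStatus=:

[Service]
Restart=on-failure
# don't restart the agent when it exits 1 after the idle disconnect
# (note this also suppresses restarts for any other failure that exits 1)
RestartPreventExitStatus=1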

I'll enable debug in the agent and see if I get any further context on this.

ohookins avatar Nov 15 '21 02:11 ohookins

Sadly enabling debug doesn't shed any more light on the issue.

2021-11-15 03:33:28 INFO   ip-10-10-130-161-1 All agents have been idle for 1593 seconds. Disconnecting...
2021-11-15 03:33:28 INFO   ip-10-10-130-161-1 Disconnecting...
2021-11-15 03:33:28 DEBUG  POST https://agent.buildkite.com/v3/disconnect
2021-11-15 03:33:28 DEBUG  ip-10-10-130-161-1 Stopping heartbeats
2021-11-15 03:33:28 DEBUG  ↳ POST https://agent.buildkite.com/v3/disconnect proto=HTTP/1.1 status=200 Δ=49.118495ms

Still get an exit code of 1 after this.

ohookins avatar Nov 15 '21 04:11 ohookins

It's weird, there really only appears to be one place where this would exit with 1 afaict: https://github.com/buildkite/agent/blob/848a053b3131ebced93d829c3215b31bddd19f72/main.go#L123-L126 There are other places it would, due to a malformed command or config, but the fact that the agent actually runs means this is the code path the exit code comes from - and then only if there was some error, which it immediately logs.

I half remember seeing this before, where logging doesn't appear right before an os.Exit because that kills the process before the output can be written to stdout. Might be as easy as adding a pause or flush here.
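Something like this generic Go pattern is what I mean - not the agent's actual logger, just an illustration of how a final message can be lost when the sink is buffered and os.Exit is called:

package main

import (
	"bufio"
	"fmt"
	"os"
)

func main() {
	// a buffered writer standing in for any buffered logging sink
	out := bufio.NewWriter(os.Stdout)
	defer out.Flush() // deferred flushes never run once os.Exit is called

	fmt.Fprintln(out, "some fatal error happened") // sits in the buffer

	// os.Exit terminates the process immediately: no deferred functions run,
	// so the buffered message above may never reach stdout. Flushing
	// explicitly (out.Flush()) before exiting avoids losing it.
	os.Exit(1)
}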

liath avatar Nov 17 '21 19:11 liath

We're also experiencing this issue and have been seeing it a lot more often in the past couple of months. Is there some debugging information we can give Buildkite to help address this issue?

alexdeng-mp avatar Feb 02 '23 00:02 alexdeng-mp

We have used CloudFormation stacks to deploy agents on AWS with 100% on-demand instances, and are facing the same agent lost issue.

farhansaed avatar Feb 02 '23 00:02 farhansaed

This issue is over a year old and nobody from Buildkite is looking into it?

janquijano avatar Feb 17 '23 05:02 janquijano

hey friends! sorry we've taken so long to get back on this - it's one of those ones that got lost in the churn a little bit.

this issue is really difficult to debug because of its nature as something intermittent, and because it's an issue with shutdown - there's not much state left to investigate after it's happened, after all! unfortunately, as it is, there's probably not enough information here for us to investigate deeply, or to issue a fix.

we've done a little bit of looking into things from our end though, and we've come up with a little bit more information. To @liath's comment:

It's weird, there really only appears to be one place where this would exit with 1 afaict: https://github.com/buildkite/agent/blob/848a053b3131ebced93d829c3215b31bddd19f72/main.go#L123-L126

i don't think this is 100% true - my suspicion is that this 1 exit code is coming from a call to log.Fatal, which calls exitFunc (which in this case is just os.Exit) with exit status 1.
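in rough terms, the pattern i mean looks like this - a simplified sketch of the shape, not the agent's real logger code:

package main

import (
	"fmt"
	"os"
)

// Logger with an injectable exit function, roughly the shape described above
// (simplified; not the agent's actual implementation).
type Logger struct {
	exitFunc func(int) // os.Exit in production, swappable in tests
}

// Fatal logs the message and then calls exitFunc with status 1, which is why
// any Fatal-style failure surfaces as an exit code of 1.
func (l *Logger) Fatal(format string, v ...interface{}) {
	fmt.Fprintf(os.Stderr, format+"\n", v...)
	l.exitFunc(1)
}

func main() {
	l := &Logger{exitFunc: os.Exit}
	l.Fatal("disconnect failed: %v", "timeout waiting for the API")
}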

given that the agent logs seem to say that they're disconnecting, it seems likely to me that something in this process is going sideways, and causing the agent to shut down before it's properly disconnected - perhaps there's a timeout being hit or something.

one other thing to note is that we fairly recently added retries to the disconnect process, and made the disconnect process print Disconnected. when it's done disconnecting. If this line isn't present in your logs and you're running a relatively recent version of the agent, it likely means that something is going wrong with the disconnection, potentially a serverside issue on our part. if that turns out to be the case, let me know and i can raise it with the relevant people internally, but note that disconnection is a best-effort process, and we might not be able to get a 100% success rate on this API call.
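for the curious, the general shape of a best-effort retried disconnect is roughly this - a generic sketch only, not the agent's actual code, and retryDisconnect/maxAttempts are made-up names:

package main

import (
	"errors"
	"fmt"
	"time"
)

// retryDisconnect shows the general shape of a best-effort retry: try a few
// times, back off between attempts, and return the last error rather than
// treating a failed disconnect as fatal.
func retryDisconnect(disconnect func() error, maxAttempts int) error {
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = disconnect(); err == nil {
			fmt.Println("Disconnected.")
			return nil
		}
		time.Sleep(time.Duration(attempt) * time.Second) // simple linear backoff
	}
	return err
}

func main() {
	_ = retryDisconnect(func() error { return errors.New("transient API error") }, 3)
}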

i'm sorry that we can't give a more definitive resolution (or a resolution at all) to this issue, but we'd be more than happy to accept PRs making this issue easier to debug.

<3 benno

moskyb avatar Mar 01 '23 02:03 moskyb