elastic-ci-stack-for-aws icon indicating copy to clipboard operation
elastic-ci-stack-for-aws copied to clipboard

Max instance lifetime

Open nitrocode opened this issue 3 years ago • 5 comments

Related to PR https://github.com/buildkite/elastic-ci-stack-for-aws/pull/836

If this number is set to 86400 or greater, the instance will terminated.

nitrocode avatar May 06 '21 03:05 nitrocode

Taking a quote from #836:

Isn't there a lifecycle hook in the stack that allows jobs to finish before continuing the termination ? If so, I believe the instance refresh respects this... but perhaps there isn't a hook. I don't see one defined in the cf stack.

In v4 of the stack and earlier there was a lifecycle hook to help make instance shutdown more graceful and give the agents time to finish jobs. It was essential in those versions, as we used AWS driven autoscaling, and any instance (even those running jobs) could be marked for termination if the ASG decided it wanted to scale down.

In v5 we fully switched to the current method of scale down, where the agent self terminates the instance after an idle period. We removed the lifecycle hooks, with the understanding that they weren't needed any more.

It's possible that we may want to add them back though, as there are still scenarios where AWS can decide to terminate an instance and the instance could be running job(s). One possibility is AZ rebalancing (which we worked a round in #751). Another could be introducing instance lifetimes.

Without lifecycle hooks, I think this PR is highly likely to interrupt running jobs. Sorry!

I think it's a neat feature though, and think we have customers that would use it.

yob avatar May 11 '21 14:05 yob

I would love to use this at Robinhood. Right now I have to set MinSize to 0 and hope that people don't use our agents all night to force us to recycle agents, but something like this would be really useful for us.

ptarjan avatar Aug 26 '21 22:08 ptarjan

@nitrocode there's a merge conflict, would you mind updating your branch?

jessebye avatar Sep 14 '21 20:09 jessebye

@keithduncan @yob anything blocking this? Just curious if this needs changes or is good to ship.

Without lifecycle hooks, I think this PR is highly likely to interrupt running jobs. Sorry!

Couldn't this still be added? The default behavior would be the same, and if a user enabled this, they are doing so at their own risk...

jessebye avatar Sep 23 '21 22:09 jessebye

anything blocking this?

Better support for ASG initiated instance termination is the blocker I’m afraid.

Couldn't this still be added? The default behavior would be the same, and if a user enabled this, they are doing so at their own risk...

They could! And they are the approach I’m expecting us to use for this.

Unfortunately I’m not comfortable merging this without a reliable way to handle ASG initiated instance termination. The absence of this today already causes strife so I’m very hesitant to add another source of it. For those who are comfortable with the risk, it’s possible to apply a forked copy of this template to your stacks.

I’d also be thrilled to collaborate on a Lambda + SSM based lifecycle hooks implementation for the ASG events 😄 I believe it could solve several problems at once (e.g. #927 #838) if proven to be reliable, and enable features like agent instance quarantine/cordoning.

keithduncan avatar Sep 23 '21 22:09 keithduncan