
Add Warm Pool

Open nitrocode opened this issue 4 years ago • 7 comments

Attempts to close https://github.com/buildkite/elastic-ci-stack-for-aws/issues/822

nitrocode avatar May 04 '21 11:05 nitrocode

cc: @yob @chloeruka

nitrocode avatar May 13 '21 14:05 nitrocode

Hi @nitrocode, thank you for opening this pull request! I agree this could be a valuable feature: it would take the latency of our EC2 UserData and the BootstrapScriptUrl out of the ASG scale-out time.

I’ve started looking at how the warm pool will behave and interact with the rest of our stack. One concern I have is how to prevent warm pool instances from starting their buildkite-agent and pulling work, only for the job to be interrupted when the ASG stops the instance. I’ve looked at the ASG events around instances moving in and out of the warm pool but haven’t been able to come up with a reliable way to stall agent startup. What are your thoughts on this?

keithduncan avatar May 31 '21 01:05 keithduncan

I was hoping there were warm pool specific lifecycle events but there are only instance launching and instance terminating events to hook into.

Perhaps there is a way to detect whether an EC2 instance is part of the warm pool and, if so, skip starting the agent.

Edit: the warm pool does have a different lifecycle event called Warmed:Pending:Wait which could be used to NOT trigger the start of the agent. Or perhaps the Pending:Wait lifecycle event, which is available in warm pool and standard pool lifecycles, could be used to start the agent.
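To make the idea concrete, here is a minimal sketch of the check, assuming the instance's lifecycle state is available as a string (for example the `LifecycleState` value returned by `DescribeAutoScalingInstances`). The function name and the choice of which states allow the agent to start are my assumptions, not anything in the stack today.

```python
# Hypothetical helper, not part of elastic-ci-stack-for-aws.
# Assumption: warm pool lifecycle states carry a "Warmed:" prefix
# (e.g. "Warmed:Pending:Wait"), while instances entering the group
# proper pass through "Pending:Wait" and end up "InService".

WARM_POOL_PREFIX = "Warmed:"
AGENT_START_STATES = {"Pending:Wait", "InService"}


def should_start_agent(lifecycle_state: str) -> bool:
    """Return True only when the instance is headed into service."""
    if lifecycle_state.startswith(WARM_POOL_PREFIX):
        # Instance is being warmed or is parked in the pool: do not pull jobs.
        return False
    return lifecycle_state in AGENT_START_STATES


if __name__ == "__main__":
    for state in ("Warmed:Pending:Wait", "Pending:Wait", "InService"):
        print(state, should_start_agent(state))
```

This only covers the decision itself. How the boot script would learn the state reliably, and what happens if the state changes between the check and the agent starting, is still open.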

nitrocode avatar May 31 '21 12:05 nitrocode

Interestingly, the limitations section of Warm pools for Amazon EC2 Auto Scaling calls out ECS and EKS managed node groups as having a similar issue:

If you try using warm pools with Amazon Elastic Container Service (Amazon ECS) or Elastic Kubernetes Service (Amazon EKS) managed node groups, there is a chance that these services will schedule jobs on an instance before it reaches the warm pool.

The best idea I’ve had so far for managing the systemd unit is to receive those events in a Lambda and lean on the SSM agent to execute the state change on the host. Though I’d still be concerned about the prevalence of race conditions in that setup 🤔

Maybe the best approach here is to ask AWS for guidance on how to warm pool a workload like the buildkite-agent that doesn’t use a load balancer?

keithduncan avatar Jun 01 '21 04:06 keithduncan

Are there plans to add this feature to the next release?

josh-ross-ai avatar Jul 15 '21 01:07 josh-ross-ai

Hi @joshross12, likewise this feature isn’t slated for a particular release, and there are some technical hurdles to overcome before we can land it.

Specifically, my plan is to experiment with using the SSM Agent to manage whether the buildkite-agent systemd service should be running or not. I would welcome suggestions for how that would look and how to ensure the process is reliable.
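As a starting point for discussion, a rough sketch of the Lambda side might look like the following. It assumes the lifecycle event's `detail` carries `EC2InstanceId` and `Destination` fields (with `Destination` being `WarmPool` or `AutoScalingGroup`), that the systemd unit is named `buildkite-agent`, and that the returned dictionary would be passed as keyword arguments to the SSM `send_command` call with the `AWS-RunShellScript` document. The function name is hypothetical, and none of this has been tested against a real ASG.

```python
from typing import Optional

# Assumption: the systemd unit name on the instance.
AGENT_UNIT = "buildkite-agent"


def plan_agent_command(event: dict) -> Optional[dict]:
    """Map a lifecycle event to keyword arguments for SSM send_command.

    Returns None when the event does not call for a state change.
    """
    detail = event.get("detail", {})
    instance_id = detail.get("EC2InstanceId")
    if not instance_id:
        return None

    destination = detail.get("Destination")
    if destination == "WarmPool":
        action = "stop"   # entering the pool: keep the agent from pulling jobs
    elif destination == "AutoScalingGroup":
        action = "start"  # entering service: let the agent pick up work
    else:
        return None

    return {
        "InstanceIds": [instance_id],
        "DocumentName": "AWS-RunShellScript",
        "Parameters": {"commands": [f"systemctl {action} {AGENT_UNIT}"]},
    }
```

Keeping the decision in a pure function makes it easy to unit test without AWS credentials. It does not address the race between the agent starting at boot and the command arriving, which is the part I am least sure about.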

keithduncan avatar Jul 15 '21 02:07 keithduncan

Cross-posting this information from https://github.com/buildkite/elastic-ci-stack-for-aws/issues/822#issuecomment-964356545: we have to remove MixedInstancesPolicy from the ASG config to be able to enable WarmPool for that ASG.

dieend avatar Nov 09 '21 17:11 dieend