
New way to configure stacks to use spot instances when possible, but fallback to OnDemand if necessary

Open yob opened this issue 3 years ago • 8 comments

In #710 (released in 5.1.0) we made a few changes to how spot prices and instance types are configured:

  1. deprecated the SpotPrice param
  2. added the OnDemandPercentage param, which sets the percentage of instances that should be OnDemand
  3. allowed InstanceType to specify up to 4 types (instead of 1); see the sketch below for roughly how these map onto the auto scale group
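Under the hood these parameters shape the agent auto scale group's MixedInstancesPolicy. Here's a minimal sketch of roughly how that fits together; this is not the stack's actual template, and the resource names, subnet reference, and the specific numbers are purely illustrative:

```yaml
# Illustrative sketch of an ASG mixing on-demand and spot capacity.
# Not the stack's real template; names and figures are assumptions.
AgentAutoScaleGroup:
  Type: AWS::AutoScaling::AutoScalingGroup
  Properties:
    MinSize: "0"
    MaxSize: "10"
    VPCZoneIdentifier: !Ref Subnets   # placeholder subnet list parameter
    MixedInstancesPolicy:
      InstancesDistribution:
        # Roughly what OnDemandPercentage controls: the share of capacity
        # launched as on-demand, with the remainder requested as spot.
        OnDemandPercentageAboveBaseCapacity: 20
        # The stack's actual allocation strategy may differ.
        SpotAllocationStrategy: capacity-optimized
      LaunchTemplate:
        LaunchTemplateSpecification:
          LaunchTemplateId: !Ref AgentLaunchTemplate
          Version: !GetAtt AgentLaunchTemplate.LatestVersionNumber
        # Up to four instance types, analogous to the InstanceType parameter.
        Overrides:
          - InstanceType: m5.large
          - InstanceType: m5a.large
```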

The intention was to get "always use a minimum of x% OnDemand, and for the rest use Spot when you can, but fallback to OnDemand if spot isn't available".

It was hard to test though, as we can't predict when Spot instances will be unavailable.

Since shipping 5.1.0 and seeing OnDemandPercentage in use, it seems what we actually have is "always use a minimum of x% OnDemand, and for the rest use Spot when you can, and when Spot isn't available don't fallback to OnDemand - just keep trying to launch Spot until it works".

Some price-sensitive users of the stack might prefer the current behaviour, but I suspect many users would prefer to use Spot when available and OnDemand if they must. Engineers sitting around waiting for a build to complete might be more expensive than paying for a few OnDemand instances.

Would we need a new stack parameter for this?

yob avatar Jun 02 '21 23:06 yob

We had an issue crop up a couple of times recently where our region just didn't have spot instances available, so our jobs hung out for 10+ minutes waiting. This wasn't a case of getting outbid, there just weren't instance types at our desired size available. Given that spot is usually perfect for us (none of our jobs take longer than 5-6 minutes), being able to fall back to on demand if spot isn't there would be really useful.

geoffharcourt avatar Jun 04 '21 20:06 geoffharcourt

@geoffharcourt I’ve started looking at this use case, could you share some details on your current auto scaling configuration?

I’m looking for details on the instance types and MaxSize of your auto scale group, what split of OnDemand vs Spot you have chosen, how long you usually wait for your jobs to start, and how long you’d be prepared to wait before incurring the additional cost of an on-demand instance. Thanks!

keithduncan avatar Jun 08 '21 05:06 keithduncan

Hi @keithduncan, thanks for looking into this!

Our setup was:

100% Spot, 0-150 instances, m5.large

Normally during the work day our wait for jobs to run is negligible, but if there are no agents and we're under the instance cap it can take a minute or so for instances to get going. The issue is when instances are simply not available (the ASG event message was not that we had been outbid, but rather that there were no instances of the type available at all).

We've only seen this scenario three times, but all three have been in the past 3 weeks. In each occurrence we responded by adding a second instance type, which was not ideal because it meant that once the scarcity had subsided the ASG was almost always picking the non-m5.large instance type, which was either slower or more expensive.

For us, if we had gone 2 minutes and been unable to provision, I think we'd be OK going to on-demand. If there was a way to prioritize spot instances by type, that would fix this issue for us: we could prefer m5.large and fall back to something else like m5a.large, without almost always ending up on the fallback due to the way ASGs pick by available capacity.

geoffharcourt avatar Jun 08 '21 09:06 geoffharcourt

Thanks @geoffharcourt this is really useful to know.

100% Spot, 0-150 instances, m5.large

Do you run more than 1 AgentsPerInstance on these instances?

(the ASG event message was not that we had been outbid, but rather that there were no instances of the type available at all).

How many availability zones did your spot request cover? Is using spot instances in a different region a valid way to handle this for your use case?

If there was a way to prioritize spot instances by type that would fix this issue for us

There is, but it's not exposed through our parameters. If you’re comfortable running a template fork, you can change SpotAllocationStrategy to capacity-optimized-prioritized, which is documented to take the instance priority into consideration, though it does still optimise for capacity first.
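For anyone curious, that fork amounts to something like the fragment below (illustrative only, assuming the mixed instances policy shape sketched earlier; the actual template's resource names and surrounding properties will differ). Overrides are listed in priority order, and the strategy change makes that order count on a best-effort basis while still favouring pools with spare capacity:

```yaml
# Illustrative fragment of a template fork: honour the Overrides order as a
# priority while still optimising for capacity first.
MixedInstancesPolicy:
  InstancesDistribution:
    OnDemandPercentageAboveBaseCapacity: 0
    SpotAllocationStrategy: capacity-optimized-prioritized
  LaunchTemplate:
    LaunchTemplateSpecification:
      LaunchTemplateId: !Ref AgentLaunchTemplate
      Version: !GetAtt AgentLaunchTemplate.LatestVersionNumber
    Overrides:
      - InstanceType: m5.large    # preferred
      - InstanceType: m5a.large   # fallback when m5.large spot capacity is scarce
```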

keithduncan avatar Jun 08 '21 20:06 keithduncan

Hi @keithduncan we have 1 agent per instance. We're open to using bigger instances and multiple agents per instance, but that's not something we've tried in the past.

We have three availability zones specified (all three in our region, us-east-2). We could switch regions if that was necessary. There's nothing region-dependent about our CI setup, so we could run in whatever region was most cost-effective and reliable.

geoffharcourt avatar Jun 08 '21 21:06 geoffharcourt

We had an issue crop up a couple of times recently where our region just didn't have spot instances available, so our jobs hung out for 10+ minutes waiting. This wasn't a case of getting outbid, there just weren't instance types at our desired size available. Given that spot is usually perfect for us (none of our jobs take longer than 5-6 minutes), being able to fall back to on demand if spot isn't there would be really useful.

I'd like to also add a +1 to this request. We have a couple of different ASGs that were initially set up with 100% spot requests in one region across AZs, but after occasional spot unavailability forced us to manually adjust the ASG's allocation percentage, we switched to 100% on-demand to be safe.

tremaineeto avatar Jan 24 '22 20:01 tremaineeto

After upgrading past 5.1 we've had several long waits for spot capacity. https://github.com/buildkite/elastic-ci-stack-for-aws/issues/700 might help somewhat, but e.g. 'detect that spot instances have not launched after a minute and switch to ondemand for X hours' would solve it more reliably.
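For what it's worth, one hypothetical way to wire up the detection half of that (not something the stack provides today) would be an EventBridge rule on the ASG's launch-failure events feeding a function that temporarily raises the on-demand share; the resource names, including FallbackToOnDemandFunction, are made up for illustration:

```yaml
# Hypothetical wiring, not part of the stack: react to failed launches
# (e.g. UnfulfillableCapacity on spot requests) by invoking a function
# that could temporarily raise OnDemandPercentageAboveBaseCapacity.
SpotLaunchFailureRule:
  Type: AWS::Events::Rule
  Properties:
    EventPattern:
      source:
        - aws.autoscaling
      detail-type:
        - EC2 Instance Launch Unsuccessful
      detail:
        AutoScalingGroupName:
          - !Ref AgentAutoScaleGroup   # illustrative ASG resource name
    Targets:
      - Arn: !GetAtt FallbackToOnDemandFunction.Arn   # hypothetical Lambda
        Id: fallback-to-on-demand
```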

DanielHeath avatar Mar 08 '22 21:03 DanielHeath

On 5.9.0 I am seeing annoying cases of builds waiting for agents for unbounded amounts of time, as the auto-scaling simply gives up when spot instances are not available:

The AWS console shows repeated errors that are not causing the auto-scaling logic to fall back to OnDemand:

Launching a new EC2 instance. Status Reason: Could not launch Spot Instances. UnfulfillableCapacity - Unable to fulfill capacity due to your request configuration. Please adjust your request and try again. Launching EC2 instance failed.

Is this the intended behavior? If so, low values of OnDemandPercentage are effectively unusable...

huguesb avatar Jun 10 '22 19:06 huguesb