terraform-aws-fck-nat

Improve spot HA by utilising ASG capacity rebalance

Open kieranbrown opened this issue 1 year ago • 2 comments

Capacity rebalance helps by proactively trying to replace Spot Instances before they are interrupted. Full docs: https://docs.aws.amazon.com/autoscaling/ec2/userguide/ec2-auto-scaling-capacity-rebalancing.html


Current Behaviour: When a spot instance receives its 2-minute interruption warning, nothing happens. The instance is terminated after 2 minutes, and a new instance is started only after the original has been terminated.

New Behaviour: When a spot instance receives its 2-minute interruption warning, the ASG immediately provisions a new instance, which will hopefully boot and move the floating EIP before the original is terminated. With this approach, there is minimal downtime when spot instances are terminated.
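For readers unfamiliar with the setting, here is a minimal sketch of an ASG with capacity rebalance enabled. It is not the module's actual code: the resource name, the two variables, the instance types, the allocation strategy, and the `max_size` of 2 (assumed here to leave room for a replacement instance) are all illustrative assumptions.

```hcl
# Hypothetical inputs, not variables from the module.
variable "launch_template_id" {
  type = string
}

variable "subnet_id" {
  type = string
}

resource "aws_autoscaling_group" "nat_example" {
  name                = "fck-nat-example"
  min_size            = 1
  max_size            = 2 # assumption: headroom for the replacement instance
  desired_capacity    = 1
  vpc_zone_identifier = [var.subnet_id]

  # Ask the ASG to launch a replacement when a Spot Instance is at
  # elevated risk of interruption, instead of waiting for termination.
  capacity_rebalance = true

  mixed_instances_policy {
    instances_distribution {
      # Run everything on Spot in this sketch.
      on_demand_base_capacity                  = 0
      on_demand_percentage_above_base_capacity = 0
      spot_allocation_strategy                 = "price-capacity-optimized"
    }

    launch_template {
      launch_template_specification {
        launch_template_id = var.launch_template_id
        version            = "$Latest"
      }

      # Example instance types only; a second type gives the ASG a fallback pool.
      override {
        instance_type = "t4g.nano"
      }

      override {
        instance_type = "t4g.micro"
      }
    }
  }
}
```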


You can test this out using AWS Fault Injection. In the EC2 management console, click Spot Requests in the sidebar, select a spot request, click Actions, then choose Initiate Interruption.
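The same interruption can probably be scripted rather than clicked through. The sketch below defines a Fault Injection experiment template in Terraform; the IAM role variable, the tag used to select the instance, and all names are assumptions for illustration, and the template still has to be started separately once created.

```hcl
# Hypothetical inputs, not variables from the module.
variable "fis_role_arn" {
  type        = string
  description = "IAM role that AWS Fault Injection assumes to run the experiment"
}

variable "instance_name_tag" {
  type        = string
  description = "Value of the Name tag on the NAT spot instance"
}

resource "aws_fis_experiment_template" "spot_interruption" {
  description = "Send a spot interruption notice to one NAT instance"
  role_arn    = var.fis_role_arn

  # No automatic stop condition in this sketch.
  stop_condition {
    source = "none"
  }

  action {
    name      = "interrupt-spot"
    action_id = "aws:ec2:send-spot-instance-interruptions"

    # Give the instance the usual 2-minute warning before interruption.
    parameter {
      key   = "durationBeforeInterruption"
      value = "PT2M"
    }

    target {
      key   = "SpotInstances"
      value = "nat-spot-instances"
    }
  }

  target {
    name           = "nat-spot-instances"
    resource_type  = "aws:ec2:spot-instance"
    selection_mode = "COUNT(1)"

    # Assumption: the instance can be identified by its Name tag.
    resource_tag {
      key   = "Name"
      value = var.instance_name_tag
    }
  }
}
```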

kieranbrown • Jan 23 '24 12:01

The only thing I would change is the ha_additional_instance_types variable: would it be better to use an empty array by default? Otherwise, after an update, users would have an ASG with a behavior and an instance type that they did not choose.

This pull request should fix the problems I'm having while using fck-nat. My spot instance often fails to start because there is no spot capacity available. For example, these are the errors I often get when using the fck-nat instance:

[Screenshot: 2024-05-21 13 20 52]

gabrieleolmi • May 21 '24 16:05

@GabrieleOlmi

The only thing I would change is the ha_additional_instance_types variable, would it be better to use an empty array by default? Otherwise, after an update, users would have an ASG with a behavior and an instance that they did not choose.

It's been a while since I last looked at this, but IIRC capacity rebalance requires a mixed instance policy, and within a mixed instance policy you need to define a minimum of 2 instance types. Adding ha_additional_instance_types and defaulting it to the next cheapest instance type was the only sensible approach I could think of.

Defaulting ha_additional_instance_types to an empty array would cause an error if end users set use_spot_instances = true without explicitly setting ha_additional_instance_types to their preferred failover instance.
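If an empty default were preferred anyway, one option would be to fail early with a clear message instead of an ASG error. This is only a sketch: the variable names come from this thread, but the types, defaults, and validation rule are assumptions, and a validation that refers to another variable needs Terraform 1.9 or later.

```hcl
variable "use_spot_instances" {
  type    = bool
  default = false
}

variable "ha_additional_instance_types" {
  type    = list(string)
  default = [] # the empty default suggested above

  # Reject the combination that would otherwise fail when the ASG is created.
  validation {
    condition     = !var.use_spot_instances || length(var.ha_additional_instance_types) > 0
    error_message = "Set ha_additional_instance_types to at least one failover instance type when use_spot_instances is true."
  }
}
```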

Perhaps some documentation in the README to clear up this behaviour would be enough.

@RaJiska it would be good to hear your thoughts on this.

kieranbrown • May 22 '24 21:05