
`NewInstancesProtectedFromScaleIn` causing ASGs to take ages to update.

Open toothbrush opened this issue 2 years ago • 6 comments

Good day! 👋

Describe the bug

We use realestate's stackup to manage rollouts of the aws-stack.yml you provide. Mostly works great. In https://github.com/buildkite/elastic-ci-stack-for-aws/commit/178c253b81e1aca6647ff10c8d49fb73bf6d8cbe you added `NewInstancesProtectedFromScaleIn: true` to the ASG, but the behaviour i'm now seeing is that when i make a change to the AWS Elastic stack (e.g. an updated AMI ID), the old ASG takes ages (1h+) to delete/stabilise, since its members are protected from scale-in.

Steps To Reproduce

  1. Spin up the AWS Elastic stack.
  2. Wait for it to be ready and CREATE_COMPLETE.
  3. Change a parameter, e.g. the AMI ID.
  4. A new LaunchTemplate will be created and the old one will attempt to delete (along with its instances), but the deletion will hang for a long time.
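
One way to confirm that scale-in protection is what holds up the old ASG is to count its protected instances while the update hangs. The following is a minimal sketch, not part of the stack: `protection_summary` is a hypothetical helper, and `client` is assumed to behave like `boto3.client("autoscaling")`.

```python
from typing import Optional


def protection_summary(client, asg_name: str) -> Optional[dict]:
    """Summarise scale-in protection for one Auto Scaling group.

    `client` is assumed to behave like boto3.client("autoscaling").
    Returns None when no group with that name exists.
    """
    groups = client.describe_auto_scaling_groups(
        AutoScalingGroupNames=[asg_name]
    )["AutoScalingGroups"]
    if not groups:
        return None

    group = groups[0]
    instances = group.get("Instances", [])
    return {
        # Group-level default applied to instances as they launch.
        "new_instances_protected": group.get("NewInstancesProtectedFromScaleIn", False),
        "total": len(instances),
        # Instances that the ASG will refuse to terminate on scale-in.
        "protected": sum(1 for i in instances if i.get("ProtectedFromScaleIn")),
    }
```

If `protected` stays equal to `total` on the old ASG during the update, the group cannot scale in until protection is removed or the instances terminate for some other reason.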

Expected behavior

Previously, updates were pretty snappy because the old ASG members would just be terminated.

Stack parameters (please complete the following information):

  • AWS Region: us-east-1
  • Version: v5.7.2

Dec 22 '21 00:12 toothbrush

Ah just looking around, maybe i'm in fact being re-bitten by https://github.com/buildkite/elastic-ci-stack-for-aws/issues/927?

In any case i'm not convinced that the instance protection thing is good for us.

Dec 22 '21 00:12 toothbrush

Hey hey @toothbrush. Thanks for reporting, we'll look into it!

Jan 11 '22 04:01 eleanorakh

Somewhat related is https://github.com/buildkite/elastic-ci-stack-for-aws/pull/768

Also been working to create custom AMIs and update the stack via ImageIdParameter. This causes a new ASG to be created, which is a problem in my case, since the instances in the old ASG will be terminated, which causes in-progress jobs to fail.

Jan 24 '22 15:01 freewil

> when i make a change to the AWS Elastic stack (e.g. an updated AMI ID), the old ASG takes ages (1h+) to delete/stabilise, since the members are protected from scale-in.

I've run into this issue myself when i want to scale down a stack rapidly (~400 instances to 0) via manually changing the ASG desired count/min/max values. It hasn't been a major issue for me as this is typically only done when launching new stacks to replace old stacks. I simply go to the instance management tab for the ASG in the AWS console and manually remove scale-in protection to speed up the scale down.
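
The console step can also be scripted. This is a rough sketch under a few assumptions: `remove_scale_in_protection` and `chunked` are hypothetical helpers, `client` is assumed to behave like `boto3.client("autoscaling")`, and the batch size of 50 reflects the per-call limit on instance IDs mentioned elsewhere in this thread.

```python
from typing import Iterator, List

BATCH_SIZE = 50  # assumed per-call limit on instance IDs


def chunked(items: List[str], size: int = BATCH_SIZE) -> Iterator[List[str]]:
    """Yield consecutive slices of `items`, each at most `size` long."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def remove_scale_in_protection(client, asg_name: str) -> int:
    """Clear scale-in protection on every protected instance in one ASG.

    `client` is assumed to behave like boto3.client("autoscaling").
    Returns the number of instances that were unprotected.
    """
    groups = client.describe_auto_scaling_groups(
        AutoScalingGroupNames=[asg_name]
    )["AutoScalingGroups"]
    if not groups:
        return 0

    protected = [
        i["InstanceId"]
        for i in groups[0].get("Instances", [])
        if i.get("ProtectedFromScaleIn")
    ]
    # Send the instance IDs in batches to stay within the per-call limit.
    for batch in chunked(protected):
        client.set_instance_protection(
            InstanceIds=batch,
            AutoScalingGroupName=asg_name,
            ProtectedFromScaleIn=False,
        )
    return len(protected)
```

Called as `remove_scale_in_protection(boto3.client("autoscaling"), "my-old-asg")` (the group name here is a placeholder), it unprotects every instance regardless of whether a job is still running on it, so it only suits stacks that are being drained or replaced.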

Jan 24 '22 15:01 freewil

I am running into this issue with v5.9.0 of the stack and have had to repeatedly go into the AWS console to manually disable protection for instances in the old stacks to allow the update to complete. This is obnoxious! It makes even the smallest configuration changes a major PITA, especially when some stacks have thousands of instances and AWS only allows removal of scale-in protection in batches of 50...

Please fix this asap.

Jun 10 '22 19:06 huguesb

We made a hacky but effective fix for this problem by co-opting the AzRebalancingSuspenderFunction to remove scale-in-protection for running instances when the stack is updated or deleted. We're able to do this in our solution because we fork the ElasticCI template for other reasons. This also required changes to the function's role/permission and timeout/duration.

eg:

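              import boto3

              client = boto3.client('autoscaling')
              props = event['ResourceProperties']
              asg_name = props['AutoScalingGroupName']

              if event['RequestType'] in ('Delete', 'Update'):
                # describe_auto_scaling_instances is paginated, so walk every page.
                paginator = client.get_paginator('describe_auto_scaling_instances')
                found = [i for page in paginator.paginate() for i in page['AutoScalingInstances']]
                instances = [i['InstanceId'] for i in found if i['AutoScalingGroupName'] == asg_name]
                # set_instance_protection takes a limited number of IDs per call, so send batches of 50.
                for start in range(0, len(instances), 50):
                  response = client.set_instance_protection(InstanceIds=instances[start:start + 50], AutoScalingGroupName=asg_name, ProtectedFromScaleIn=False)
              else:
                # On create, keep the function's original behaviour of suspending AZ rebalancing.
                response = client.suspend_processes(AutoScalingGroupName=asg_name, ScalingProcesses=['AZRebalance'])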
    
etc

Aug 25 '22 04:08 gitlon