elastic-ci-stack-for-aws icon indicating copy to clipboard operation
elastic-ci-stack-for-aws copied to clipboard

Support a variable number of AvailabilityZones

Open jmchuster opened this issue 5 years ago • 21 comments

If i try to set the value for AvailabilityZones to a single zone, say us-west-2a, it fails with

There was an error creating this change set
Template error: Fn::Select cannot select nonexistent value at index 1

I only need a single availability zone here, because the auto-scale group will try rebalancing instances across zones, killing my long-running jobs (24 hours +).

jmchuster avatar Aug 20 '20 23:08 jmchuster

Oh, weird, sorry! Sounds like a bug. Looks like we presume two availability zones in which to create subnets:

https://github.com/buildkite/elastic-ci-stack-for-aws/blob/a89a3f5c95dafd94030feaf93c9150987345a272/templates/aws-stack.yml#L546-L574

What happens if you specify us-west-2a,us-west-2a? Is that a reasonable workaround?

Another possible workaround, could you suspend the AZRebalance scaling process? This wouldn't stick around between updates, though.

Otherwise, sounds like a bug to be fixed in the CloudFormation, somehow.

sj26 avatar Aug 21 '20 03:08 sj26

Otherwise, sounds like a bug to be fixed in the CloudFormation, somehow.

Unfortunately it can't 😢 There is no way to vary CloudFormation templates based on the length of input. It's extremely frustrating.

lox avatar Aug 21 '20 03:08 lox

@lox can we list a few and make them conditional on the presence of the az?

sj26 avatar Aug 21 '20 04:08 sj26

This is how I approached it in an experiment @sj26:

https://github.com/buildkite/elastic-ci-stack-for-aws-ecs/blob/2cf72edd0ba9e40335f58b48320ea94d94e3c464/templates/vpc/template.yaml#L9 https://github.com/buildkite/elastic-ci-stack-for-aws-ecs/blob/2cf72edd0ba9e40335f58b48320ea94d94e3c464/templates/vpc/template.yaml#L56-L66

I've been petitioning AWS for Fn::Size for 3+ years 😢

lox avatar Aug 21 '20 04:08 lox

... oh my

sj26 avatar Aug 21 '20 04:08 sj26

https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/intrinsic-function-reference-select.html ... you should be certain that the index you choose is valid ... How?!

sj26 avatar Aug 21 '20 05:08 sj26

I think I'm going too deep https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/intrinsic-function-reference-transform.html https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/template-macros.html#template-macros-lambda-interface

... I think we'll raise a ticket with AWS and ask them how to do it, which might at least add to "petitioning AWS for Fn::Size"

sj26 avatar Aug 21 '20 05:08 sj26

Tried setting the availability zones to us-west-2a,us-west-2a, the subnet update failed with

CIDR Block must change if Availability Zone is changed and VPC ID is not changed

jmchuster avatar Aug 21 '20 18:08 jmchuster

Sounds like the underlying issue is the rebalancing across AZs, correct? @sj26 i believe there was some internal discussion around using a lambda to update that setting on the ALB? It’s not exposed via Cloudformation.

lox avatar Aug 21 '20 22:08 lox

It sounds the answer is "use a lambda". AWS said:

Having said that, In order to iterate over a list of "AvailabilityZones", and produce a resource for each element in the list, You can use a Lambda function to create a macro to transform your template based on the list of "AvailabilityZones" provided by your customer (preferred), or build your template dynamically - there are several libraries that can help you do this, like troposphere and the AWS CDK.

Or use a lambda to do as @lox suggests, to suspend AZRebalance.

sj26 avatar Aug 24 '20 08:08 sj26

@jmchuster ah, that sounds like an error from an update. Could you try destroying the stack entirely and re-creating it? I think that workaround is the easiest and most promising.

sj26 avatar Aug 24 '20 08:08 sj26

Another alternative workaround would be to run a custom version of the stack, editing the YAML to remove the second subnet.

sj26 avatar Aug 24 '20 08:08 sj26

If none of the workarounds works then someone industrious might like to create a cloudformation macro for our stack like this article:

https://medium.com/pablo-perez/dynamic-resource-generation-of-resources-using-cloudformation-macros-f6baba75d730

It could look at the list of AvailabilityZones and create an appropriate number of appropriately configured Subnet resources.

I'm hesitant to do this because of the added complexity and maintenance burden, but it might be our only option to do it "properly".

sj26 avatar Aug 24 '20 08:08 sj26

Another suggested workaround from AWS:

As a workaround, you could write your template to support a maximum number of Availability zones (2 in this case), and then have each Availability zone passed in as a separate parameter. If a customer wants to use only 1 Availability zone, you could instruct the customer to leave the second one blank. Then, you could use conditional statements to make the second Availability zone optional, so that it's only created if a second Availability zone is passed in.

However, this would result in a pretty ugly template with redundant information. And does not allow for more Availability zones than you have set as the maximum, so if a customer wants to use 3 Availability zones and your template only allows for 2, the 3rd one would not be created.

I don't love it, but I don't hate it either.

sj26 avatar Aug 25 '20 01:08 sj26

I think I prefer the approach I took in that linked CFN template once you mix in things like public vs private subnets and NAT Gateway redundancy.

lox avatar Aug 25 '20 02:08 lox

Yeah I think I do too, @lox.

Not sure when we'll get to this, but we should roll something like that in.

Meanwhile, @jmchuster, did you try re-creating your stack with the same az twice? Does it work?

sj26 avatar Aug 25 '20 02:08 sj26

So i ended up just modifying the template to have a param to choose 1 or 2 AvailabilityZones.

  AvailabilityZoneCount:
    Type: Number
    Description: Optional - Number of subnets that are created from AZs (chosen from !GetAZs or AvailabilityZones parameter if provided)
    Default: 2
    MinValue: 1
    MaxValue: 2

so added the conditions

    CreateSubnet0:
      !And
        - { Condition: CreateVpcResources }
        - !Or
          - !Equals [ 1, !Ref AvailabilityZoneCount ]
          - !Equals [ 2, !Ref AvailabilityZoneCount ]

    CreateSubnet1:
      !And
        - { Condition: CreateVpcResources }
        - !Equals [ 2, !Ref AvailabilityZoneCount ]

swapped them out as the Condition for Subnet[01] and Subnet[01]Routes

and then for the AgentAutoScaleGroup

      VPCZoneIdentifier: !If
        - CreateVpcResources
        - !If
          - CreateSubnet1
          - [ !Ref Subnet0, !Ref Subnet1 ]
          - [ !Ref Subnet0 ]
        - !Ref Subnets

So, presumably, you could add options for 3 or 4 AZs as well and keep on nesting !Ifs following the same pattern, up to AWS's practical limit of 5.

I had looked into maybe setting the SuspendProcesses based on a param, but it looks like AWS's intention is that it be used only as a temporary state, rather than as a property of the auto-scaling group, seeing as they only provide it under UpdatePolicy as something you set and then automatically unset as part of an update. I'm sure there's an argument to be made that this thinking makes sense for Terminate but less so for AZRebalance.

jmchuster avatar Aug 25 '20 05:08 jmchuster

Supporting different AZ (and public/private) configurations is still something we're keen to support at some point. In the interim, the new v5.0.0 release disables AZRebalancing which should address at least one of the reasons why single AZ deployments are sometimes desirable.

yob avatar Oct 27 '20 10:10 yob

Another note here: I believe it might be possible to run in a single AZ (or 3+) by creating the VPC and subnets manually (or via terraform, or your preferred AWS tools), then passing the subnet(s) IDs into the elastic stack via the Subnets stack parameter.

yob avatar Jan 02 '21 04:01 yob

I found an AWS VPC template that takes the same approach as https://github.com/buildkite/elastic-ci-stack-for-aws/issues/700#issuecomment-679705989 and exposes a NumberOfAZs parameter along with a plethora of conditions.

keithduncan avatar Oct 07 '21 08:10 keithduncan

Weird as that approach is, I’ve warmed to it as a pragmatic answer to the problem that doesn’t involve using Lambdas to rewrite the template live 🙈

I reckon support for 2, 3, and 4 AZs to bring parity with that Quick Start would be good.

A nice customer facing benefit in supporting more AZs is to broaden the number of spot pools that are available to the Auto Scaling group. More pools, better chance of spot allocation, less chance of spot interruption.

keithduncan avatar Oct 14 '21 01:10 keithduncan