
Document EventBridge rules with resource filter based on ASG tags

stevehipwell opened this issue 4 years ago • 18 comments

Describe the feature When using NTH in queue mode we need to create EventBridge rules to match our resources. The examples don't have any filters, but that isn't going to work in a real-world account. I'd like to see documentation for how the rules can be filtered based on ASG tags, so we can match resources from many ASGs with a single rule.

Is the feature request related to a problem? When using resources as a filter, the rule reaches its maximum size before all ASGs can be monitored.

Describe alternatives you've considered I've created a rule per ASG.

stevehipwell avatar Nov 18 '21 17:11 stevehipwell

@bwagner5 the v2 discussions reminded me about this issue.

stevehipwell avatar Nov 18 '21 17:11 stevehipwell

I do not believe it is currently possible to specify an EventBridge ASG source by tag, only by ASG name. An ASG name prefix may not be optimal depending on how the infra is set up, but it's at least better than individual names: https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-event-patterns-content-based-filtering.html#eb-filtering-prefix-matching
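For example, a pattern like this would match lifecycle events from every ASG sharing a name prefix (a minimal sketch; "my-cluster-" is a placeholder prefix):

{
  "source": ["aws.autoscaling"],
  "detail-type": ["EC2 Instance-terminate Lifecycle Action"],
  "detail": {
    "AutoScalingGroupName": [{ "prefix": "my-cluster-" }]
  }
}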

bwagner5 avatar Nov 18 '21 23:11 bwagner5

Do you know if it's possible to use a wildcard in the name?

stevehipwell avatar Nov 19 '21 06:11 stevehipwell

I don't think so. I believe it's only prefix matching for strings.

bwagner5 avatar Nov 19 '21 15:11 bwagner5

Does the ASG name come through in the event? All our ASGs for a cluster have the same prefix.

stevehipwell avatar Nov 19 '21 15:11 stevehipwell

Yep, the ASG name is there: https://docs.aws.amazon.com/autoscaling/ec2/userguide/cloud-watch-events.html#terminate-lifecycle-action
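An abridged example of that event, per the linked doc (names and IDs are placeholders):

{
  "source": "aws.autoscaling",
  "detail-type": "EC2 Instance-terminate Lifecycle Action",
  "detail": {
    "LifecycleHookName": "aws-node-termination-handler",
    "AutoScalingGroupName": "my-cluster-workers",
    "EC2InstanceId": "i-0123456789abcdef0",
    "LifecycleTransition": "autoscaling:EC2_INSTANCE_TERMINATING"
  }
}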

bwagner5 avatar Nov 19 '21 15:11 bwagner5

What about for spot terminations and rebalance events?

stevehipwell avatar Nov 19 '21 16:11 stevehipwell

@bwagner5 I'm pretty sure I've got the rules wrong for my spot notifications, as I've used the ASG ARN as a resource filter based on the event patterns in the doc you linked above. Does NTH check if the node is in K8s before it evaluates the ASG tag, or does it check the tag first? Basically, should the tag be unique to the cluster if I might have multiple clusters in an account?

stevehipwell avatar Nov 19 '21 16:11 stevehipwell

Spot termination and rebalance events do not have the ASG in them; only ASG events do. Spot and rebalance events work outside of ASG, which is why they don't have ASG context associated with them.
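For comparison, a spot interruption warning carries roughly this detail (instance ID is a placeholder); note there is no ASG field to filter on:

{
  "source": "aws.ec2",
  "detail-type": "EC2 Spot Instance Interruption Warning",
  "detail": {
    "instance-id": "i-0123456789abcdef0",
    "instance-action": "terminate"
  }
}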

This is a good discussion that we need to update docs on!

If you are using ASG w/ capacity-rebalance enabled on the ASG, then you do not need Spot and Rebalance events enabled w/ EventBridge.

ASG will send a termination lifecycle hook for spot interruptions while it's launching a new instance.

ASG will send a termination lifecycle hook for rebalance events after it brings up a new node in the ASG.

If you do not have capacity-rebalance enabled on the ASG, then spot interruptions will cause a termination lifecycle hook as the interruption comes in, not while it's launching the new instance.
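As a rough Terraform sketch of the capacity-rebalance setup (the name, sizes, and launch template reference are placeholders, not a complete ASG definition):

resource "aws_autoscaling_group" "spot" {
  name                = "my-cluster-spot"
  min_size            = 1
  max_size            = 10
  vpc_zone_identifier = var.subnet_ids

  # With this enabled, ASG reacts to spot interruptions and rebalance
  # recommendations itself and fires the termination lifecycle hook,
  # so separate EC2 spot/rebalance EventBridge rules aren't required.
  capacity_rebalance = true

  launch_template {
    id      = aws_launch_template.node.id
    version = "$Latest"
  }
}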

bwagner5 avatar Nov 19 '21 16:11 bwagner5

@bwagner5 could you give me an example of what the infrastructure should look like using capacity-rebalance?

I'd also be interested in the optimal way to configure ASG spot pools and options for EKS; basically lowest-price vs capacity-optimized, and the number of pools to configure per instance type. I've currently left this on the defaults (for the terraform-aws-eks TF module) but am happy to change this to work better with capacity-rebalance.

I also assume that in a scenario using spot termination events, where there are multiple clusters in an account and region (region can be filtered in the rule), the ASG tag needs to be unique to the cluster?

This is my refactored configuration in Terraform, using instance refresh and spot terminations.

resource "aws_autoscaling_lifecycle_hook" "default" {
  count = length(local.asg_ids)

  name                   = "aws-node-termination-handler"
  autoscaling_group_name = local.asg_ids[count.index]
  lifecycle_transition   = "autoscaling:EC2_INSTANCE_TERMINATING"
  heartbeat_timeout      = 600
  default_result         = "CONTINUE"
}

resource "aws_cloudwatch_event_rule" "asg" {
  name = "${var.cluster_name}-asg-termination"

  event_pattern = jsonencode(
    {
      "source" : [
        "aws.autoscaling"
      ]
      "detail-type" : [
        "EC2 Instance-terminate Lifecycle Action"
      ],
      "region" : [var.region]
      "detail" : {
        "AutoScalingGroupName" : [{ "prefix" : var.cluster_name }]
      }
    }
  )

  tags = var.tags
}

resource "aws_cloudwatch_event_target" "asg" {
  target_id = "1"
  rule      = aws_cloudwatch_event_rule.asg.name
  arn       = aws_sqs_queue.default.arn
}

resource "aws_cloudwatch_event_rule" "spot" {
  name = "${var.cluster_name}-spot-termination"

  event_pattern = jsonencode(
    {
      "source" : [
        "aws.ec2"
      ]
      "detail-type" : [
        "EC2 Spot Instance Interruption Warning"
      ]
      "region" : [var.region]
    }
  )

  tags = var.tags
}

resource "aws_cloudwatch_event_target" "spot" {
  target_id = "1"
  rule      = aws_cloudwatch_event_rule.spot.name
  arn       = aws_sqs_queue.default.arn
}

stevehipwell avatar Nov 22 '21 09:11 stevehipwell

> could you give me an example of what the infrastructure should look like using capacity-rebalance?
>
> I'd also be interested in the optimal way to configure ASG spot pools and options for EKS; basically lowest-price vs capacity-optimized, and the number of pools to configure per instance type. I've currently left this on the defaults (for the terraform-aws-eks TF module) but am happy to change this to work better with capacity-rebalance.

If you're using capacity-rebalance on an ASG, then you should never use the lowest-price allocation strategy, always capacity-optimized. Using lowest-price w/ capacity-rebalance can cause a lot of churn.

When using cluster-autoscaler, you'll need each of your ASGs to be a similar instance shape and to increase the number of ASGs you operate with (Karpenter doesn't suffer from this limitation :) ). We recommend providing as many instance pools as you can that match a similar shape. We have a tool that helps: https://github.com/aws/amazon-ec2-instance-selector . Generally, 3-4 pools is pretty good though.
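As a sketch of what that looks like on the ASG, each similarly-shaped type the tool returns (e.g. ec2-instance-selector --vcpus 4 --memory 16 --cpu-architecture x86_64) becomes an override (instance types and the launch template reference are placeholders):

# Inside the aws_autoscaling_group resource
# (replaces a plain top-level launch_template block)
mixed_instances_policy {
  instances_distribution {
    spot_allocation_strategy = "capacity-optimized"
  }

  launch_template {
    launch_template_specification {
      launch_template_id = aws_launch_template.node.id
    }

    # One pool per similarly-shaped instance type (4 vCPU / 16 GiB here)
    override { instance_type = "m5.xlarge" }
    override { instance_type = "m5a.xlarge" }
    override { instance_type = "m5d.xlarge" }
    override { instance_type = "m4.xlarge" }
  }
}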

> I also assume that in a scenario using spot termination events, where there are multiple clusters in an account and region (region can be filtered in the rule), the ASG tag needs to be unique to the cluster?

Yes, that is correct.

bwagner5 avatar Nov 22 '21 15:11 bwagner5

@bwagner5 we have our ASGs in good shape; the question is whether we should be using the default of 10 pools or a pool per instance type available to the ASG?

I think a switch to capacity-rebalance with capacity-optimized placement makes sense for us, especially if it means we can just watch the termination events to deal with spot instances being replaced. I take it that this would allow us to have longer than the default 120 seconds to deal with termination events?

stevehipwell avatar Nov 22 '21 16:11 stevehipwell

@bwagner5 never mind, it looks like we need to set the pools value to 0 for capacity-optimized placement. Not that the docs were much use, as they give a very generic definition of pools and then fail to mention them again other than in circular references back to the original sparse definition.

stevehipwell avatar Nov 22 '21 17:11 stevehipwell

> I take it that this would allow us to have longer than the default 120 seconds to deal with termination events?

@bwagner5 any advice on this? I'm not sure if it's related but we saw a node fail to terminate correctly which then resulted in a CSI driver failure to unmount/mount.

stevehipwell avatar Dec 14 '21 08:12 stevehipwell

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If you want this issue to never become stale, please ask a maintainer to apply the "stalebot-ignore" label.

github-actions[bot] avatar Jan 13 '22 17:01 github-actions[bot]

/not-stale

stevehipwell avatar Jan 13 '22 17:01 stevehipwell

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If you want this issue to never become stale, please ask a maintainer to apply the "stalebot-ignore" label.

github-actions[bot] avatar Feb 14 '22 17:02 github-actions[bot]

/not-stale

stevehipwell avatar Feb 14 '22 17:02 stevehipwell

This has been released in v1.20.0, chart version 0.22.0

cjerad avatar Jun 22 '23 15:06 cjerad