
[EKS] [ManagedNodeGroup]: Ability to speed up the scale down phase of Managed node update process.

Open nsb413 opened this issue 3 years ago • 23 comments

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Tell us about your request Opened this issue on behalf of a customer.

Which service(s) is this request for? EKS

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard? We have quite a few large managed node groups with 100+ nodes each, and we are trying to upgrade the AMI of these node groups. UpdateConfig seems to help with the scale-up and upgrade phases. However, the scale-down phase is slow because it removes one node at a time.

The scale-down phase decrements the Auto Scaling group maximum size and desired size by one to return to the values they had before the update started.

Wondering if there is a way or an option to speed up the scale-down phase.
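
For reference, the UpdateConfig mentioned above corresponds to the update_config block on the node group (shown here as a Terraform sketch; all names and the percentage are illustrative). It parallelizes the upgrade phase, but there does not appear to be an equivalent setting for the scale-down phase:

resource "aws_eks_node_group" "example" {
  cluster_name    = "my-cluster"
  node_group_name = "large-group"
  node_role_arn   = "arn:aws:iam::111122223333:role/eks-node-role"
  subnet_ids      = ["subnet-aaaa1111", "subnet-bbbb2222"]

  scaling_config {
    desired_size = 100
    max_size     = 110
    min_size     = 100
  }

  # Allows up to 25% of nodes to be upgraded in parallel during the upgrade
  # phase; it does not change how the one-node-at-a-time scale-down behaves.
  update_config {
    max_unavailable_percentage = 25
  }
}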

Are you currently working around this issue? NA

Additional context We are trying to upgrade the AMI of the managed node groups to address the Log4j vulnerability.

nsb413 avatar Jan 04 '22 18:01 nsb413

Even for a managed node group with just 1 node, the process scales up to half a dozen nodes and takes a long time to scale down. Considering that a Fargate node normally takes < 90s to create, this behavior is unbelievable. The excessive number of extra nodes costs money, too.

Hopefully this can be improved sooner rather than later.

tanvp112 avatar Jun 19 '22 09:06 tanvp112

My case is not exactly the same, but for some reason applying a change to a node group with even only 1 node takes too long! In my case, I'm using CDK, and whenever I apply a change to the node group, it starts scaling up new nodes and takes a minimum of 30 minutes, even for a simple change! I sometimes destroy the whole nodeGroup stack and redeploy it.

rshad avatar Sep 22 '22 17:09 rshad

Managed node groups with 0 nodes (desired size 0), 1, or 2 all take 18 minutes. I can understand groups with pods taking some time to create/destroy, but even with a size of 0, 18 minutes is too long. I'm glad I moved to karpenter.sh for all 60+ pods I use; I can't imagine the launch template/AMI upgrade process with more nodes, but I still need at least 1 managed node group to put the Karpenter pods in.

gorkemgoknar avatar Feb 23 '23 08:02 gorkemgoknar

One of my customers is still facing this issue while upgrading. One nodegroup has around 18 nodes and takes approximately 2 hours to update. Please expedite this feature and provide a solution.

singhnix avatar Aug 03 '23 05:08 singhnix

Any news here? I also hit this problem with EKS v1.26.

mkuendig avatar Aug 25 '23 17:08 mkuendig

Me too, EKS v1.27.

vangie avatar Sep 06 '23 16:09 vangie

Also hit this issue with EKS 1.27 modifying a node group with only 1 node. Has been running for 25+ minutes, and has spun up several new nodes.

BrianLovelace128 avatar Sep 25 '23 23:09 BrianLovelace128

Any news here? I tried to upgrade from 1.27 to 1.28 and got stuck: it upgraded only 1 host, didn't create a 2nd node group, and didn't upgrade the 2 currently running c5.9xlarge hosts.

alter avatar Sep 28 '23 07:09 alter

This takes over an hour sometimes for nodegroups with a desired size of 1.

tryan225 avatar Oct 31 '23 18:10 tryan225

Same issue here. A NodeGroup with 1 node took 50 min to complete the update, and spawned 3 more nodes in the process (1 -> 2 -> 3 -> 4 -> 3 -> 2 -> 1). Nodes were Ready for at least 8 min between each spawn, and the NodeGroup was still updating. Can't wait to update the 20-node NodeGroup...

NeodymiumFerBore avatar Nov 10 '23 21:11 NeodymiumFerBore

This is so slow that Terraform times out. What is the problem?

danpilch avatar Dec 14 '23 14:12 danpilch

I did a bunch of testing around this and found that reducing the number of availability zones for the nodegroup makes a significant difference in how quickly this completes:

  • 1 or 2 AZs: single-node updates took ~11 minutes; ten-node updates took ~16 minutes
  • 3 AZs: single-node updates took ~16 minutes; ten-node updates took ~30 minutes
  • 4 AZs: single-node updates took ~50 minutes; didn't bother with 10 nodes
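
For anyone who wants to try the same thing: limiting a node group to fewer AZs is just a matter of passing it subnets from only those zones. A rough Terraform sketch (names, subnet IDs, and sizes are placeholders):

resource "aws_eks_node_group" "workers" {
  cluster_name    = "my-cluster"
  node_group_name = "workers-2az"
  node_role_arn   = "arn:aws:iam::111122223333:role/eks-node-role"

  # Only subnets from two AZs; the node group and its Auto Scaling group are
  # then confined to those zones during updates.
  subnet_ids = [
    "subnet-aaaa1111", # us-east-1a
    "subnet-bbbb2222", # us-east-1b
  ]

  scaling_config {
    desired_size = 10
    max_size     = 12
    min_size     = 1
  }
}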

nwsparks avatar Dec 14 '23 15:12 nwsparks

Any update on this? Quickly trying out something takes days!

amalic avatar Feb 11 '24 09:02 amalic

I have an EKS 1.26 managed node group with 1 m5.large instance that takes 20 minutes to replace.

module.eks_nonprod.module.eks.module.eks_managed_node_group["default1"].aws_eks_node_group.this[0]: Modifications complete after 19m16s [id=eks-cluster-24123:default1]

vuskeedoo avatar Apr 12 '24 03:04 vuskeedoo

This is a major pain for our team. We manage 70-80 EKS clusters and minor changes (e.g. a tag update) take a very long time to complete because of the eks_managed_node_group being so slow to update.

Things we've tried:

  • Terminating all instances manually does not seem to speed things up.
  • Manually updating desired capacity and max capacity on the ASG does seem to help a bit, but it's a major inconvenience to have to do this across all the clusters.
  • Setting the "maximum unavailable" config on the EKS node group to 100% does not seem to speed things up either.

Even after all the instances with the old AMI have been terminated manually, the modifications continue to be in progress for a while (10-15 minutes).

At this rate we're considering going back to self managed node groups.

Update: Switching to Karpenter, which means only needing one small managed node group for Karpenter itself, helps a lot as well.

itsobgraph avatar May 14 '24 19:05 itsobgraph

I have just 4 EKS clusters, but I'm experiencing this same problem. I read somewhere that this behavior may be related to the Cluster Autoscaler, which cannot scale down properly. I tried a few things, like setting the cluster-autoscaler.kubernetes.io/enable-ds-eviction=true annotation, but it didn't work.

Any simple update takes at least 40 minutes.

smthiago avatar May 15 '24 22:05 smthiago

I switched from CA to Karpenter to keep the managed ASG as small as possible. It's also possible to get rid of the ASG completely by running Karpenter on Fargate instances. It's still quite stupid to spend 10-15 minutes updating 3 small nodes, but it's significantly faster than before.

makarov-roman avatar May 16 '24 08:05 makarov-roman

Same here: it took 40+ minutes to migrate from 1.29 to 1.30 on 4 nodes. I really want to go back to GKE.

figaro-smartotum avatar Jul 03 '24 08:07 figaro-smartotum

We're having the same issue here. It takes Terraform 70+ minutes to upgrade from 1.28 to 1.29, and the process is not transparent at all; there are basically no logs you can refer to during the whole time... it operates at its own pace, like a black box.

leoweiyu avatar Jul 08 '24 01:07 leoweiyu

I have similar problems, plus one additional issue. For some mythical reason, after I update the managed node group launch template to a new version (i.e., when a new AMI is released), it successfully updates the node group (yeah, all of that 20+ minute process) and then EKS automatically switches the launch template BACK to some ancient launch template version. No explanation is given; it just switches back, I wait ANOTHER half an hour, and I'm left with bloody config drift in TF and outdated images running in the cluster!

The managed node group implementation is the worst piece of crap there is.

siimaus avatar Aug 16 '24 07:08 siimaus

I would recommend that you all switch to Karpenter. Updates there take a couple of minutes and are a breeze.

mkuendig avatar Aug 16 '24 08:08 mkuendig

Another option is to create a new MNG with the upgraded AMI and then migrate the workloads to the new MNG. Finally, scale down and delete the old MNG. This way is faster and allows you to control the migration pace. Not perfect, but it works much faster.
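
A rough sketch of what the new ("green") node group might look like in Terraform, assuming an existing cluster and node role (all names, subnet IDs, and the AMI release version are placeholders):

resource "aws_eks_node_group" "green" {
  cluster_name    = "my-cluster"
  node_group_name = "workers-green"
  node_role_arn   = "arn:aws:iam::111122223333:role/eks-node-role"
  subnet_ids      = ["subnet-aaaa1111", "subnet-bbbb2222"]

  # Pin the upgraded AMI here; the old ("blue") node group keeps its previous
  # release_version until the workloads have been drained onto this one.
  release_version = "1.29.3-20240506"

  scaling_config {
    desired_size = 10
    max_size     = 15
    min_size     = 0
  }
}

# Once pods are cordoned and drained off the old node group, scale it down
# and delete it in a follow-up change.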

gohmc avatar Aug 16 '24 10:08 gohmc

Another option is to create a new MNG with the upgraded AMI and then migrate the workloads to the new MNG.

This is what I do, too, and it works pretty well.

If you're using Terraform, too, you can set create_before_destroy on the node group to have Terraform fully bring up the replacement node group before it destroys the old one. That obviously doesn't work when you're changing something about the resource that wouldn't be considered a replacement (like changing the resource identifier).

Additionally, you can use a terraform_data resource to track certain parameters of both the node group and the launch template, with a replace_triggered_by to automatically replace the node group when anything changes that would normally cause an update-in-place.

Here's an example of what I mean:

resource "aws_launch_template" "this" {
  ...

  lifecycle {
    create_before_destroy = true
  }
}

# This resource tracks various settings of the eks_node_group that would
# normally cause the node group to be rotated in-place. However, for larger
# clusters, in-place rotations can take a long time and sometimes even fail.
# Instead, we can use `replace_triggered_by` pointing at those settings to
# simply replace the node group entirely (creating the new one before
# destroying the old one, so the pods can be migrated; the cluster-autoscaler
# will handle scaling up the new node group).
resource "terraform_data" "replacement_trigger" {
  for_each = var.replace_on_update ? {
    # Please keep this map up-to-date with any attributes/arguments (that we
    # control) that would cause the node group to rotate. The node group will be
    # replaced when any of the values of this input change. The keys are strings,
    # to identify the for_each iterators, and the values are the actual value.
    # The values must be all the same type, hence the yamlencode(). This converts
    # them to strings, while still being relatively easy for a human to parse.
    # The values don't REALLY matter here; what matters is that any CHANGE to
    # them triggers a change in this resource. Also, additions or subtractions
    # to the for_each iterators will not cause the nodes to be replaced, only
    # changes to the values of previously-existing instances.
    "k8s_version"             = yamlencode(var.eks_cluster_info.version)
    "launch_template_version" = yamlencode(aws_launch_template.this.default_version)
    "launch_template_id"      = yamlencode(aws_launch_template.this.id)
    "subnet_ids"              = yamlencode(var.subnet_ids)
    "instance_type"           = yamlencode(var.instance_types)

    # When `var.replace_on_update` is set to `false`, we cannot provide
    # an empty map, since the `replace_triggered_by` field in the nodes
    # resource will be pointing to an empty resource; this will cause an error
    # during planning. So, we provide a static "dummy" iterator that will never
    # change.
  } : { "disabled" = true }

  input = each.value
}

resource "aws_eks_node_group" "this" {
  ...

  launch_template {
    name    = aws_launch_template.this.name
    version = aws_launch_template.this.default_version
  }

  lifecycle {
    create_before_destroy = true

    replace_triggered_by = [
      terraform_data.replacement_trigger,
    ]
  }
}

The main caveat with creating and destroying in one operation is that your workloads need to be able to handle quickly migrating from one node to another. When AWS brings down the instances from the old node group, it doesn't fully respect instance life cycles. So this approach should generally only be taken (for production workloads) during maintenance windows. You might encounter brief downtime while your workloads migrate from one node group to the other.

zhimsel avatar Aug 17 '24 12:08 zhimsel