containers-roadmap
[EKS] [ManagedNodeGroup]: Ability to speed up the scale down phase of Managed node update process.
Community Note
- Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
- Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
- If you are interested in working on this issue or have submitted a pull request, please leave a comment
Tell us about your request: Opened this issue on behalf of a customer.
Which service(s) is this request for? EKS
Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard? We have quite a few large managed node groups with 100+ nodes each, and we are trying to upgrade the AMI of those node groups. UpdateConfig seems to help with the scale-up and upgrade phases; however, the scale-down phase seems to be slow, as it removes one node at a time.
The scale down phase decrements the Auto Scaling group maximum size and desired size by one to return to values before the update started.
Wondering if there is a way or an option to speed up the scale-down phase.
Are you currently working around this issue? NA
Additional context: We are trying to upgrade the AMI of the managed node groups to address the Log4j vulnerability.
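For reference, the UpdateConfig mentioned above corresponds to the node group's update settings (`maxUnavailable` / `maxUnavailablePercentage`), which control how many nodes are upgraded in parallel. Below is a minimal Terraform sketch with hypothetical names (the cluster name, IAM role, and subnets are assumptions); note that this speeds up the upgrade phase, but, as described in this issue, the scale-down phase still removes the surge nodes one at a time:

```hcl
resource "aws_eks_node_group" "example" {
  cluster_name    = "my-cluster"            # hypothetical cluster name
  node_group_name = "large-group"
  node_role_arn   = aws_iam_role.node.arn   # assumes an existing node IAM role
  subnet_ids      = var.subnet_ids

  scaling_config {
    desired_size = 100
    max_size     = 120
    min_size     = 100
  }

  # Upgrade up to 25% of the nodes in parallel during a managed node group update.
  update_config {
    max_unavailable_percentage = 25
  }
}
```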
Even for a managed node group with just 1 node, the process scales up to half a dozen nodes and takes a long time to scale down. Considering that a Fargate node normally takes < 90s to create, this behavior is unbelievable. The high number of excess nodes adds cost, too.
Hopefully this can be improved sooner rather than later.
My case is not exactly the same, but for some reason applying a change to a node group with even only 1 node takes too long! I'm using CDK, and whenever I apply a change to the node group, it starts scaling up new nodes and takes 30 minutes minimum, even for a simple change! I sometimes destroy the whole nodeGroup stack and redeploy it.
Managed node groups with 0 nodes (desired size 0), 1, or 2 all take 18 minutes. I can understand groups with pods taking some time to create/destroy, but even a 0-size group taking 18 minutes is too long. I am happy I moved to karpenter.sh for all 60+ pods I use; I cannot imagine the launch template/AMI upgrade process with more nodes, but I need at least 1 managed node group to put the Karpenter pods in.
One of my customers is still facing this issue while upgrading. One node group has around 18 nodes and takes approximately 2 hours to update. Please expedite this feature and provide a solution.
Any news here? I also hit this problem with EKS v1.26.
Me too, EKS v1.27.
Also hit this issue with EKS 1.27 modifying a node group with only 1 node. Has been running for 25+ minutes, and has spun up several new nodes.
Any news here? I tried to upgrade from 1.27 to 1.28 and got stuck. It upgraded only 1 host, didn't create a 2nd node group, and didn't upgrade the 2 currently running c5.9xlarge hosts.
This takes over an hour sometimes for nodegroups with a desired size of 1.
Same issue here. A NodeGroup with 1 node took 50 min to complete the update and spawned 3 more nodes in the process (1 -> 2 -> 3 -> 4 -> 3 -> 2 -> 1). Nodes were Ready for at least 8 min between each spawn, and the NodeGroup was still updating. Can't wait to update the 20-node NodeGroup...
This is so slow that Terraform times out. What is the problem?
I did a bunch of testing around this and found that reducing the number of availability zones for the node group makes a significant difference in how quickly this completes:
- 1 and 2 AZs: single-node updates took ~11 minutes; ten-node updates took ~16 minutes
- 3 AZs: single-node updates took ~16 minutes; ten-node updates took ~30 minutes
- 4 AZs: single-node updates took ~50 minutes; didn't bother testing 10 nodes
Any update on this? Quickly trying out something takes days!
I have an EKS 1.26 managed node group with 1 m5.large instance that takes 20 minutes to replace.
module.eks_nonprod.module.eks.module.eks_managed_node_group["default1"].aws_eks_node_group.this[0]: Modifications complete after 19m16s [id=eks-cluster-24123:default1]
This is a major pain for our team. We manage 70-80 EKS clusters, and minor changes (e.g. a tag update) take a very long time to complete because of the `eks_managed_node_group` being so slow to update.
Things we've tried:
- Terminating all instances manually does not seem to speed things up.
- Manually updating desired capacity and max capacity on the ASG does seem to help a bit, but it's a major inconvenience to have to do this across all the clusters.
- Setting the "maximum unavailable" config on the EKS node group to 100% does not seem to speed things up either.
Even after all the instances with the old AMI have been terminated manually, the modifications continue to be in progress for a while (10-15 minutes).
At this rate we're considering going back to self managed node groups.
Update: Switching to Karpenter, which means only needing one small managed node group for Karpenter itself, helps a lot as well.
I have just 4 EKS clusters, but I'm experiencing this same problem. I read somewhere that this behavior may be related to the Cluster Autoscaler, which cannot scale down properly. I tried some things like setting the cluster-autoscaler.kubernetes.io/enable-ds-eviction=true annotation, but it didn't work.
Any simple update takes at least 40 minutes.
I switched from CA to Karpenter to keep the managed ASG as small as possible. It's also possible to get rid of the ASG completely by running Karpenter on Fargate instances. It's still quite stupid to spend 10-15 minutes updating 3 small nodes, but it's significantly faster than before.
Same here; it took 40+ minutes to migrate from 1.29 to 1.30 on 4 nodes. I really want to go back to GKE.
We're having the same issue here. It takes Terraform more than 70 minutes to upgrade from 1.28 to 1.29, and the process is not transparent at all; there are basically no logs you can refer to during the whole time... it operates at its own pace, like a black box.
I have similar problems, plus one additional one. For some mysterious reason, after I update the managed node group launch template with a new version (i.e., when a new AMI is released), it successfully updates the node group (yeah, all of that 20+ minute process) and then EKS automatically switches the launch template BACK to some ancient launch template version. No explanation given; it just switches back, and I wait ANOTHER half an hour and face bloody config drift in TF and outdated images running in the cluster!
The managed node group implementation is the worst piece of crap there is.
I would recommend that you all switch to Karpenter. Updates there take a couple of minutes and are a breeze.
Another option is to create a new MNG with the upgraded AMI and then migrate the workload to the new MNG. Finally, scale down and delete the old MNG. This approach is faster and allows you to control the migration pace. Not perfect, but it works much faster.
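A minimal Terraform sketch of that blue/green approach, with hypothetical names and `release_version` values (all of them assumptions, not taken from the comment above): the old "blue" group keeps running while the "green" group comes up on the upgraded AMI; you cordon/drain at your own pace, then scale down and remove the blue group.

```hcl
# "Blue" node group pinned to the current AMI release. Scale it down
# (or remove it) once workloads have moved to the green group.
resource "aws_eks_node_group" "blue" {
  cluster_name    = "my-cluster"            # hypothetical cluster name
  node_group_name = "workers-blue"
  node_role_arn   = aws_iam_role.node.arn   # assumes an existing node IAM role
  subnet_ids      = var.subnet_ids
  release_version = "1.29.0-20240202"       # hypothetical current AMI release

  scaling_config {
    desired_size = 18
    max_size     = 20
    min_size     = 0
  }
}

# "Green" node group on the upgraded AMI, created alongside the blue group.
# The pod migration itself (cordon/drain) happens outside Terraform, which is
# what lets you control the pace.
resource "aws_eks_node_group" "green" {
  cluster_name    = "my-cluster"
  node_group_name = "workers-green"
  node_role_arn   = aws_iam_role.node.arn
  subnet_ids      = var.subnet_ids
  release_version = "1.29.3-20240506"       # hypothetical upgraded AMI release

  scaling_config {
    desired_size = 18
    max_size     = 20
    min_size     = 0
  }
}
```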
> Another option is to create a new MNG with the upgraded AMI and then migrate the workload to the new MNG.
This is what I do, too, and it works pretty well.
If you're using Terraform, too, you can set `create_before_destroy` on the node group to have Terraform fully bring up the replacement node group before it destroys the old one. That obviously doesn't work when you're changing something about the resource that wouldn't be considered a replacement (like changing the resource identifier). Additionally, you can use a `terraform_data` resource to track certain parameters of both the node group and the launch template, with a `replace_triggered_by` to automatically replace the node group when anything changes that would normally cause an update-in-place.
Here's an example of what I mean:
resource "aws_launch_template" "this" {
...
lifecycle {
create_before_destroy = true
}
}
# This resource tracks various settings of the eks_node_group that would
# normally cause the node group to be rotated in-place. However, for larger
# clusters, in-place rotations can take a long time and sometimes even fail.
# Instead, we can use `replace_triggered_by` pointing at those settings to
# simply replace the node group entirely (creating the new one before
# destroying the old one, so the pods can be migrated; the cluster-autoscaler
# will handle scaling up the new node group).
resource "terraform_data" "replacement_trigger" {
for_each = var.replace_on_update ? {
# Please keep this map up-to-date with any attributes/arguments (that we
# control) that would cause the node group to rotate. The node group will be
# replaced when any of the values of this input change. The keys are strings,
# to identify the for_each iterators, and the values are the actual value.
# The values must be all the same type, hence the yamlencode(). This converts
# them to strings, while still being relatively easy for a human to parse.
# The values don't REALLY matter here; what matters is that any CHANGE to
# them triggers a change in this resource. Also, additions or subtractions
# to the for_each iterators will not cause the nodes to be replaced, only
# changes to the values of previously-existing instances.
"k8s_version" = yamlencode(var.eks_cluster_info.version)
"launch_template_version" = yamlencode(aws_launch_template.this.default_version)
"launch_template_id" = yamlencode(aws_launch_template.this.id)
"subnet_ids" = yamlencode(var.subnet_ids)
"instance_type" = yamlencode(var.instance_types)
# When `var.replace_on_update` is set to `false`, we cannot provide
# an empty map, since the `replace_triggered_by` field in the nodes
# resource will be pointing to an empty resource; this will cause an error
# during planning. So, we provide a static "dummy" iterator that will never
# change.
} : { "disabled" = true }
input = each.value
}
resource "aws_eks_node_group" "this" {
...
launch_template {
name = aws_launch_template.this.name
version = aws_launch_template.this.default_version
}
lifecycle {
create_before_destroy = true
replace_triggered_by = [
terraform_data.replacement_trigger,
]
}
}
```
The main caveat with creating and destroying in one operation is that your workloads need to be able to handle quickly migrating from one node to another. When AWS brings down the instances from the old node group, it doesn't fully respect instance life cycles. So this approach should generally only be taken (for production workloads) during maintenance windows. You might encounter brief downtime while your workloads migrate from one node group to the other.
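One way to soften that caveat (my own suggestion, not part of the approach above): a PodDisruptionBudget limits how many replicas of a workload can be evicted at once while the old node group drains. A minimal sketch using the Terraform kubernetes provider, assuming a hypothetical `app = "my-app"` label and a provider already configured against the cluster; as noted above, AWS may still tear down instances abruptly, so this only helps with the graceful part of the eviction:

```hcl
# Keep at least 80% of the "my-app" replicas available while nodes from the
# old group are drained during the node group swap.
resource "kubernetes_pod_disruption_budget_v1" "my_app" {
  metadata {
    name      = "my-app"
    namespace = "default"
  }

  spec {
    min_available = "80%"
    selector {
      match_labels = {
        app = "my-app"   # hypothetical workload label
      }
    }
  }
}
```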