Scaling different arch nodegroups
Which component are you using?: cluster-autoscaler
What version of the component are you using?: v1.23.0
Component version:
kubectl version output:
$ kubectl version
Server Version: version.Info{Major:"1", Minor:"21+", GitVersion:"v1.21.14-eks-18ef993", GitCommit:"ac73613dfd25370c18cbbbc6bfc65449397b35c7", GitTreeState:"clean", BuildDate:"2022-07-06T18:06:50Z", GoVersion:"go1.16.15", Compiler:"gc", Platform:"linux/amd64"}
What environment is this in?: AWS EKS
What did you expect to happen?: Scaling a node group with a different architecture should work
What happened instead?: No new nodes are created in the node group
How to reproduce it (as minimally and precisely as possible): Create two node groups with almost identical configurations, differing only in their architecture.
Terraform example using the cloudposse node group module:
amd64
module "ci_cd_node_group" {
source = "cloudposse/eks-node-group/aws"
version = "0.28.1"
name = "cicd"
cluster_name = "mycluster"
subnet_ids = var.subnet_ids
create_before_destroy = true
instance_types = ["m6a.2xlarge"]
desired_size = 1 # keep at one : https://github.com/kubernetes/autoscaler/issues/3133
min_size = 0
max_size = 21
kubernetes_labels = {
type = "cicd"
arch = "amd64"
}
kubernetes_taints = [
{
key = "scope"
value = "cicd"
effect = "NO_SCHEDULE"
}
]
cluster_autoscaler_enabled = true
}
arm64
module "ci_cd_arm64_node_group" {
source = "cloudposse/eks-node-group/aws"
version = "0.28.1"
name = "cicd-arm64"
cluster_name = "mycluster"
subnet_ids = var.subnet_ids
create_before_destroy = true
ami_type = "AL2_ARM_64"
instance_types = ["m6g.2xlarge"]
desired_size = 1 # keep at one : https://github.com/kubernetes/autoscaler/issues/3133
min_size = 0
max_size = 4
kubernetes_labels = {
type = "cicd"
arch = "arm64"
}
kubernetes_taints = [
{
key = "scope"
value = "cicd"
effect = "NO_SCHEDULE"
}
]
cluster_autoscaler_enabled = true
}
We also use the following resources to add the ASG tags required for scaling from 0:
ASG tags:
data "aws_autoscaling_groups" "cicd_arm64_asgs" {
filter {
name = "tag:eks:nodegroup-name"
values = ["cicd-arm64-workers-${module.ci_cd_arm64_node_group.eks_node_group_cbd_pet_name}"]
}
}
# Add tags for autoscaling from 0 directly to the ASG : https://github.com/aws/containers-roadmap/issues/608
resource "aws_autoscaling_group_tag" "autoscale_arm64_from_0_type" {
for_each = toset(data.aws_autoscaling_groups.cicd_arm64_asgs.names)
autoscaling_group_name = each.key
tag {
key = "k8s.io/cluster-autoscaler/node-template/label/type"
propagate_at_launch = true
value = "cicd"
}
}
# Add tags for autoscaling from 0 directly to the ASG : https://github.com/aws/containers-roadmap/issues/608
resource "aws_autoscaling_group_tag" "autoscale_arm64_from_0_arch" {
for_each = toset(data.aws_autoscaling_groups.cicd_arm64_asgs.names)
autoscaling_group_name = each.key
tag {
key = "k8s.io/cluster-autoscaler/node-template/label/arch"
propagate_at_launch = true
value = "arm64"
}
}
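Note that these tags only expose the type and arch labels; the scope=cicd taint on the node group is not reflected on the ASG. If the autoscaler's simulated node for the scaled-to-zero group also needs to carry that taint, the cluster-autoscaler AWS documentation describes an equivalent node-template taint tag. A sketch following the same pattern (untested, the resource name autoscale_arm64_from_0_taint is made up for illustration):

# Sketch (unverified): expose the scope=cicd taint to the scale-from-zero
# node template, using the "<value>:<effect>" tag value format from the
# cluster-autoscaler AWS documentation.
resource "aws_autoscaling_group_tag" "autoscale_arm64_from_0_taint" {
  for_each               = toset(data.aws_autoscaling_groups.cicd_arm64_asgs.names)
  autoscaling_group_name = each.key

  tag {
    key                 = "k8s.io/cluster-autoscaler/node-template/taint/scope"
    propagate_at_launch = true
    value               = "cicd:NoSchedule"
  }
}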
Even with all of this, we still have issues scaling the arm64 node group. The pods stay in Pending status and the events keep showing:
Normal NotTriggerScaleUp 71s cluster-autoscaler pod didn't trigger scale-up: , 1 Insufficient vpc.amazonaws.com/pod-eni
And the cluster-autoscaler logs show the following messages:
Pod runner-cx4n1x9y-project-199-concurrent-05nhtv can't be scheduled on eks-cicd-arm64-workers-eagle-38c17983-6c91-3b05-44c3-2da08d3f7c2c, predicate checking error: Insufficient vpc.amazonaws.com/pod-eni; predicateName=NodeResourcesFit; reasons: Insufficient vpc.amazonaws.com/pod-eni; debugInfo=
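One thing I wondered about: since the pending pods request the vpc.amazonaws.com/pod-eni extended resource (VPC CNI security groups for pods), and a scaled-to-zero group has no real node for the autoscaler to inspect, the simulated template node may not know about that resource unless it is declared on the ASG via the node-template resources tag described in the cluster-autoscaler AWS documentation. A sketch of what I think that would look like (unverified; the quantity "10" is a placeholder, not a value taken from our nodes):

# Sketch (unverified): declare the vpc.amazonaws.com/pod-eni extended resource
# on the ASG so the scale-from-zero node template accounts for it.
# The quantity below is a placeholder.
resource "aws_autoscaling_group_tag" "autoscale_arm64_from_0_pod_eni" {
  for_each               = toset(data.aws_autoscaling_groups.cicd_arm64_asgs.names)
  autoscaling_group_name = each.key

  tag {
    key                 = "k8s.io/cluster-autoscaler/node-template/resources/vpc.amazonaws.com/pod-eni"
    propagate_at_launch = true
    value               = "10"
  }
}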
For information, we schedule pods onto the nodes using both tolerations and node selectors, as shown below:
nodeSelector:
  arch: arm64
  kubernetes.io/arch: arm64
  type: cicd
tolerations:
  - effect: NoSchedule
    key: kubernetes.io/arch
    operator: Equal
    value: arm64
  - effect: NoSchedule
    key: scope
    operator: Equal
    value: cicd
Is there anything that I am missing here?