
Scaling different arch nodegroups

xeivieni opened this issue 2 years ago • 0 comments

Which component are you using?: cluster-autoscaler

What version of the component are you using?: v1.23.0


kubectl version Output
$ kubectl version
Server Version: version.Info{Major:"1", Minor:"21+", GitVersion:"v1.21.14-eks-18ef993", GitCommit:"ac73613dfd25370c18cbbbc6bfc65449397b35c7", GitTreeState:"clean", BuildDate:"2022-07-06T18:06:50Z", GoVersion:"go1.16.15", Compiler:"gc", Platform:"linux/amd64"}

What environment is this in?: AWS EKS

What did you expect to happen?: Scaling a node group with a different architecture should work

What happened instead?: No new nodes are created in the node group

How to reproduce it (as minimally and precisely as possible): Create two node groups with almost the same configuration, differing only in their architecture.

Terraform example using the cloudposse node group module

amd64
module "ci_cd_node_group" {
  source  = "cloudposse/eks-node-group/aws"
  version = "0.28.1"

  name                  = "cicd"
  cluster_name          = "mycluster"
  subnet_ids            = var.subnet_ids
  create_before_destroy = true

  instance_types = ["m6a.2xlarge"]
  desired_size   = 1 # keep at one : https://github.com/kubernetes/autoscaler/issues/3133
  min_size       = 0
  max_size       = 21

  kubernetes_labels = {
    type = "cicd"
    arch = "amd64"
  }
  kubernetes_taints = [
    {
      key    = "scope"
      value  = "cicd"
      effect = "NO_SCHEDULE"
    }
  ]

  cluster_autoscaler_enabled = true
}
arm64
module "ci_cd_arm64_node_group" {
  source  = "cloudposse/eks-node-group/aws"
  version = "0.28.1"

  name                  = "cicd-arm64"
  cluster_name          = "mycluster"
  subnet_ids            = var.subnet_ids
  create_before_destroy = true

  ami_type       = "AL2_ARM_64"
  instance_types = ["m6g.2xlarge"]
  desired_size   = 1 # keep at one : https://github.com/kubernetes/autoscaler/issues/3133
  min_size       = 0
  max_size       = 4

  kubernetes_labels = {
    type = "cicd"
    arch = "arm64"
  }
  kubernetes_taints = [
    {
      key    = "scope"
      value  = "cicd"
      effect = "NO_SCHEDULE"
    }
  ]

  cluster_autoscaler_enabled = true
}

We also use the following resources to add the tags required for scaling from 0 directly on the ASGs:

ASG tags

data "aws_autoscaling_groups" "cicd_arm64_asgs" {
  filter {
    name   = "tag:eks:nodegroup-name"
    values = ["cicd-arm64-workers-${module.ci_cd_arm64_node_group.eks_node_group_cbd_pet_name}"]
  }
}


# Add tags for autoscaling from 0 directly to the ASG : https://github.com/aws/containers-roadmap/issues/608
resource "aws_autoscaling_group_tag" "autoscale_arm64_from_0_type" {
  for_each               = toset(data.aws_autoscaling_groups.cicd_arm64_asgs.names)
  autoscaling_group_name = each.key
  tag {
    key                 = "k8s.io/cluster-autoscaler/node-template/label/type"
    propagate_at_launch = true
    value               = "cicd"
  }
}

# Add tags for autoscaling from 0 directly to the ASG : https://github.com/aws/containers-roadmap/issues/608
resource "aws_autoscaling_group_tag" "autoscale_arm64_from_0_arch" {
  for_each               = toset(data.aws_autoscaling_groups.cicd_arm64_asgs.names)
  autoscaling_group_name = each.key
  tag {
    key                 = "k8s.io/cluster-autoscaler/node-template/label/arch"
    propagate_at_launch = true
    value               = "arm64"
  }
}

Even with all of this, we have issues scaling the arm64 node group. The pods stay in Pending status and the events keep showing:

Normal   NotTriggerScaleUp  71s                cluster-autoscaler  pod didn't trigger scale-up: , 1 Insufficient vpc.amazonaws.com/pod-eni

And the cluster-autoscaler logs show the following messages:

Pod runner-cx4n1x9y-project-199-concurrent-05nhtv can't be scheduled on eks-cicd-arm64-workers-eagle-38c17983-6c91-3b05-44c3-2da08d3f7c2c, predicate checking error: Insufficient vpc.amazonaws.com/pod-eni; predicateName=NodeResourcesFit; reasons: Insufficient vpc.amazonaws.com/pod-eni; debugInfo=
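
If I read this correctly, the template node that the autoscaler builds from the ASG tags when scaling from zero has no vpc.amazonaws.com/pod-eni capacity, so the simulated scale-up fails the NodeResourcesFit check. I have not tried it, but if the node-template tag scheme also covers extended resources (that is an assumption on my side), extending the same pattern as above would look something like this:

# Hypothetical (untested): advertise the pod-eni extended resource on the
# template node via the node-template "resources" tag prefix, reusing the
# same data source as the label tags above.
resource "aws_autoscaling_group_tag" "autoscale_arm64_from_0_pod_eni" {
  for_each               = toset(data.aws_autoscaling_groups.cicd_arm64_asgs.names)
  autoscaling_group_name = each.key
  tag {
    key                 = "k8s.io/cluster-autoscaler/node-template/resources/vpc.amazonaws.com/pod-eni"
    propagate_at_launch = true
    value               = "9" # hypothetical value; the real per-node pod-eni capacity depends on the instance type
  }
}

I am not sure whether the resources variant of the node-template tag is actually honored for this particular extended resource, which is part of my question.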

For information, we schedule pods onto the nodes using both tolerations and node selectors, as shown below:

nodeSelector:
  arch: arm64
  kubernetes.io/arch: arm64
  type: cicd

tolerations:
  - effect: NoSchedule
    key: kubernetes.io/arch
    operator: Equal
    value: arm64
  - effect: NoSchedule
    key: scope
    operator: Equal
    value: cicd

Is there anything that I am missing here?

xeivieni · Sep 27 '22 16:09