eksctl
[Help] eksctl + cluster-autoscaler: scale-up from minSize=0
I'm having trouble getting the cluster-autoscaler to work with a node group of minSize: 0. I’ve followed the eksctl docs on the topic and set my labels and taints as tags on the nodeGroup definition:
```yaml
minSize: 0
maxSize: 1
instanceType: r6g.large
labels:
  role: worker
  node-role.exaring.net/workload-type: mem-intensive
taints:
  - key: "node.cilium.io/agent-not-ready"
    value: "true"
    effect: "NoSchedule"
  - key: node-role.exaring.net/workload-type
    value: mem-intensive
    effect: NoSchedule
  - key: arch
    value: arm64
    effect: NoSchedule
availabilityZones:
  - eu-central-1a
tags:
  k8s.io/cluster-autoscaler/node-template/label/role: worker
  k8s.io/cluster-autoscaler/node-template/label/node-role.exaring.net/workload-type: mem-intensive
  k8s.io/cluster-autoscaler/node-template/taint/node.cilium.io/agent-not-ready: "true:NoSchedule"
  k8s.io/cluster-autoscaler/node-template/taint/node-role.exaring.net/workload-type: "mem-intensive:NoSchedule"
  k8s.io/cluster-autoscaler/node-template/taint/arch: "arm64:NoSchedule"
```
Checking the node group in the AWS console, the tags and taints seem to be correctly set on the node group itself. However, the autoscaler logs state:
```
klogx.go:86] Pod xxx is unschedulable
scale_up.go:300] Pod xxx can't be scheduled on yyy, predicate checking error: node(s) didn't match Pod's node affinity/selector; predicateName=NodeAffinity; reasons: node(s) didn't match Pod's node affinity/selector;
scale_up.go:453] No expansion options
```
Has anyone experienced this before? How can I debug this further?
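For reference, the pending pod targets this nodegroup with a nodeSelector and matching tolerations roughly like this (the pod name and image below are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: mem-intensive-app   # placeholder name
spec:
  nodeSelector:
    node-role.exaring.net/workload-type: mem-intensive
  tolerations:
    - key: node-role.exaring.net/workload-type
      operator: Equal
      value: mem-intensive
      effect: NoSchedule
    - key: arch
      operator: Equal
      value: arm64
      effect: NoSchedule
    - key: node.cilium.io/agent-not-ready
      operator: Exists
      effect: NoSchedule
  containers:
    - name: app
      image: example/app   # placeholder image
```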
Hello obitech :wave: Thank you for opening an issue in the eksctl project. The team will review the issue and aim to respond within 1-3 business days. Meanwhile, please read about the Contribution and Code of Conduct guidelines here. You can find more information about eksctl on our website.
Hi! Sorry for the delay in getting back to you. I recommend making the following changes:
```yaml
nodeGroups:
  - name: name
    minSize: 0
    maxSize: 1
    instanceType: r6g.large
    iam:
      withAddonPolicies:
        autoScaler: true # adds the tags required for the Cluster Autoscaler to scale the nodegroup(s)
    taints:
      - key: "node.cilium.io/agent-not-ready"
        value: "true"
        effect: "NoSchedule"
      - key: node-role.exaring.net/workload-type # removed the label that was repeating this
        value: mem-intensive
        effect: NoSchedule
      - key: arch
        value: arm64
        effect: NoSchedule
    propagateASGTags: true # propagates taints into ASG tags
    availabilityZones:
      - eu-central-1a
```
Please let me know if this solved the issue for you :)
I have a PR open to improve the docs around this that should be out soon.
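Note that most nodegroup settings can't be changed in place, so one way to apply this is to recreate the nodegroup from the config file (the file name below is a placeholder, `name` is the nodegroup from the config above):

```shell
eksctl delete nodegroup --config-file=cluster.yaml --include=name --approve
eksctl create nodegroup --config-file=cluster.yaml --include=name
```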
Unfortunately the issue persists 😞 I forgot to mention that it's a managed nodegroup; might that be the issue?
Yes, `propagateASGTags` has slightly different behaviour for managed nodegroups (we have an open ticket to unify the behaviour for managed and unmanaged nodegroups). Currently, with `propagateASGTags` set to `true`, the labels and taints of a managed nodegroup are not converted to nodegroup tags, so they have to be added manually, like what you were doing before:
```yaml
managedNodeGroups:
  - name: name
    minSize: 0
    maxSize: 1
    instanceType: r6g.large
    iam:
      withAddonPolicies:
        autoScaler: true # adds the tags required for the Cluster Autoscaler to scale the nodegroup(s)
    taints:
      - key: "node.cilium.io/agent-not-ready"
        value: "true"
        effect: "NoSchedule"
      - key: node-role.exaring.net/workload-type # removed the label that was repeating this
        value: mem-intensive
        effect: NoSchedule
      - key: arch
        value: arm64
        effect: NoSchedule
    tags:
      k8s.io/cluster-autoscaler/node-template/label/role: worker
      k8s.io/cluster-autoscaler/node-template/label/node-role.exaring.net/workload-type: mem-intensive
      k8s.io/cluster-autoscaler/node-template/taint/node.cilium.io/agent-not-ready: "true:NoSchedule"
      k8s.io/cluster-autoscaler/node-template/taint/node-role.exaring.net/workload-type: "mem-intensive:NoSchedule"
      k8s.io/cluster-autoscaler/node-template/taint/arch: "arm64:NoSchedule"
    propagateASGTags: true # propagates taints into ASG tags
    availabilityZones:
      - eu-central-1a
```
The important part here is to set `propagateASGTags: true` to propagate the nodegroup tags into ASG tags so that the Cluster Autoscaler can pick them up. :)
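For reference, the node-template tag scheme is mechanical, so it can be generated rather than hand-written. A small Python sketch (the helper name is mine, not part of eksctl) that builds the tag keys from labels and taints:

```python
def node_template_tags(labels: dict, taints: list) -> dict:
    """Build the ASG tags the Cluster Autoscaler expects for scale-from-zero.

    Tag format:
      k8s.io/cluster-autoscaler/node-template/label/<label-key>: <value>
      k8s.io/cluster-autoscaler/node-template/taint/<taint-key>: <value>:<effect>
    """
    prefix = "k8s.io/cluster-autoscaler/node-template"
    tags = {f"{prefix}/label/{key}": value for key, value in labels.items()}
    for taint in taints:
        tags[f"{prefix}/taint/{taint['key']}"] = f"{taint['value']}:{taint['effect']}"
    return tags


tags = node_template_tags(
    labels={"node-role.exaring.net/workload-type": "mem-intensive"},
    taints=[{"key": "arch", "value": "arm64", "effect": "NoSchedule"}],
)
print(tags["k8s.io/cluster-autoscaler/node-template/taint/arch"])  # arm64:NoSchedule
```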
We recently updated the docs to reflect all of this in a better way: https://eksctl.io/usage/autoscaling/
This should hopefully solve the issue, please let us know if it did or not!
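You can also double-check on your side that the tags landed on the Auto Scaling Group (the ASG name below is a placeholder):

```shell
aws autoscaling describe-tags \
  --filters "Name=auto-scaling-group,Values=eksctl-mycluster-nodegroup-name" \
  --query "Tags[?contains(Key, 'cluster-autoscaler')]"
```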
That did in fact work! Thank you for your help @nikimanoledaki 😌