[Bug] Scale from Zero does not work on managed nodegroups even with propagateASGTags enabled
What were you trying to accomplish?
Creating a managed nodegroup to support autoscaling from zero does not appear to work properly, even with the recommended taints, labels, and propagateASGTags: true. Such a nodegroup does not scale up from zero, and Cluster Autoscaler reports an error.
What happened?
This used to work with the same configuration when we were using unmanaged nodegroups. We switched to managed nodegroups across our fleet, and this error was only discovered long after we had made the transition.
W0208 19:41:31.415743 1 orchestrator.go:577] Node group xxx is not ready for scaleup - unhealthy
W0208 19:41:31.420297 1 clusterstate.go:423] Failed to find readiness information for xxx
Further investigation showed that the correct autoscaling taint and label tags are missing from the instances and are not being propagated correctly.
How to reproduce it?
We created a cluster and two nodegroups (one for regular workloads, one with taints and labels to support a GitHub Actions runner workload):
---
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: "propagate-test"
  region: "us-west-2"
  version: "1.29"
  tags:
    VantaOwner: "Release"
    VantaDescription: "EKS cluster for Release"
addons:
  - name: "vpc-cni"
    version: v1.16.2-eksbuild.1
    resolveConflicts: preserve
  - name: "kube-proxy"
    version: "latest"
    resolveConflicts: preserve
  - name: "coredns"
    version: latest
    resolveConflicts: preserve
  - name: "aws-ebs-csi-driver"
    version: latest
    resolveConflicts: preserve
iam:
  withOIDC: true
vpc:
  cidr: "10.5.0.0/16"
  securityGroup: "sg-xyz"
  id: "vpc-abc"
  subnets:
    private:
      us-west-2d:
        id: subnet-def
      us-west-2c:
        id: subnet-ghi
      us-west-2a:
        id: subnet-klm
    public:
      us-west-2c:
        id: subnet-nmo
      us-west-2a:
        id: subnet-pqr
      us-west-2d:
        id: subnet-stu
managedNodeGroups:
  - name: "default-pool"
    amiFamily: "AmazonLinux2"
    availabilityZones: ["us-west-2d", "us-west-2c", "us-west-2a"]
    disableIMDSv1: true
    instanceType: "t3a.xlarge"
    minSize: 3
    maxSize: 10
    asgSuspendProcesses:
      - AZRebalance
    propagateASGTags: true
    privateNetworking: true
    volumeSize: 100
    volumeEncrypted: true
    iam:
      attachPolicyARNs:
        - "arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy"
        - "arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly"
        - "arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy"
      withAddonPolicies:
        imageBuilder: true
        autoScaler: true
        ebs: true
        efs: true
    preBootstrapCommands:
      - "echo 'fs.inotify.max_user_watches=3145728' > /etc/sysctl.d/99-inotifiers.conf"
      - "echo 'fs.inotify.max_user_instances=4096' >> /etc/sysctl.d/99-inotifiers.conf"
      - "sysctl -p /etc/sysctl.d/99-inotifiers.conf"
    ssh:
      allow: false
    updateConfig:
      maxUnavailable: 1
    tags:
      VantaOwner: "Release"
      VantaDescription: "EKS nodes for Release"
  - name: "standard-workers-4c7826"
    amiFamily: "AmazonLinux2"
    availabilityZones: ["us-west-2d", "us-west-2c", "us-west-2a"]
    disableIMDSv1: true
    instanceType: "m6a.2xlarge"
    labels: {"ghRunners":"true"}
    minSize: 0
    maxSize: 10
    asgSuspendProcesses:
      - AZRebalance
    propagateASGTags: true
    privateNetworking: true
    volumeSize: 100
    volumeEncrypted: true
    iam:
      attachPolicyARNs:
        - "arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy"
        - "arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly"
        - "arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy"
      withAddonPolicies:
        imageBuilder: true
        autoScaler: true
        ebs: true
        efs: true
    preBootstrapCommands:
      - "echo 'fs.inotify.max_user_watches=3145728' > /etc/sysctl.d/99-inotifiers.conf"
      - "echo 'fs.inotify.max_user_instances=4096' >> /etc/sysctl.d/99-inotifiers.conf"
      - "sysctl -p /etc/sysctl.d/99-inotifiers.conf"
    ssh:
      allow: false
    updateConfig:
      maxUnavailable: 1
    tags:
      VantaOwner: "Release"
      VantaDescription: "EKS nodes for Release"
    taints: [{"key":"nodeUsage","value":"ghRunners","effect":"NoSchedule"}]
secretsEncryption:
  keyARN: "arn:aws:kms:us-west-2:1234:alias/release/propagate-test"
Here is the describe output for the two Auto Scaling groups:
{
    "AutoScalingGroups": [
        {
            "AutoScalingGroupName": "eks-default-pool-cec6c5bb-9cb5-774a-d4c9-4abb88e0b176",
            "AutoScalingGroupARN": "arn:aws:autoscaling:us-west-2:1234:autoScalingGroup:54895338-8bb8-4dbb-aa86-665b8e5a772b:autoScalingGroupName/eks-default-pool-cec6c5bb-9cb5-774a-d4c9-4abb88e0b176",
            "MixedInstancesPolicy": {
                "LaunchTemplate": {
                    "LaunchTemplateSpecification": {
                        "LaunchTemplateId": "lt-0db2f8889d53466b5",
                        "LaunchTemplateName": "eks-cec6c5bb-9cb5-774a-d4c9-4abb88e0b176",
                        "Version": "1"
                    },
                    "Overrides": [
                        {
                            "InstanceType": "t3a.xlarge"
                        }
                    ]
                },
                "InstancesDistribution": {
                    "OnDemandAllocationStrategy": "prioritized",
                    "OnDemandBaseCapacity": 0,
                    "OnDemandPercentageAboveBaseCapacity": 100,
                    "SpotAllocationStrategy": "lowest-price",
                    "SpotInstancePools": 2
                }
            },
            "MinSize": 3,
            "MaxSize": 10,
            "DesiredCapacity": 3,
            "DefaultCooldown": 300,
            "AvailabilityZones": [
                "us-west-2a",
                "us-west-2c",
                "us-west-2d"
            ],
            "LoadBalancerNames": [],
            "TargetGroupARNs": [],
            "HealthCheckType": "EC2",
            "HealthCheckGracePeriod": 15,
            "Instances": [
                ...
            ],
            "CreatedTime": "2024-02-09T01:24:52.116000+00:00",
            "SuspendedProcesses": [
                {
                    "ProcessName": "AZRebalance",
                    "SuspensionReason": "User suspended at 2024-02-09T01:26:39Z"
                }
            ],
            "VPCZoneIdentifier": "subnet-abc,subnet-def,subnet-ghi",
            "EnabledMetrics": [],
            "Tags": [
                {
                    "ResourceId": "eks-default-pool-cec6c5bb-9cb5-774a-d4c9-4abb88e0b176",
                    "ResourceType": "auto-scaling-group",
                    "Key": "VantaDescription",
                    "Value": "EKS nodes for Release",
                    "PropagateAtLaunch": false
                },
                {
                    "ResourceId": "eks-default-pool-cec6c5bb-9cb5-774a-d4c9-4abb88e0b176",
                    "ResourceType": "auto-scaling-group",
                    "Key": "VantaOwner",
                    "Value": "Release",
                    "PropagateAtLaunch": false
                },
                {
                    "ResourceId": "eks-default-pool-cec6c5bb-9cb5-774a-d4c9-4abb88e0b176",
                    "ResourceType": "auto-scaling-group",
                    "Key": "alpha.eksctl.io/nodegroup-name",
                    "Value": "default-pool",
                    "PropagateAtLaunch": false
                },
                {
                    "ResourceId": "eks-default-pool-cec6c5bb-9cb5-774a-d4c9-4abb88e0b176",
                    "ResourceType": "auto-scaling-group",
                    "Key": "alpha.eksctl.io/nodegroup-type",
                    "Value": "managed",
                    "PropagateAtLaunch": false
                },
                {
                    "ResourceId": "eks-default-pool-cec6c5bb-9cb5-774a-d4c9-4abb88e0b176",
                    "ResourceType": "auto-scaling-group",
                    "Key": "eks:cluster-name",
                    "Value": "propagate-test",
                    "PropagateAtLaunch": true
                },
                {
                    "ResourceId": "eks-default-pool-cec6c5bb-9cb5-774a-d4c9-4abb88e0b176",
                    "ResourceType": "auto-scaling-group",
                    "Key": "eks:nodegroup-name",
                    "Value": "default-pool",
                    "PropagateAtLaunch": true
                },
                {
                    "ResourceId": "eks-default-pool-cec6c5bb-9cb5-774a-d4c9-4abb88e0b176",
                    "ResourceType": "auto-scaling-group",
                    "Key": "k8s.io/cluster-autoscaler/enabled",
                    "Value": "true",
                    "PropagateAtLaunch": true
                },
                {
                    "ResourceId": "eks-default-pool-cec6c5bb-9cb5-774a-d4c9-4abb88e0b176",
                    "ResourceType": "auto-scaling-group",
                    "Key": "k8s.io/cluster-autoscaler/node-template/label/alpha.eksctl.io/cluster-name",
                    "Value": "propagate-test",
                    "PropagateAtLaunch": false
                },
                {
                    "ResourceId": "eks-default-pool-cec6c5bb-9cb5-774a-d4c9-4abb88e0b176",
                    "ResourceType": "auto-scaling-group",
                    "Key": "k8s.io/cluster-autoscaler/node-template/label/alpha.eksctl.io/nodegroup-name",
                    "Value": "default-pool",
                    "PropagateAtLaunch": false
                },
                {
                    "ResourceId": "eks-default-pool-cec6c5bb-9cb5-774a-d4c9-4abb88e0b176",
                    "ResourceType": "auto-scaling-group",
                    "Key": "k8s.io/cluster-autoscaler/propagate-test",
                    "Value": "owned",
                    "PropagateAtLaunch": true
                },
                {
                    "ResourceId": "eks-default-pool-cec6c5bb-9cb5-774a-d4c9-4abb88e0b176",
                    "ResourceType": "auto-scaling-group",
                    "Key": "kubernetes.io/cluster/propagate-test",
                    "Value": "owned",
                    "PropagateAtLaunch": true
                }
            ],
            "TerminationPolicies": [
                "AllocationStrategy",
                "OldestLaunchTemplate",
                "OldestInstance"
            ],
            "NewInstancesProtectedFromScaleIn": false,
            "ServiceLinkedRoleARN": "arn:aws:iam::1234:role/aws-service-role/autoscaling.amazonaws.com/AWSServiceRoleForAutoScaling",
            "CapacityRebalance": true,
            "TrafficSources": []
        },
        {
            "AutoScalingGroupName": "eks-standard-workers-4c7826-08c6c84c-b284-a68f-0305-fa2ec7b4565c",
            "AutoScalingGroupARN": "arn:aws:autoscaling:us-west-2:1234:autoScalingGroup:0943300e-1e6a-4d5e-a357-39364bb96459:autoScalingGroupName/eks-standard-workers-4c7826-08c6c84c-b284-a68f-0305-fa2ec7b4565c",
            "MixedInstancesPolicy": {
                "LaunchTemplate": {
                    "LaunchTemplateSpecification": {
                        "LaunchTemplateId": "lt-07b8ddd0f76a5add3",
                        "LaunchTemplateName": "eks-08c6c84c-b284-a68f-0305-fa2ec7b4565c",
                        "Version": "1"
                    },
                    "Overrides": [
                        {
                            "InstanceType": "m6a.2xlarge"
                        }
                    ]
                },
                "InstancesDistribution": {
                    "OnDemandAllocationStrategy": "prioritized",
                    "OnDemandBaseCapacity": 0,
                    "OnDemandPercentageAboveBaseCapacity": 100,
                    "SpotAllocationStrategy": "lowest-price",
                    "SpotInstancePools": 2
                }
            },
            "MinSize": 0,
            "MaxSize": 10,
            "DesiredCapacity": 0,
            "DefaultCooldown": 300,
            "AvailabilityZones": [
                "us-west-2a",
                "us-west-2c",
                "us-west-2d"
            ],
            "LoadBalancerNames": [],
            "TargetGroupARNs": [],
            "HealthCheckType": "EC2",
            "HealthCheckGracePeriod": 15,
            "Instances": [],
            "CreatedTime": "2024-02-10T01:20:01.550000+00:00",
            "SuspendedProcesses": [
                {
                    "ProcessName": "AZRebalance",
                    "SuspensionReason": "User suspended at 2024-02-10T01:20:50Z"
                }
            ],
            "VPCZoneIdentifier": "subnet-abc,subnet-def,subnet-ghi",
            "EnabledMetrics": [],
            "Tags": [
                {
                    "ResourceId": "eks-standard-workers-4c7826-08c6c84c-b284-a68f-0305-fa2ec7b4565c",
                    "ResourceType": "auto-scaling-group",
                    "Key": "VantaDescription",
                    "Value": "EKS nodes for Release",
                    "PropagateAtLaunch": false
                },
                {
                    "ResourceId": "eks-standard-workers-4c7826-08c6c84c-b284-a68f-0305-fa2ec7b4565c",
                    "ResourceType": "auto-scaling-group",
                    "Key": "VantaOwner",
                    "Value": "Release",
                    "PropagateAtLaunch": false
                },
                {
                    "ResourceId": "eks-standard-workers-4c7826-08c6c84c-b284-a68f-0305-fa2ec7b4565c",
                    "ResourceType": "auto-scaling-group",
                    "Key": "alpha.eksctl.io/nodegroup-name",
                    "Value": "standard-workers-4c7826",
                    "PropagateAtLaunch": false
                },
                {
                    "ResourceId": "eks-standard-workers-4c7826-08c6c84c-b284-a68f-0305-fa2ec7b4565c",
                    "ResourceType": "auto-scaling-group",
                    "Key": "alpha.eksctl.io/nodegroup-type",
                    "Value": "managed",
                    "PropagateAtLaunch": false
                },
                {
                    "ResourceId": "eks-standard-workers-4c7826-08c6c84c-b284-a68f-0305-fa2ec7b4565c",
                    "ResourceType": "auto-scaling-group",
                    "Key": "eks:cluster-name",
                    "Value": "propagate-test",
                    "PropagateAtLaunch": true
                },
                {
                    "ResourceId": "eks-standard-workers-4c7826-08c6c84c-b284-a68f-0305-fa2ec7b4565c",
                    "ResourceType": "auto-scaling-group",
                    "Key": "eks:nodegroup-name",
                    "Value": "standard-workers-4c7826",
                    "PropagateAtLaunch": true
                },
                {
                    "ResourceId": "eks-standard-workers-4c7826-08c6c84c-b284-a68f-0305-fa2ec7b4565c",
                    "ResourceType": "auto-scaling-group",
                    "Key": "k8s.io/cluster-autoscaler/enabled",
                    "Value": "true",
                    "PropagateAtLaunch": true
                },
                {
                    "ResourceId": "eks-standard-workers-4c7826-08c6c84c-b284-a68f-0305-fa2ec7b4565c",
                    "ResourceType": "auto-scaling-group",
                    "Key": "k8s.io/cluster-autoscaler/node-template/label/alpha.eksctl.io/cluster-name",
                    "Value": "propagate-test",
                    "PropagateAtLaunch": false
                },
                {
                    "ResourceId": "eks-standard-workers-4c7826-08c6c84c-b284-a68f-0305-fa2ec7b4565c",
                    "ResourceType": "auto-scaling-group",
                    "Key": "k8s.io/cluster-autoscaler/node-template/label/alpha.eksctl.io/nodegroup-name",
                    "Value": "standard-workers-4c7826",
                    "PropagateAtLaunch": false
                },
                {
                    "ResourceId": "eks-standard-workers-4c7826-08c6c84c-b284-a68f-0305-fa2ec7b4565c",
                    "ResourceType": "auto-scaling-group",
                    "Key": "k8s.io/cluster-autoscaler/node-template/label/ghRunners",
                    "Value": "true",
                    "PropagateAtLaunch": false
                },
                {
                    "ResourceId": "eks-standard-workers-4c7826-08c6c84c-b284-a68f-0305-fa2ec7b4565c",
                    "ResourceType": "auto-scaling-group",
                    "Key": "k8s.io/cluster-autoscaler/node-template/taint/nodeUsage",
                    "Value": "ghRunners",
                    "PropagateAtLaunch": false
                },
                {
                    "ResourceId": "eks-standard-workers-4c7826-08c6c84c-b284-a68f-0305-fa2ec7b4565c",
                    "ResourceType": "auto-scaling-group",
                    "Key": "k8s.io/cluster-autoscaler/propagate-test",
                    "Value": "owned",
                    "PropagateAtLaunch": true
                },
                {
                    "ResourceId": "eks-standard-workers-4c7826-08c6c84c-b284-a68f-0305-fa2ec7b4565c",
                    "ResourceType": "auto-scaling-group",
                    "Key": "kubernetes.io/cluster/propagate-test",
                    "Value": "owned",
                    "PropagateAtLaunch": true
                }
            ],
            "TerminationPolicies": [
                "AllocationStrategy",
                "OldestLaunchTemplate",
                "OldestInstance"
            ],
            "NewInstancesProtectedFromScaleIn": false,
            "ServiceLinkedRoleARN": "arn:aws:iam::1234:role/aws-service-role/autoscaling.amazonaws.com/AWSServiceRoleForAutoScaling",
            "CapacityRebalance": true,
            "TrafficSources": []
        }
    ]
}
Logs
2024-02-09 01:11:15 [] eksctl version 0.170.0
2024-02-09 01:11:15 [] using region us-west-2
2024-02-09 01:11:15 [] setting availability zones to [us-west-2c us-west-2d us-west-2a]
2024-02-09 01:11:15 [] subnets for us-west-2c - public:10.5.0.0/19 private:10.5.96.0/19
2024-02-09 01:11:15 [] subnets for us-west-2d - public:10.5.32.0/19 private:10.5.128.0/19
2024-02-09 01:11:15 [] subnets for us-west-2a - public:10.5.64.0/19 private:10.5.160.0/19
2024-02-09 01:11:15 [] nodegroup "default-pool" will use "" [AmazonLinux2/1.29]
2024-02-09 01:11:15 [] using Kubernetes version 1.29
2024-02-09 01:11:15 [] creating EKS cluster "propagate-test" in "us-west-2" region with managed nodes
2024-02-09 01:11:15 [] 1 nodegroup (default-pool) was included (based on the include/exclude rules)
2024-02-09 01:11:15 [] will create a CloudFormation stack for cluster itself and 0 nodegroup stack(s)
2024-02-09 01:11:15 [] will create a CloudFormation stack for cluster itself and 1 managed nodegroup stack(s)
2024-02-09 01:11:15 [] if you encounter any issues, check CloudFormation console or try 'eksctl utils describe-stacks --region=us-west-2 --cluster=propagate-test'
2024-02-09 01:11:15 [] Kubernetes API endpoint access will use default of {publicAccess=true, privateAccess=false} for cluster "propagate-test" in "us-west-2"
2024-02-09 01:11:15 [] CloudWatch logging will not be enabled for cluster "propagate-test" in "us-west-2"
2024-02-09 01:11:15 [] you can enable it with 'eksctl utils update-cluster-logging --enable-types={SPECIFY-YOUR-LOG-TYPES-HERE (e.g. all)} --region=us-west-2 --cluster=propagate-test'
2024-02-09 01:11:15 []
2 sequential tasks: { create cluster control plane "propagate-test",
2 sequential sub-tasks: {
5 sequential sub-tasks: {
wait for control plane to become ready,
associate IAM OIDC provider,
no tasks,
restart daemonset "kube-system/aws-node",
1 task: { create addons },
},
2 sequential sub-tasks: {
create managed nodegroup "default-pool",
propagate tags to ASG for managed nodegroup "default-pool",
},
}
}
2024-02-09 01:11:15 [] building cluster stack "eksctl-propagate-test-cluster"
2024-02-09 01:11:15 [] deploying stack "eksctl-propagate-test-cluster"
2024-02-09 01:11:45 [] waiting for CloudFormation stack "eksctl-propagate-test-cluster"
2024-02-09 01:12:15 [] waiting for CloudFormation stack "eksctl-propagate-test-cluster"
2024-02-09 01:13:15 [] waiting for CloudFormation stack "eksctl-propagate-test-cluster"
2024-02-09 01:14:15 [] waiting for CloudFormation stack "eksctl-propagate-test-cluster"
2024-02-09 01:15:15 [] waiting for CloudFormation stack "eksctl-propagate-test-cluster"
2024-02-09 01:16:15 [] waiting for CloudFormation stack "eksctl-propagate-test-cluster"
2024-02-09 01:17:15 [] waiting for CloudFormation stack "eksctl-propagate-test-cluster"
2024-02-09 01:18:16 [] waiting for CloudFormation stack "eksctl-propagate-test-cluster"
2024-02-09 01:19:16 [] waiting for CloudFormation stack "eksctl-propagate-test-cluster"
2024-02-09 01:20:16 [] waiting for CloudFormation stack "eksctl-propagate-test-cluster"
2024-02-09 01:21:16 [] waiting for CloudFormation stack "eksctl-propagate-test-cluster"
2024-02-09 01:23:17 [] daemonset "kube-system/aws-node" restarted
2024-02-09 01:23:17 [] creating role using recommended policies
2024-02-09 01:23:18 [] deploying stack "eksctl-propagate-test-addon-vpc-cni"
2024-02-09 01:23:18 [] waiting for CloudFormation stack "eksctl-propagate-test-addon-vpc-cni"
2024-02-09 01:23:48 [] waiting for CloudFormation stack "eksctl-propagate-test-addon-vpc-cni"
2024-02-09 01:23:48 [] creating addon
2024-02-09 01:23:58 [] addon "vpc-cni" active
2024-02-09 01:23:59 [] building managed nodegroup stack "eksctl-propagate-test-nodegroup-default-pool"
2024-02-09 01:23:59 [] deploying stack "eksctl-propagate-test-nodegroup-default-pool"
2024-02-09 01:23:59 [] waiting for CloudFormation stack "eksctl-propagate-test-nodegroup-default-pool"
2024-02-09 01:24:29 [] waiting for CloudFormation stack "eksctl-propagate-test-nodegroup-default-pool"
2024-02-09 01:25:22 [] waiting for CloudFormation stack "eksctl-propagate-test-nodegroup-default-pool"
2024-02-09 01:26:38 [] waiting for CloudFormation stack "eksctl-propagate-test-nodegroup-default-pool"
2024-02-09 01:26:38 [] waiting for the control plane to become ready
2024-02-09 01:26:38 [] saved kubeconfig as "/root/.kube/config"
2024-02-09 01:26:38 [] 1 task: { suspend ASG processes for nodegroup default-pool }
2024-02-09 01:26:39 [] suspended ASG processes [AZRebalance] for default-pool
2024-02-09 01:26:39 [] all EKS cluster resources for "propagate-test" have been created
2024-02-09 01:26:39 [] nodegroup "default-pool" has 3 node(s)
2024-02-09 01:26:39 [] node "ip-10-5-103-234.us-west-2.compute.internal" is ready
2024-02-09 01:26:39 [] node "ip-10-5-143-14.us-west-2.compute.internal" is ready
2024-02-09 01:26:39 [] node "ip-10-5-171-92.us-west-2.compute.internal" is ready
2024-02-09 01:26:39 [] waiting for at least 3 node(s) to become ready in "default-pool"
2024-02-09 01:26:39 [] nodegroup "default-pool" has 3 node(s)
2024-02-09 01:26:39 [] node "ip-10-5-103-234.us-west-2.compute.internal" is ready
2024-02-09 01:26:39 [] node "ip-10-5-143-14.us-west-2.compute.internal" is ready
2024-02-09 01:26:39 [] node "ip-10-5-171-92.us-west-2.compute.internal" is ready
2024-02-09 01:26:39 [] no recommended policies found, proceeding without any IAM
2024-02-09 01:26:39 [] creating addon
2024-02-09 01:27:22 [] addon "kube-proxy" active
2024-02-09 01:27:22 [] no recommended policies found, proceeding without any IAM
2024-02-09 01:27:22 [] creating addon
2024-02-09 01:27:33 [] addon "coredns" active
2024-02-09 01:27:33 [] creating role using recommended policies
2024-02-09 01:27:33 [] deploying stack "eksctl-propagate-test-addon-aws-ebs-csi-driver"
2024-02-09 01:27:33 [] waiting for CloudFormation stack "eksctl-propagate-test-addon-aws-ebs-csi-driver"
2024-02-09 01:28:03 [] waiting for CloudFormation stack "eksctl-propagate-test-addon-aws-ebs-csi-driver"
2024-02-09 01:28:47 [] waiting for CloudFormation stack "eksctl-propagate-test-addon-aws-ebs-csi-driver"
2024-02-09 01:28:47 [] creating addon
2024-02-09 01:29:33 [] addon "aws-ebs-csi-driver" active
2024-02-09 01:29:33 [] kubectl command should work with "/root/.kube/config", try 'kubectl get nodes'
2024-02-09 01:29:33 [] EKS cluster "propagate-test" in "us-west-2" region is ready
2024-02-10 01:19:25 [] nodegroup "default-pool" will use "" [AmazonLinux2/1.29]
2024-02-10 01:19:25 [] nodegroup "standard-workers-4c7826" will use "" [AmazonLinux2/1.29]
2024-02-10 01:19:25 [] 1 existing nodegroup(s) (default-pool) will be excluded
2024-02-10 01:19:25 [] 1 nodegroup (standard-workers-4c7826) was included (based on the include/exclude rules)
2024-02-10 01:19:25 [] will create a CloudFormation stack for each of 1 managed nodegroups in cluster "propagate-test"
2024-02-10 01:19:25 []
2 sequential tasks: { fix cluster compatibility, 1 task: { 1 task: {
2 sequential sub-tasks: {
create managed nodegroup "standard-workers-4c7826",
propagate tags to ASG for managed nodegroup "standard-workers-4c7826",
} } }
}
2024-02-10 01:19:25 [] checking cluster stack for missing resources
2024-02-10 01:19:26 [] cluster stack has all required resources
2024-02-10 01:19:26 [] building managed nodegroup stack "eksctl-propagate-test-nodegroup-standard-workers-4c7826"
2024-02-10 01:19:26 [] deploying stack "eksctl-propagate-test-nodegroup-standard-workers-4c7826"
2024-02-10 01:19:26 [] waiting for CloudFormation stack "eksctl-propagate-test-nodegroup-standard-workers-4c7826"
2024-02-10 01:19:56 [] waiting for CloudFormation stack "eksctl-propagate-test-nodegroup-standard-workers-4c7826"
2024-02-10 01:20:49 [] waiting for CloudFormation stack "eksctl-propagate-test-nodegroup-standard-workers-4c7826"
2024-02-10 01:20:50 [] 1 task: { suspend ASG processes for nodegroup standard-workers-4c7826 }
2024-02-10 01:20:50 [] suspended ASG processes [AZRebalance] for standard-workers-4c7826
2024-02-10 01:20:50 [] created 0 nodegroup(s) in cluster "propagate-test"
2024-02-10 01:20:50 [] created 1 managed nodegroup(s) in cluster "propagate-test"
2024-02-10 01:20:50 [] checking security group configuration for all nodegroups
2024-02-10 01:20:50 [] all nodegroups have up-to-date cloudformation templates
Anything else we need to know?
Notice in particular:
"ResourceId": "eks-standard-workers-4c7826-08c6c84c-b284-a68f-0305-fa2ec7b4565c",
"ResourceType": "auto-scaling-group",
"Key": "k8s.io/cluster-autoscaler/node-template/taint/nodeUsage",
"Value": "ghRunners",
"PropagateAtLaunch": false
Should be "PropagateAtLaunch": true
I'll update this issue as I get more information.
Versions
eksctl version: 0.170.0
kubectl version: v1.26.13
OS: linux
@rwilson-release
The difference you are noticing is expected: because you started using managed nodegroups, PropagateAtLaunch is statically set to false.
However, Cluster Autoscaler (CAS) has no hard requirement for these tags to be propagated to nodes. CAS only requires the tags to be present on the ASG so that it can scale from 0; after launching a node, CAS adds those taints and labels itself. Having PropagateAtLaunch: true should not make any difference to whether CAS applies those taints properly.
On the other hand, for unmanaged nodegroups PropagateAtLaunch is set to true. @TiberiuGC, can you shed some light on this?
Update:
The scale-up by CAS should happen even with PropagateAtLaunch: false.
Here is my ClusterConfig:
---
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: "propagate-test"
  region: "us-west-2"
  version: "1.29"
  tags:
    VantaOwner: "Release"
    VantaDescription: "EKS cluster for Release"
addons:
  - name: "vpc-cni"
  - name: "kube-proxy"
  - name: "coredns"
  - name: "aws-ebs-csi-driver"
iam:
  withOIDC: true
managedNodeGroups:
  - name: "standard-workers-4c7827"
    amiFamily: "AmazonLinux2"
    disableIMDSv1: true
    instanceType: "m6a.2xlarge"
    labels: {"ghRunners":"true"}
    minSize: 0
    maxSize: 10
    propagateASGTags: true
    privateNetworking: true
    iam:
      attachPolicyARNs:
        - "arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy"
        - "arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly"
        - "arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy"
      withAddonPolicies:
        imageBuilder: true
        autoScaler: true
        ebs: true
        efs: true
    tags:
      VantaOwner: "Release"
      VantaDescription: "EKS nodes for Release"
    taints: [{"key":"nodeUsage","value":"ghRunners","effect":"NoSchedule"}]
Test Deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: label-test
  name: label-test
spec:
  replicas: 1
  selector:
    matchLabels:
      app: label-test
  template:
    metadata:
      labels:
        app: label-test
    spec:
      nodeSelector:
        ghRunners: "true"
      tolerations:
        - key: nodeUsage
          value: ghRunners
          effect: NoSchedule
      containers:
        - image: nginx
          name: nginx
Scale up log:
I0422 20:58:20.491115 1 event_sink_logging_wrapper.go:48] Event(v1.ObjectReference{Kind:"Pod", Namespace:"default", Name:"label-test-6bf88dd7c5-kqdcb", UID:"26973148-4d7d-44ef-8242-75e835903ed3", APIVersion:"v1", ResourceVersion:"17802", FieldPath:""}): type: 'Normal' reason: 'TriggeredScaleUp' pod triggered scale-up: [{eks-standard-workers-4c7827-04c783ae-70db-f136-a40e-80fe783199d2 0->1 (max: 10)}]
@rwilson-release
My suggestion is to investigate the CAS logs to identify the reason for Node group xxx is not ready for scaleup - unhealthy. This is not related to eksctl.
Thanks
The difference you are noticing is expected: because you started using managed nodegroups, PropagateAtLaunch is statically set to false.
On the other hand, for unmanaged nodegroups PropagateAtLaunch is set to true.
That code is for tagging the backing ASGs of managed nodegroups, and PropagateAtLaunch is set to false for the ASG resource itself. I believe the reason it's false for managed nodegroups is that the EKS Managed Nodegroups API also propagates tags to the EC2 instances, so eksctl avoids overriding those by propagating tags itself.
For managed nodegroups, EKS does not propagate any tags to the ASG resource; they only apply to the EKS Nodegroup resource and to the EC2 instances launched as part of the nodegroup.
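To make the mechanics concrete, here is a minimal sketch of tagging the ASG resource itself with PropagateAtLaunch: false (assuming aws-sdk-go-v2; this is not eksctl's actual code, and the ASG name and tag values are the ones from this issue):

package main

import (
	"context"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/autoscaling"
	"github.com/aws/aws-sdk-go-v2/service/autoscaling/types"
)

func main() {
	cfg, err := config.LoadDefaultConfig(context.TODO())
	if err != nil {
		panic(err)
	}
	client := autoscaling.NewFromConfig(cfg)

	// Tag the ASG resource itself. PropagateAtLaunch=false means the tag is
	// not copied to EC2 instances launched by the group; for managed
	// nodegroups, EKS propagates its own tags to the instances separately.
	_, err = client.CreateOrUpdateTags(context.TODO(), &autoscaling.CreateOrUpdateTagsInput{
		Tags: []types.Tag{{
			ResourceId:        aws.String("eks-standard-workers-4c7826-08c6c84c-b284-a68f-0305-fa2ec7b4565c"),
			ResourceType:      aws.String("auto-scaling-group"),
			Key:               aws.String("k8s.io/cluster-autoscaler/node-template/taint/nodeUsage"),
			Value:             aws.String("ghRunners"),
			PropagateAtLaunch: aws.Bool(false),
		}},
	})
	if err != nil {
		panic(err)
	}
}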
@rwilson-release,
Notice in particular
"ResourceId": "eks-standard-workers-4c7826-08c6c84c-b284-a68f-0305-fa2ec7b4565c", "ResourceType": "auto-scaling-group", "Key": "k8s.io/cluster-autoscaler/node-template/taint/nodeUsage", "Value": "ghRunners", "PropagateAtLaunch": false
Should be "PropagateAtLaunch": true
You are viewing tags for the ASG resource itself, those tags do not need to be propagated anywhere for scale-from-zero to work.
As @punkwalker noted, this is not an eksctl bug and you might be facing other issues. Can you try upgrading CAS?
Additionally, if the IAM role for Cluster Autoscaler has the eks:DescribeNodegroup permission, CAS can use it to pull labels and taints from the EKS API, eliminating the need for propagateASGTags: true. The set of policies for CAS in eksctl was updated last year, so a role created before that change will be missing the permission.
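For completeness, here is a minimal sketch of that alternative path, reading labels and taints straight from the EKS API once the permission is in place (assuming aws-sdk-go-v2; this is not CAS's actual implementation, and the cluster and nodegroup names are the ones from this issue):

package main

import (
	"context"
	"fmt"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/eks"
)

func main() {
	cfg, err := config.LoadDefaultConfig(context.TODO())
	if err != nil {
		panic(err)
	}
	out, err := eks.NewFromConfig(cfg).DescribeNodegroup(context.TODO(), &eks.DescribeNodegroupInput{
		ClusterName:   aws.String("propagate-test"),
		NodegroupName: aws.String("standard-workers-4c7826"),
	})
	if err != nil {
		panic(err)
	}
	// Unlike the ASG tags, labels and taints come back fully structured
	// here, taint effect included.
	fmt.Println(out.Nodegroup.Labels)
	for _, t := range out.Nodegroup.Taints {
		fmt.Printf("%s=%s:%s\n", aws.ToString(t.Key), aws.ToString(t.Value), t.Effect)
	}
}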
As punkwalker noted, this is not an eksctl bug and you might be facing other issues. Can you try upgrading CAS?
I created a brand-new cluster on 1.29 with CAS version 1.29.0, which I am pretty sure is relatively recent. A new version was released a few days ago, so upgrading is a long shot worth trying.
Additionally, if the IAM role for Cluster Autoscaler has the eks:DescribeNodegroup permission...
I was initially excited about this possibility, since we have clusters of different ages created with different eksctl versions and this seemed like a very simple fix! Unfortunately, we had already updated the policy: we use the Helm charts with the correct policy, and verifying all affected clusters confirmed it was in place, including the eks:DescribeNodegroup permission you mentioned. The policy has been correctly updated in our configs since December 2022.
My suggestion is to investigate the CAS logs to identify the reason for Node group xxx is not ready for scaleup - unhealthy. This is not related to eksctl.
I almost agreed this was not related to eksctl, but all roads lead back to the labels and tags -- and those are created by eksctl, so please bear with me. If you search for the errors I posted, you will find several GitHub issues -- in particular a thread describing almost identical problems with scaling from 0 for GitHub self-hosted runners (our use case): https://github.com/kubernetes/autoscaler/issues/3780#issuecomment-1202601189. None of the fixes in that thread help, though.
Another clue is in the README: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/aws/README.md#auto-discovery-setup (scroll down to the sections labeled "The following is only required if scaling up from 0 nodes", where the examples are listed). In particular:
k8s.io/cluster-autoscaler/node-template/label/foo: bar for labels (which are correct ✅)
k8s.io/cluster-autoscaler/node-template/taint/dedicated: true:NoSchedule for taints (which are slightly off ⚠️)
The actual tag I see is:
"Key": "k8s.io/cluster-autoscaler/node-template/taint/nodeUsage",
"Value": "ghRunners",
Notice it might need to be:
"Value": "ghRunners:NoSchedule",
This is a bit of a reach -- but is it possible the unmanaged nodegroups are tagged correctly while the managed ones are not?
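To illustrate why the effect suffix matters, here is a hypothetical sketch (not CAS's actual code) of parsing a node-template taint tag in the value:effect format the README documents, using the tag key and values from this issue:

package main

import (
	"fmt"
	"strings"
)

const taintTagPrefix = "k8s.io/cluster-autoscaler/node-template/taint/"

type taint struct {
	Key, Value, Effect string
}

// taintFromTag parses an ASG tag of the form
// k8s.io/cluster-autoscaler/node-template/taint/<key> = "<value>:<effect>".
func taintFromTag(tagKey, tagValue string) (taint, error) {
	key := strings.TrimPrefix(tagKey, taintTagPrefix)
	value, effect, found := strings.Cut(tagValue, ":")
	if !found {
		return taint{}, fmt.Errorf("tag value %q is missing the :<effect> suffix", tagValue)
	}
	return taint{Key: key, Value: value, Effect: effect}, nil
}

func main() {
	// The value eksctl currently writes (no effect suffix) cannot be
	// parsed into a complete taint:
	if _, err := taintFromTag(taintTagPrefix+"nodeUsage", "ghRunners"); err != nil {
		fmt.Println(err)
	}

	// The format documented in the CAS README parses cleanly:
	t, _ := taintFromTag(taintTagPrefix+"nodeUsage", "ghRunners:NoSchedule")
	fmt.Printf("%+v\n", t) // {Key:nodeUsage Value:ghRunners Effect:NoSchedule}
}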
@rwilson-release Thank you for pointing that out.
is it possible the unmanaged nodegroups are tagged correctly while the managed ones are not?
Even for unmanaged nodegroups, the taint effect was never added to the ASG tag 🙂. Ref:
for _, taint := range taints {
	addTag(taintsPrefix+taint.Key, taint.Value)
}
I think it has to be changed to something like this:
for _, taint := range taints {
	addTag(taintsPrefix+taint.Key, taint.Value+":"+string(taint.Effect))
}
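With a change along those lines, the tag value for the example in this issue would become ghRunners:NoSchedule, matching the taint tag format shown in the CAS README.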
@cPu1 What do you think?