eksctl
[Bug] maxPodsPerNode doesn't work with EKS 1.22
What were you trying to accomplish?
I'm trying to create a managed node group with a limit on the number of pods per node.
What happened?
The node group is created but maxPodsPerNode is ignored and the nodes use their default value instead (29 in my case, for an m5.large node).
How to reproduce it?
$ cat > nodegroup.yaml << EOF
---
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
managedNodeGroups:
  - name: test-max-pods
    desiredCapacity: 1
    minSize: 1
    maxSize: 5
    maxPodsPerNode: 12
    iam:
      withAddonPolicies:
        appMesh: true
        appMeshPreview: true
        autoScaler: true
        efs: true
metadata:
  name: my-eks-1-22-cluster
  region: ca-central-1
  version: auto
EOF
$ eksctl create nodegroup -f nodegroup.yaml
Logs
Creation log:
2022-04-18 14:46:56 [ℹ] using region ca-central-1
2022-04-18 14:46:57 [ℹ] will use version 1.22 for new nodegroup(s) based on control plane version
2022-04-18 14:46:58 [ℹ] nodegroup "test-max-pods" will use "" [AmazonLinux2/1.22]
2022-04-18 14:46:59 [!] retryable error (Throttling: Rate exceeded
status code: 400, request id: e73d37bd-b940-484c-ad78-6312b8b5e6d3) from cloudformation/DescribeStacks - will retry after delay of 6.20133802s
2022-04-18 14:47:06 [ℹ] 4 existing nodegroup(s) (my-eks-1-22-cluster-a,my-eks-1-22-cluster-b,my-eks-1-22-cluster-c,my-eks-1-22-cluster-d) will be excluded
2022-04-18 14:47:06 [ℹ] 1 nodegroup (test-max-pods) was included (based on the include/exclude rules)
2022-04-18 14:47:06 [ℹ] will create a CloudFormation stack for each of 1 managed nodegroups in cluster "my-eks-1-22-cluster"
2022-04-18 14:47:06 [ℹ]
2 sequential tasks: { fix cluster compatibility, 1 task: { 1 task: { create managed nodegroup "test-max-pods" } }
}
2022-04-18 14:47:06 [ℹ] checking cluster stack for missing resources
2022-04-18 14:47:07 [ℹ] cluster stack has all required resources
2022-04-18 14:47:07 [!] retryable error (Throttling: Rate exceeded
status code: 400, request id: 1aaf9b5d-6bb7-4370-a4f5-c982f58dcc34) from cloudformation/DescribeStacks - will retry after delay of 5.132635276s
2022-04-18 14:47:13 [ℹ] building managed nodegroup stack "eksctl-my-eks-1-22-cluster-nodegroup-test-max-pods"
2022-04-18 14:47:13 [ℹ] deploying stack "eksctl-my-eks-1-22-cluster-nodegroup-test-max-pods"
2022-04-18 14:47:13 [ℹ] waiting for CloudFormation stack "eksctl-my-eks-1-22-cluster-nodegroup-test-max-pods"
2022-04-18 14:47:32 [ℹ] waiting for CloudFormation stack "eksctl-my-eks-1-22-cluster-nodegroup-test-max-pods"
2022-04-18 14:47:51 [ℹ] waiting for CloudFormation stack "eksctl-my-eks-1-22-cluster-nodegroup-test-max-pods"
2022-04-18 14:48:07 [ℹ] waiting for CloudFormation stack "eksctl-my-eks-1-22-cluster-nodegroup-test-max-pods"
2022-04-18 14:48:25 [ℹ] waiting for CloudFormation stack "eksctl-my-eks-1-22-cluster-nodegroup-test-max-pods"
2022-04-18 14:48:41 [ℹ] waiting for CloudFormation stack "eksctl-my-eks-1-22-cluster-nodegroup-test-max-pods"
2022-04-18 14:49:01 [ℹ] waiting for CloudFormation stack "eksctl-my-eks-1-22-cluster-nodegroup-test-max-pods"
2022-04-18 14:49:17 [ℹ] waiting for CloudFormation stack "eksctl-my-eks-1-22-cluster-nodegroup-test-max-pods"
2022-04-18 14:49:33 [ℹ] waiting for CloudFormation stack "eksctl-my-eks-1-22-cluster-nodegroup-test-max-pods"
2022-04-18 14:49:52 [ℹ] waiting for CloudFormation stack "eksctl-my-eks-1-22-cluster-nodegroup-test-max-pods"
2022-04-18 14:50:10 [ℹ] waiting for CloudFormation stack "eksctl-my-eks-1-22-cluster-nodegroup-test-max-pods"
2022-04-18 14:50:28 [ℹ] waiting for CloudFormation stack "eksctl-my-eks-1-22-cluster-nodegroup-test-max-pods"
2022-04-18 14:50:45 [ℹ] waiting for CloudFormation stack "eksctl-my-eks-1-22-cluster-nodegroup-test-max-pods"
2022-04-18 14:51:05 [ℹ] waiting for CloudFormation stack "eksctl-my-eks-1-22-cluster-nodegroup-test-max-pods"
2022-04-18 14:51:05 [ℹ] no tasks
2022-04-18 14:51:05 [✔] created 0 nodegroup(s) in cluster "my-eks-1-22-cluster"
2022-04-18 14:51:05 [ℹ] nodegroup "test-max-pods" has 1 node(s)
2022-04-18 14:51:05 [ℹ] node "ip-10-75-1-120.ca-central-1.compute.internal" is ready
2022-04-18 14:51:05 [ℹ] waiting for at least 1 node(s) to become ready in "test-max-pods"
2022-04-18 14:51:05 [ℹ] nodegroup "test-max-pods" has 1 node(s)
2022-04-18 14:51:05 [ℹ] node "ip-10-75-1-120.ca-central-1.compute.internal" is ready
2022-04-18 14:51:05 [✔] created 1 managed nodegroup(s) in cluster "my-eks-1-22-cluster"
2022-04-18 14:51:06 [ℹ] checking security group configuration for all nodegroups
2022-04-18 14:51:06 [ℹ] all nodegroups have up-to-date cloudformation templates
$ kubectl describe node/ip-10-75-1-120.ca-central-1.compute.internal
<== removed ==>
Capacity:
  attachable-volumes-aws-ebs:  25
  cpu:                         2
  ephemeral-storage:           83873772Ki
  hugepages-1Gi:               0
  hugepages-2Mi:               0
  memory:                      7934440Ki
  pods:                        29   # <-- Should be 12
Allocatable:
  attachable-volumes-aws-ebs:  25
  cpu:                         1930m
  ephemeral-storage:           76224326324
  hugepages-1Gi:               0
  hugepages-2Mi:               0
  memory:                      7244264Ki
  pods:                        29   # <-- Should be 12
System Info:
  Machine ID:                 ec2c7770b7e8fd8b2edd9808f7b986a6
  System UUID:                ec2c7770-b7e8-fd8b-2edd-9808f7b986a6
  Boot ID:                    9bbc3b1f-38e7-424b-ac45-b2d093438d75
  Kernel Version:             5.4.181-99.354.amzn2.x86_64
  OS Image:                   Amazon Linux 2
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  docker://20.10.13
  Kubelet Version:            v1.22.6-eks-7d68063
  Kube-Proxy Version:         v1.22.6-eks-7d68063
<== removed ==>
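(As a quicker check than the full describe output, the pod capacity and allocatable count can be read directly with a jsonpath query; both fields reflect the kubelet's effective max pods setting. The node name is the one from this report:)

$ kubectl get node ip-10-75-1-120.ca-central-1.compute.internal \
    -o jsonpath='{.status.capacity.pods} {.status.allocatable.pods}{"\n"}'
29 29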
Anything else we need to know? Debian 11, using the downloaded 0.93.0 binary.
Versions
$ eksctl info
eksctl version: 0.93.0
kubectl version: v1.23.5
OS: linux
$ eksctl get clusters --name my-eks-1-22-cluster
2022-04-18 14:55:24 [ℹ] eksctl version 0.93.0
2022-04-18 14:55:24 [ℹ] using region ca-central-1
NAME VERSION STATUS CREATED VPC SUBNETS SECURITYGROUPS PROVIDER
my-eks-1-22-cluster 1.22 ACTIVE 2022-04-14T14:19:15Z vpc-xxx subnet-xxx sg-xxx EKS
Hello, can you please also verify that the created launch template's user data contains the max pods setting of 12?
Looks like it. This is the user data:
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary=63096ae1a5df4c7b8a9e6a77290c89ef3f47a3a436b02df68a95bf6a8458
--63096ae1a5df4c7b8a9e6a77290c89ef3f47a3a436b02df68a95bf6a8458
Content-Type: text/x-shellscript; charset="us-ascii"
#!/bin/sh
set -ex
sed -i -E "s/^USE_MAX_PODS=\"\\$\{USE_MAX_PODS:-true}\"/USE_MAX_PODS=false/" /etc/eks/bootstrap.sh
KUBELET_CONFIG=/etc/kubernetes/kubelet/kubelet-config.json
echo "$(jq ".maxPods=12" $KUBELET_CONFIG)" > $KUBELET_CONFIG
--63096ae1a5df4c7b8a9e6a77290c89ef3f47a3a436b02df68a95bf6a8458--
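For context, the two commands in that script set USE_MAX_PODS=false in /etc/eks/bootstrap.sh and write maxPods: 12 into the kubelet config file. For anyone who wants to dump the same user data for their own nodegroup, something along these lines should work, assuming the AWS CLI and base64 are available; the cluster and nodegroup names are the ones from this report, and the query paths are the standard EKS and EC2 API fields:

$ LT_ID=$(aws eks describe-nodegroup \
    --cluster-name my-eks-1-22-cluster --nodegroup-name test-max-pods \
    --query 'nodegroup.launchTemplate.id' --output text)
$ aws ec2 describe-launch-template-versions \
    --launch-template-id "$LT_ID" --versions '$Latest' \
    --query 'LaunchTemplateVersions[0].LaunchTemplateData.UserData' --output text \
    | base64 -d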
Okay cool. That's something at least. :)
We'll take a look, but if we provide the right flags, I'm afraid there is little we can do.
Have you tried testing it with more than 12 pods? It might write 29, but it might not allow more than 12 using the controller, or something something AWS magic? :)
Fair enough, it could be something that changed within EKS.
I did test it already; unfortunately, there was no AWS magic and I ended up with about 27 pods. That's how I noticed the issue.
Thanks!
I initially suspected that the script eksctl uses to set max pods for managed nodegroups no longer works in EKS 1.22, potentially because the bootstrap script in 1.22 AMIs has changed. But after testing, I can confirm that eksctl is still able to set maxPods in the kubelet config but it's not being honoured.
I have tracked it down to EKS supplying --max-pods as an argument to the kubelet. The implementation for maxPodsPerNode in eksctl writes the maxPods field to the kubelet config, but EKS is now passing --max-pods as an argument to the kubelet, overriding the field in the kubelet config.
We can also work around this, but we'll discuss it with the EKS team first, as there were some talks about deprecating max pods earlier.
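(For anyone who wants to confirm this on a node, assuming shell access via SSM or SSH, comparing the kubelet command line with the config file makes the override visible; the command-line flag takes precedence over the config file field:)

$ pgrep -a kubelet | tr ' ' '\n' | grep -- '--max-pods'      # flag injected by EKS
$ jq '.maxPods' /etc/kubernetes/kubelet/kubelet-config.json   # value written by eksctl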
Thanks for the update! In the meantime, we could work around the issue by setting resource requests on our pods, instead of setting a hardcoded number of pods. We have been thinking about it for a while anyway, that was just the push we needed to take the time and do it.
@matthewdepietro tagging you here as per your request 👍🏻
Adding some context on Managed Nodegroups' behavior: if the VPC CNI is running on >= 1.9, Managed Nodegroups attempts to auto-calculate the value of maxPods and sets it on the kubelet, as @cPu1 has found. Managed Nodegroups looks at the various environment variables on the VPC CNI to determine what value to set (it essentially emulates the logic in this calculator script), taking into account PrefixDelegation, Max ENIs, etc.
This logic should only be triggered when the Managed Nodegroup is created without a custom AMI. When looking to override kubelet config, it's recommended to specify an AMI in the launch template passed to CreateNodegroup, since you then get full control over all bootstrap parameters, including max pods.
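(For reference, the 29 seen on the node above is exactly what that calculation yields for an m5.large without prefix delegation. A rough sketch of the arithmetic, using the published ENI limits for that instance type:)

# max_pods = max_enis * (ipv4_addresses_per_eni - 1) + 2
ENIS=3           # m5.large supports up to 3 ENIs
IPS_PER_ENI=10   # 10 IPv4 addresses per ENI on m5.large
echo $(( ENIS * (IPS_PER_ENI - 1) + 2 ))   # prints 29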
We need to come up with a plan to support this as cleanly as possible without hacks.
Timebox: 1-2 days. Document the outcomes here.
Looking into this more, a clean solution to support max pods in eksctl is to resolve the AMI using SSM, passing it as a custom AMI to the MNG API, and use a custom bootstrap script, setting --max-pods to the supplied value, when maxPodsPerNode is set. This approach, however, breaks eksctl upgrade nodegroup and requires eksctl to handle upgrades for nodegroups that have maxPodsPerNode set.
Alternatively, we can use a workaround/hack that modifies the bootstrap.sh script and removes the --max-pods argument passed in the launch template's user data generated by EKS. This is similar to how max-pods was implemented previously and requires less effort than the custom AMI approach.
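(Purely to illustrate the kind of hack meant here, not the eventual implementation: a pre-bootstrap user-data snippet could rewrite /etc/eks/bootstrap.sh before the EKS-generated part of the user data runs it, so that a --max-pods matching maxPodsPerNode ends up after the one EKS injects and is the value kubelet keeps. The sed pattern below is an assumption about the bootstrap script's contents, and the value 12 is the one from this report:)

#!/bin/sh
set -ex
# Assumption: bootstrap.sh assigns KUBELET_EXTRA_ARGS=$2 while parsing --kubelet-extra-args;
# append our own --max-pods to that assignment so it follows the EKS-supplied flag.
sed -i -E 's/^([[:space:]]*KUBELET_EXTRA_ARGS=\$2)$/\1" --max-pods=12"/' /etc/eks/bootstrap.sh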
and use a custom bootstrap script, setting --max-pods to the supplied value, when maxPodsPerNode is set
This is the approach I'd be in favor of. maxPodsPerNode is essentially a property of the kubelet, and the only supported way to modify your kubeletConfiguration is using custom AMIs with your managed nodegroup, so this approach makes sense to me.
I'm not sure I understood the mechanics of the workaround you'd mentioned. I think you meant you could edit the bootstrap script on the AMI itself and remove the max-pods argument that the MNG API tries to set, but I'm not sure I understand how eksctl would set the value of maxPodsPerNode on the kubelet itself. Lmk what I'm missing here.
In the long term, we've been thinking of rewriting the EKS bootstrap script so that kubelet parameter overrides can be specified within your UserData section, and it'll be honored however MNG bootstraps, but it's pending resourcing.
I am inclined towards the workaround approach as well, instead of breaking eksctl upgrade nodegroup ✨
I'm not sure I understood the mechanics of the workaround you'd mentioned. I think you meant you could edit the bootstrap script on the AMI itself and remove the max-pods argument that MNG API tries to set
Correct.
but I'm not sure I understand how eksctl would set the value of maxPodsPerNode on the kubelet itself. Lmk what I'm missing here.
eksctl will set it in the kubelet config, which will then be read by kubelet.
In the long term, we've been thinking of rewriting the EKS bootstrap script so that kubelet parameter overrides can be specified within your UserData section, and it'll be honored however MNG bootstraps (https://github.com/awslabs/amazon-eks-ami/pull/875), but it's pending resourcing.
Thanks for sharing this. I think we'll go with the workaround for now, given that we already have a similar workaround in place and it requires less effort than using a custom AMI with a custom bootstrap script. We'll revisit this approach after the EKS bootstrap script starts accepting kubelet parameter overrides.
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
Just dumping this for reference:
https://github.com/terraform-aws-modules/terraform-aws-eks/blob/master/docs/user_data.md#%EF%B8%8F-caveat https://github.com/awslabs/amazon-eks-ami/issues/873 https://github.com/awslabs/amazon-eks-ami/issues/844
Also, maxPodsPerNode does not seem to work with the latest 1.21 AMIs anymore (https://github.com/awslabs/amazon-eks-ami/compare/v20220824...v20220914).
I am using this in newly created clusters and it's still working: https://github.com/awslabs/amazon-eks-ami/issues/844#issuecomment-1048592041 (tested on 1.21, 1.22 and 1.23).
EDIT: It seems that is working again in 1.21 for me with ami-051aa0d5889741142 (EKS 1.21/us-east-2) as of 2022/10/07.
This was fixed by https://github.com/weaveworks/eksctl/pull/5808. You should not run into this issue with a recent version of eksctl.
@cPu1 are you sure that the fix in https://github.com/weaveworks/eksctl/pull/5808/files#diff-3a316f46904258df0dec1e9c9c1d6a89efb06e0637a5c0a6a930c162b5352498R99 is actually applied? The sed appends it to KUBELET_EXTRA_ARGS, which is only set if --kubelet-extra-args is passed in.
Okay, it does set it, but kubelet is running with --max-pods=110 --max-pods=123, where the latter is the maxPodsPerNode value.
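(Which of the two values kubelet actually applied can be checked from the node object, since the reported pod capacity comes from kubelet's effective max pods setting:)

$ kubectl get nodes -o custom-columns='NAME:.metadata.name,PODS:.status.capacity.pods'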