
[Bug] maxPodsPerNode doesn't work with EKS 1.22

Open mathieu-lemay opened this issue 3 years ago • 17 comments

What were you trying to accomplish?

I'm trying to create a managed node group with a limit on the number of pods per node.

What happened?

The node group is created but maxPodsPerNode is ignored and the nodes use their default value instead (29 in my case for a m5.large node).
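(For context on that default: the 29 appears to come from the standard ENI-based formula, max pods = ENIs × (IPv4 addresses per ENI − 1) + 2; an m5.large supports 3 ENIs with 10 IPv4 addresses each, giving 3 × (10 − 1) + 2 = 29.)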

How to reproduce it?

$ cat > nodegroup.yaml << EOF
---
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
managedNodeGroups:
  -
    name: test-max-pods
    desiredCapacity: 1
    minSize: 1
    maxSize: 5
    maxPodsPerNode: 12
    iam:
      withAddonPolicies:
        appMesh: true
        appMeshPreview: true
        autoScaler: true
        efs: true
metadata:
  name: my-eks-1-22-cluster
  region: ca-central-1
  version: auto
EOF

$ eksctl create nodegroup -f nodegroup.yaml
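Once the node group is up, the effective limit can be checked with something like:

$ kubectl get node <node-name> -o jsonpath='{.status.allocatable.pods}'
29  # expected 12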

Logs (creation log):

2022-04-18 14:46:56 [ℹ]  using region ca-central-1
2022-04-18 14:46:57 [ℹ]  will use version 1.22 for new nodegroup(s) based on control plane version
2022-04-18 14:46:58 [ℹ]  nodegroup "test-max-pods" will use "" [AmazonLinux2/1.22]
2022-04-18 14:46:59 [!]  retryable error (Throttling: Rate exceeded
        status code: 400, request id: e73d37bd-b940-484c-ad78-6312b8b5e6d3) from cloudformation/DescribeStacks - will retry after delay of 6.20133802s
2022-04-18 14:47:06 [ℹ]  4 existing nodegroup(s) (my-eks-1-22-cluster-a,my-eks-1-22-cluster-b,my-eks-1-22-cluster-c,my-eks-1-22-cluster-d) will be excluded
2022-04-18 14:47:06 [ℹ]  1 nodegroup (test-max-pods) was included (based on the include/exclude rules)
2022-04-18 14:47:06 [ℹ]  will create a CloudFormation stack for each of 1 managed nodegroups in cluster "my-eks-1-22-cluster"
2022-04-18 14:47:06 [ℹ]
2 sequential tasks: { fix cluster compatibility, 1 task: { 1 task: { create managed nodegroup "test-max-pods" } }
}
2022-04-18 14:47:06 [ℹ]  checking cluster stack for missing resources
2022-04-18 14:47:07 [ℹ]  cluster stack has all required resources
2022-04-18 14:47:07 [!]  retryable error (Throttling: Rate exceeded
        status code: 400, request id: 1aaf9b5d-6bb7-4370-a4f5-c982f58dcc34) from cloudformation/DescribeStacks - will retry after delay of 5.132635276s
2022-04-18 14:47:13 [ℹ]  building managed nodegroup stack "eksctl-my-eks-1-22-cluster-nodegroup-test-max-pods"
2022-04-18 14:47:13 [ℹ]  deploying stack "eksctl-my-eks-1-22-cluster-nodegroup-test-max-pods"
2022-04-18 14:47:13 [ℹ]  waiting for CloudFormation stack "eksctl-my-eks-1-22-cluster-nodegroup-test-max-pods"
2022-04-18 14:47:32 [ℹ]  waiting for CloudFormation stack "eksctl-my-eks-1-22-cluster-nodegroup-test-max-pods"
2022-04-18 14:47:51 [ℹ]  waiting for CloudFormation stack "eksctl-my-eks-1-22-cluster-nodegroup-test-max-pods"
2022-04-18 14:48:07 [ℹ]  waiting for CloudFormation stack "eksctl-my-eks-1-22-cluster-nodegroup-test-max-pods"
2022-04-18 14:48:25 [ℹ]  waiting for CloudFormation stack "eksctl-my-eks-1-22-cluster-nodegroup-test-max-pods"
2022-04-18 14:48:41 [ℹ]  waiting for CloudFormation stack "eksctl-my-eks-1-22-cluster-nodegroup-test-max-pods"
2022-04-18 14:49:01 [ℹ]  waiting for CloudFormation stack "eksctl-my-eks-1-22-cluster-nodegroup-test-max-pods"
2022-04-18 14:49:17 [ℹ]  waiting for CloudFormation stack "eksctl-my-eks-1-22-cluster-nodegroup-test-max-pods"
2022-04-18 14:49:33 [ℹ]  waiting for CloudFormation stack "eksctl-my-eks-1-22-cluster-nodegroup-test-max-pods"
2022-04-18 14:49:52 [ℹ]  waiting for CloudFormation stack "eksctl-my-eks-1-22-cluster-nodegroup-test-max-pods"
2022-04-18 14:50:10 [ℹ]  waiting for CloudFormation stack "eksctl-my-eks-1-22-cluster-nodegroup-test-max-pods"
2022-04-18 14:50:28 [ℹ]  waiting for CloudFormation stack "eksctl-my-eks-1-22-cluster-nodegroup-test-max-pods"
2022-04-18 14:50:45 [ℹ]  waiting for CloudFormation stack "eksctl-my-eks-1-22-cluster-nodegroup-test-max-pods"
2022-04-18 14:51:05 [ℹ]  waiting for CloudFormation stack "eksctl-my-eks-1-22-cluster-nodegroup-test-max-pods"
2022-04-18 14:51:05 [ℹ]  no tasks
2022-04-18 14:51:05 [✔]  created 0 nodegroup(s) in cluster "my-eks-1-22-cluster"
2022-04-18 14:51:05 [ℹ]  nodegroup "test-max-pods" has 1 node(s)
2022-04-18 14:51:05 [ℹ]  node "ip-10-75-1-120.ca-central-1.compute.internal" is ready
2022-04-18 14:51:05 [ℹ]  waiting for at least 1 node(s) to become ready in "test-max-pods"
2022-04-18 14:51:05 [ℹ]  nodegroup "test-max-pods" has 1 node(s)
2022-04-18 14:51:05 [ℹ]  node "ip-10-75-1-120.ca-central-1.compute.internal" is ready
2022-04-18 14:51:05 [✔]  created 1 managed nodegroup(s) in cluster "my-eks-1-22-cluster"
2022-04-18 14:51:06 [ℹ]  checking security group configuration for all nodegroups
2022-04-18 14:51:06 [ℹ]  all godegroups have up-to-date cloudformation templates

kubectl describe node/ip-10-75-1-120.ca-central-1.compute.internal

<== removed ==>
Capacity:
  attachable-volumes-aws-ebs:  25
  cpu:                         2
  ephemeral-storage:           83873772Ki
  hugepages-1Gi:               0
  hugepages-2Mi:               0
  memory:                      7934440Ki
  pods:                        29  # <-- Should be 12
Allocatable:
  attachable-volumes-aws-ebs:  25
  cpu:                         1930m
  ephemeral-storage:           76224326324
  hugepages-1Gi:               0
  hugepages-2Mi:               0
  memory:                      7244264Ki
  pods:                        29  # <-- Should be 12
System Info:
  Machine ID:                 ec2c7770b7e8fd8b2edd9808f7b986a6
  System UUID:                ec2c7770-b7e8-fd8b-2edd-9808f7b986a6
  Boot ID:                    9bbc3b1f-38e7-424b-ac45-b2d093438d75
  Kernel Version:             5.4.181-99.354.amzn2.x86_64
  OS Image:                   Amazon Linux 2
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  docker://20.10.13
  Kubelet Version:            v1.22.6-eks-7d68063
  Kube-Proxy Version:         v1.22.6-eks-7d68063
<== removed ==>

Anything else we need to know? Debian 11 with the downloaded 0.93.0 binary.

Versions

$ eksctl info
eksctl version: 0.93.0
kubectl version: v1.23.5
OS: linux
$ eksctl get clusters --name my-eks-1-22-cluster
2022-04-18 14:55:24 [ℹ]  eksctl version 0.93.0
2022-04-18 14:55:24 [ℹ]  using region ca-central-1
NAME                VERSION STATUS CREATED              VPC     SUBNETS    SECURITYGROUPS PROVIDER
my-eks-1-22-cluster 1.22    ACTIVE 2022-04-14T14:19:15Z vpc-xxx subnet-xxx sg-xxx         EKS

mathieu-lemay avatar Apr 18 '22 19:04 mathieu-lemay

Hello, can you please also verify that the created launch template's user data sets max pods to 12?

Skarlso avatar Apr 18 '22 19:04 Skarlso

Looks like it. This is the user data:

MIME-Version: 1.0
Content-Type: multipart/mixed; boundary=63096ae1a5df4c7b8a9e6a77290c89ef3f47a3a436b02df68a95bf6a8458

--63096ae1a5df4c7b8a9e6a77290c89ef3f47a3a436b02df68a95bf6a8458
Content-Type: text/x-shellscript; charset="us-ascii"

#!/bin/sh
set -ex
sed -i -E "s/^USE_MAX_PODS=\"\\$\{USE_MAX_PODS:-true}\"/USE_MAX_PODS=false/" /etc/eks/bootstrap.sh
KUBELET_CONFIG=/etc/kubernetes/kubelet/kubelet-config.json
echo "$(jq ".maxPods=12" $KUBELET_CONFIG)" > $KUBELET_CONFIG
--63096ae1a5df4c7b8a9e6a77290c89ef3f47a3a436b02df68a95bf6a8458--
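So the script disables USE_MAX_PODS in /etc/eks/bootstrap.sh and patches maxPods to 12 in the kubelet config. A spot check on the node itself (paths assumed to be the standard AL2 EKS AMI ones) shows both edits took effect:

$ grep '^USE_MAX_PODS' /etc/eks/bootstrap.sh
USE_MAX_PODS=false
$ jq .maxPods /etc/kubernetes/kubelet/kubelet-config.json
12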

mathieu-lemay avatar Apr 18 '22 20:04 mathieu-lemay

Okay cool. That's something at least. :)

We'll take a look, but if we provide the right flags, I'm afraid there is little we can do.

Have you tried testing it with more than 12 pods? It might write 29, but it might not allow more than 12 using the controller, or something something AWS magic? :)

Skarlso avatar Apr 18 '22 20:04 Skarlso

Fair enough, it could be something that changed within EKS.

I did test it already; unfortunately, there was no AWS magic, and I ended up with about 27 pods. That's how I noticed the issue.

mathieu-lemay avatar Apr 18 '22 20:04 mathieu-lemay

Thanks!

Skarlso avatar Apr 19 '22 05:04 Skarlso

Fair enough, it could be something that changed within EKS.

I did test it already; unfortunately, there was no AWS magic, and I ended up with about 27 pods. That's how I noticed the issue.

I initially suspected that the script eksctl uses to set max pods for managed nodegroups no longer works in EKS 1.22, potentially because the bootstrap script in the 1.22 AMIs has changed. After testing, however, I can confirm that eksctl still sets maxPods in the kubelet config; the value is simply not being honoured.

cPu1 avatar Apr 19 '22 08:04 cPu1

Fair enough, it could be something that changed within EKS. I did test it already; unfortunately, there was no AWS magic, and I ended up with about 27 pods. That's how I noticed the issue.

I initially suspected that the script eksctl uses to set max pods for managed nodegroups no longer works in EKS 1.22, potentially because the bootstrap script in the 1.22 AMIs has changed. After testing, however, I can confirm that eksctl still sets maxPods in the kubelet config; the value is simply not being honoured.

I have tracked it down to EKS supplying --max-pods as an argument to the kubelet. The implementation for maxPodsPerNode in eksctl writes the maxPods field to the kubelet config, but EKS is now passing --max-pods as an argument to the kubelet, overriding the field in the kubelet config.
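On an affected node this is visible in the kubelet's command line; a rough check (the exact output value is assumed here):

$ pgrep -af kubelet | grep -o 'max-pods=[0-9]*'
max-pods=29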

We can work around this, but we'll discuss it with the EKS team first, as there has been some talk of deprecating max pods.

cPu1 avatar Apr 19 '22 08:04 cPu1

I have tracked it down to EKS supplying --max-pods as an argument to the kubelet. The implementation for maxPodsPerNode in eksctl writes the maxPods field to the kubelet config, but EKS is now passing --max-pods as an argument to the kubelet, overriding the field in the kubelet config.

We can work around this, but we'll discuss it with the EKS team first, as there has been some talk of deprecating max pods.

Thanks for the update! In the meantime, we can work around the issue by setting resource requests on our pods instead of relying on a hard-coded pod limit. We had been meaning to do that for a while anyway; this was just the push we needed to take the time and do it.

mathieu-lemay avatar Apr 19 '22 13:04 mathieu-lemay

@matthewdepietro tagging you here as per your request 👍🏻

Himangini avatar May 03 '22 15:05 Himangini

Adding some context on Managed Nodegroups' behavior: if the VPC CNI is running version >= 1.9, Managed Nodegroups attempts to auto-calculate the value of maxPods and sets it on the kubelet, as @cPu1 found. It looks at the various environment variables on the VPC CNI to determine what value to set (essentially emulating the logic in the max-pods calculator script), taking prefix delegation, the maximum number of ENIs, etc. into account.
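For reference, that calculator can be run by hand; a sketch assuming the flags documented in the amazon-eks-ami repo:

$ ./max-pods-calculator.sh --instance-type m5.large --cni-version 1.9.0
29
$ ./max-pods-calculator.sh --instance-type m5.large --cni-version 1.9.0 --cni-prefix-delegation-enabled
110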

This logic should only be triggered when the managed nodegroup is created without a custom AMI. When looking to override the kubelet config, it's recommended to specify an AMI in the launch template passed to CreateNodegroup, since you then get full control over all bootstrap parameters, including max pods.

suket22 avatar May 03 '22 17:05 suket22

We need to come up with a plan to support this as cleanly as possible without hacks.

Timebox: 1-2 days. Document the outcomes here.

Himangini avatar Jun 07 '22 14:06 Himangini

Looking into this more, a clean solution for supporting max pods in eksctl is to resolve the AMI using SSM, pass it as a custom AMI to the MNG API, and use a custom bootstrap script that sets --max-pods to the supplied value when maxPodsPerNode is set. This approach, however, breaks eksctl upgrade nodegroup and requires eksctl to handle upgrades for nodegroups that have maxPodsPerNode set.
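For illustration, resolving the AMI via SSM would look roughly like this (the parameter path is the documented one for EKS-optimized AL2 images; the output AMI ID is made up):

$ aws ssm get-parameter \
    --name /aws/service/eks/optimized-ami/1.22/amazon-linux-2/recommended/image_id \
    --query Parameter.Value --output text
ami-0123456789abcdef0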

cPu1 avatar Jun 15 '22 12:06 cPu1

Looking into this more, a clean solution for supporting max pods in eksctl is to resolve the AMI using SSM, pass it as a custom AMI to the MNG API, and use a custom bootstrap script that sets --max-pods to the supplied value when maxPodsPerNode is set. This approach, however, breaks eksctl upgrade nodegroup and requires eksctl to handle upgrades for nodegroups that have maxPodsPerNode set.

Alternatively, we can use a workaround/hack that modifies the bootstrap.sh script and removes the --max-pods argument passed in the launch template's user data generated by EKS. This is similar to how max-pods was implemented previously and requires less effort than the custom AMI approach.
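The essence of that hack, as a tiny self-contained sketch (the helper name is made up; not eksctl's actual code):

#!/bin/sh
# strip an EKS-injected --max-pods flag out of a kubelet args string,
# so the maxPods value written to kubelet-config.json wins instead
strip_max_pods() {
  echo "$1" | sed -E 's/ *--max-pods=[0-9]+//g'
}
strip_max_pods "--node-labels=foo --max-pods=29"  # prints: --node-labels=foo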

cPu1 avatar Jun 22 '22 10:06 cPu1

and use a custom bootstrap script that sets --max-pods to the supplied value when maxPodsPerNode is set

This is the approach I'd be in favor of. maxPodsPerNode is essentially a property of the kubelet, and the only supported way to modify your kubeletConfiguration is using custom AMIs with your managed nodegroup, so this approach makes sense to me.

I'm not sure I understood the mechanics of the workaround you mentioned. I think you meant you could edit the bootstrap script on the AMI itself and remove the max-pods argument that the MNG API tries to set, but I'm not sure I understand how eksctl would set the value of maxPodsPerNode on the kubelet itself. Lmk what I'm missing here.

In the long term, we've been thinking of rewriting the EKS bootstrap script so that kubelet parameter overrides can be specified within your UserData section, and it'll be honored by however MNG bootstraps, but it's pending resourcing.

suket22 avatar Jul 05 '22 21:07 suket22

Looking into this more, a clean solution for supporting max pods in eksctl is to resolve the AMI using SSM, pass it as a custom AMI to the MNG API, and use a custom bootstrap script that sets --max-pods to the supplied value when maxPodsPerNode is set. This approach, however, breaks eksctl upgrade nodegroup and requires eksctl to handle upgrades for nodegroups that have maxPodsPerNode set.

Alternatively, we can use a workaround/hack that modifies the bootstrap.sh script and removes the --max-pods argument passed in the launch template's user data generated by EKS. This is similar to how max-pods was implemented previously and requires less effort than the custom AMI approach.

I am inclined toward this approach as well, rather than breaking eksctl upgrade nodegroup.

Himangini avatar Jul 06 '22 12:07 Himangini

I'm not sure I understood the mechanics of the workaround you mentioned. I think you meant you could edit the bootstrap script on the AMI itself and remove the max-pods argument that the MNG API tries to set

Correct.

but I'm not sure I understand how eksctl would set the value of maxPodsPerNode on the kubelet itself. Lmk what I'm missing here.

eksctl will set it in the kubelet config, which will then be read by kubelet.

In the long term, we've been thinking of rewriting the EKS bootstrap script so that kubelet parameter overrides can be specified within your UserData section, and it'll be honored by however MNG bootstraps (https://github.com/awslabs/amazon-eks-ami/pull/875), but it's pending resourcing.

Thanks for sharing this. I think we'll go with the workaround for now, given that we already have a similar workaround in place and it requires less effort than using a custom AMI with a custom bootstrap script. We'll revisit this approach after the EKS bootstrap script starts accepting kubelet parameter overrides.

cPu1 avatar Jul 06 '22 14:07 cPu1

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] avatar Aug 06 '22 02:08 github-actions[bot]

Just dumping this for reference:

https://github.com/terraform-aws-modules/terraform-aws-eks/blob/master/docs/user_data.md#%EF%B8%8F-caveat https://github.com/awslabs/amazon-eks-ami/issues/873 https://github.com/awslabs/amazon-eks-ami/issues/844

Also, maxPodsPerNode does not seem to work with the latest 1.21 AMIs anymore (https://github.com/awslabs/amazon-eks-ami/compare/v20220824...v20220914).

I am using this in newly created clusters and it's still working: https://github.com/awslabs/amazon-eks-ami/issues/844#issuecomment-1048592041 (tested on 1.21, 1.22, and 1.23).

EDIT: It seems that it is working again in 1.21 for me with ami-051aa0d5889741142 (EKS 1.21/us-east-2) as of 2022-10-07.

bryanasdev000 avatar Sep 27 '22 16:09 bryanasdev000

Just dumping this for reference:

https://github.com/terraform-aws-modules/terraform-aws-eks/blob/master/docs/user_data.md#%EF%B8%8F-caveat awslabs/amazon-eks-ami#873 awslabs/amazon-eks-ami#844

Also, maxPodsPerNode does not seem to work with the latest 1.21 AMIs anymore (awslabs/amazon-eks-ami@v20220824...v20220914).

I am using this in newly created clusters and it's still working: awslabs/amazon-eks-ami#844 (comment) (tested on 1.21, 1.22, and 1.23).

EDIT: It seems that it is working again in 1.21 for me with ami-051aa0d5889741142 (EKS 1.21/us-east-2) as of 2022-10-07.

This was fixed by https://github.com/weaveworks/eksctl/pull/5808. You should not run into this issue with a recent version of eksctl.

cPu1 avatar Dec 01 '22 11:12 cPu1

@cPu1 are you sure the fix in https://github.com/weaveworks/eksctl/pull/5808/files#diff-3a316f46904258df0dec1e9c9c1d6a89efb06e0637a5c0a6a930c162b5352498R99 is actually invoked? The sed appends it to KUBELET_EXTRA_ARGS, which is only used if --kubelet-extra-args is passed in.

matti avatar Dec 10 '22 20:12 matti

Okay, it does set it, but kubelet is running with --max-pods=110 --max-pods=123, where the latter is the maxPodsPerNode value.
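If kubelet parses repeated scalar flags last-one-wins (which it should, though worth verifying), 123 would be the effective value; one way to check the merged running config is the kubelet's configz endpoint via the API server proxy:

$ kubectl get --raw "/api/v1/nodes/<node-name>/proxy/configz" | jq .kubeletconfig.maxPods
123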

matti avatar Dec 10 '22 20:12 matti