
Use eni-max-pods mapping file for setting of the max pods value

Open ubasche-nex opened this issue 2 years ago • 11 comments

What I'd like: It is possible to set a static, customised max_pods value, but that doesn't cater for clusters whose node groups use instance types that need different max_pods values. It should be possible to set max_pods dynamically based on the values in the eni-max-pods mapping file.

There should be a configuration so that the max_pods value for an instance gets derived from https://github.com/awslabs/amazon-eks-ami/blob/master/files/eni-max-pods.txt according to the instance type.
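
For reference, that mapping file is just a whitespace-separated list of instance type and maximum pod count. A few illustrative entries (check the linked file for the current numbers):

c5.large 29
c5.xlarge 58
c5.4xlarge 234
m5.xlarge 58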

ubasche-nex avatar Sep 13 '23 10:09 ubasche-nex

There should be a configuration so that the max_pods value for an instance gets derived from https://github.com/awslabs/amazon-eks-ami/blob/master/files/eni-max-pods.txt according to the instance type.

Hi @ubasche-nex. By default, max_pods is already derived from those same eni-max-pods values based on the instance type.

If needed, you can set the settings.kubernetes.max-pods setting to meet your specific requirements. But unless there is a specific reason to override it, it is recommended to leave it unset so the instance-type-based defaults are used.
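
If you do need to override it, the value can go in the instance user data; something like this (the number here is just illustrative):

[settings.kubernetes]
max-pods = 110

The same setting can also be changed on a running host with apiclient set.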

stmcginnis avatar Sep 13 '23 11:09 stmcginnis

Hi Sean,

When I check on our system, the eni-max-pods mapping file is used by nodes that are deployed by Karpenter, but not by nodes that are part of a managed node group. The instances deployed by the node group are using a value of 110 instead of the 58 from the mapping file.

% kubectl get nodes -o yaml | yq '.items[] | .metadata.name + "," + .metadata.labels."provisioner" + "," + .metadata.labels."node.kubernetes.io/instance-type" + "," + .status.capacity."pods"' | sort -t',' -k4 | column -s',' -t
ip-10-150-4-99.eu-west-1.compute.internal    managed-node-group  c5.xlarge   110
ip-10-150-6-119.eu-west-1.compute.internal   managed-node-group  c5.xlarge   110
ip-10-150-8-17.eu-west-1.compute.internal    managed-node-group  c5.xlarge   110
ip-10-150-6-243.eu-west-1.compute.internal   karpenter           c5.4xlarge  234
ip-10-150-6-115.eu-west-1.compute.internal   karpenter           c5.large    29
ip-10-150-6-188.eu-west-1.compute.internal   karpenter           c5.large    29
ip-10-150-7-122.eu-west-1.compute.internal   karpenter           c5.large    29
ip-10-150-6-195.eu-west-1.compute.internal   karpenter           m5.xlarge   58
ip-10-150-6-95.eu-west-1.compute.internal    karpenter           m5.xlarge   58
ip-10-150-7-130.eu-west-1.compute.internal   karpenter           m5.xlarge   58
ip-10-150-7-79.eu-west-1.compute.internal    karpenter           m5.xlarge   58
ip-10-150-7-90.eu-west-1.compute.internal    karpenter           m5.xlarge   58
ip-10-150-7-93.eu-west-1.compute.internal    karpenter           m5.xlarge   58

Extract from https://github.com/bottlerocket-os/bottlerocket/blob/develop/packages/os/eni-max-pods

ubasche-nex avatar Sep 13 '23 13:09 ubasche-nex

Hmm, I wonder if managed node groups default to passing in this value, causing the Bottlerocket default to be overwritten.

I'll try spinning up some hosts and see if I can tell what's happening. In the meantime, if you are able to connect to one of those hosts, could you run apiclient get settings.kubernetes and check whether the max-pods setting is there? That would indicate the setting is being set somewhere.
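
For reference, that check can be run from an SSM session or the control/admin container on the node:

apiclient get settings.kubernetes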

stmcginnis avatar Sep 13 '23 13:09 stmcginnis

Also wondering if you could share how you are deploying these managed node group instances. If you are using eksctl and a config file, can you confirm maxPodsPerNode was not present in the config?

stmcginnis avatar Sep 13 '23 14:09 stmcginnis

I am not able to reproduce this using eksctl. I created a managed node group with Bottlerocket using this config:

---
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: br127
  region: us-east-2
  version: '1.27'
managedNodeGroups:
  - name: mng1
    instanceType: c5.xlarge
    minSize: 1
    maxSize: 2
    desiredCapacity: 1
    amiFamily: Bottlerocket
    labels: { role: br-worker }
    tags:
      nodegroup-type: Bottlerocket
    disableIMDSv1: true
    iam:
      attachPolicyARNs:
        - arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy
        - arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy
        - arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly
        - arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
    bottlerocket:
      settings:
        motd: "Hello from eksctl!"

Then I ran eksctl create cluster -f config.yaml and checked the node settings:

$ kubectl get nodes -o yaml | yq '.items[] | .metadata.name + "," + .status.capacity.pods'  
ip-192-168-10-236.us-east-2.compute.internal,58

The node has the default value of 58.

Just as a sanity check, I scaled the node group up to 4 instances. Still seems happy though:

$ kubectl get nodes -o yaml | yq '.items[] | .metadata.name + "," + .status.capacity.pods' 
ip-192-168-10-236.us-east-2.compute.internal,58
ip-192-168-33-170.us-east-2.compute.internal,58
ip-192-168-40-116.us-east-2.compute.internal,58
ip-192-168-87-25.us-east-2.compute.internal,58

I did notice something interesting. Managed node groups use an auto scaling group with a launch template to manage the instances in the node group. You can see this by going to the AWS EC2 console and selecting Auto Scaling Groups in the bottom left. Selecting the auto scaling group should show the Details tab in the main content area on the right. There is a section called "Launch Template" that will have a link to the launch template being used by this group. If you select the template and go to the Advanced Details tab, scrolling down you can see the "User data" content.
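
If the console is a pain to dig through, the same user data can be pulled with the AWS CLI (the template ID here is a placeholder, and the user data comes back base64-encoded):

aws ec2 describe-launch-template-versions \
  --launch-template-id lt-0123456789abcdef0 \
  --versions '$Default' \
  --query 'LaunchTemplateVersions[0].LaunchTemplateData.UserData' \
  --output text | base64 --decode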

In my generated user data, MNG is actually setting settings.kubernetes.max-pods = 58 explicitly. In this case, that also happens to be the Bottlerocket default so it is just redundant. But if the wrong value is being set there, it will override the OS default for the instance type.

This looks like a bug in MNG to me. The max-pods value should not be included in the user data unless there is a specific need to override the defaults of the OS. It may be worth opening an AWS support ticket to engage this team requesting this setting be removed. I can try to find someone to help, but an official ticket from a customer will probably bear a little more weight. ;)

The somewhat good news is that you can modify the launch template to drop this setting yourself. You would need to create a new version of the template with this line removed from the user data, then set the auto scaling group to use this new revision.
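
Roughly, that flow with the CLI looks like this (IDs, names, and version numbers are placeholders):

# Create a new template version whose user data has the max-pods line removed.
# The edited user data has to be re-encoded as base64.
aws ec2 create-launch-template-version \
  --launch-template-id lt-0123456789abcdef0 \
  --source-version 3 \
  --launch-template-data '{"UserData":"<base64-of-edited-user-data>"}'

# Point the node group's auto scaling group at the new version.
aws autoscaling update-auto-scaling-group \
  --auto-scaling-group-name <node-group-asg-name> \
  --launch-template LaunchTemplateId=lt-0123456789abcdef0,Version=4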

That won't change existing instances, so you may need to migrate your hosts so new instances are spun up with (or in this case, without) the new settings, then have the old instances removed. It's unfortunate this is a few manual steps to correct, but the good news is that once this launch template is updated, you shouldn't need to touch it again.
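
One way to do that rollover is an instance refresh on the auto scaling group (name is a placeholder; tune the preferences to your disruption tolerance):

aws autoscaling start-instance-refresh \
  --auto-scaling-group-name <node-group-asg-name> \
  --preferences MinHealthyPercentage=90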

stmcginnis avatar Sep 13 '23 16:09 stmcginnis

It occurred to me this could be coming from the Karpenter settings and not MNG itself. Can you double-check the arguments and templates being used to launch these nodes with Karpenter? Looking through their docs and some of the source, there are a few places where the examples show passing this setting with a value of 110. I wonder if something inadvertently got copy/pasted that is causing this value to be passed in by Karpenter.
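
If it helps the search, this is roughly the shape of thing to look for in the Karpenter resources (a sketch assuming the v1alpha1 AWSNodeTemplate API; field names vary by Karpenter version):

apiVersion: karpenter.k8s.aws/v1alpha1
kind: AWSNodeTemplate
spec:
  amiFamily: Bottlerocket
  userData: |
    [settings.kubernetes]
    max-pods = 110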

stmcginnis avatar Sep 16 '23 20:09 stmcginnis

All the nodes started by Karpenter have the max-pods value that is derived from the instance type. The only nodes that have the 110 value are the ones that are not started by Karpenter but are started using a managed node group instead.

ubasche-nex avatar Sep 19 '23 07:09 ubasche-nex

OK, I was just curious whether you had deployed the managed node groups using Karpenter. Basically, I'm just looking for extra data points to see if this is specific to managed node groups, or if maybe eksctl was the source of the extra setting in the user data.

stmcginnis avatar Sep 19 '23 11:09 stmcginnis

Came across this issue and I can confirm that we experience the same thing. For the same Bottlerocket OS AMI version (1.15 now), the Karpenter provisioner sets the max pods value in UserData based on instance type. In contrast, with the Managed Node Group (same AMI) we do not pass any max pods value and it picks up 110.

Constantin07 avatar Oct 16 '23 21:10 Constantin07

Hello everyone, I faced this same issue trying to change settings.kubernetes.max-pods through Terraform, and I found the answer in the issue "Issue with allow-unsafe-sysctls kubernetes setting". It mentions the bottlerocket_user_data.tpl file, which concatenates [settings.kubernetes] at the beginning of the user data, so I just added the parameter "max-pods" = 50 at the top of bootstrap_extra_args and it worked (see the sketch below). Hope it helps somebody with the same scenario.
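
For anyone else using the terraform-aws-eks module, it ended up looking roughly like this (attribute names may differ between module versions):

bootstrap_extra_args = <<-EOT
  # bottlerocket_user_data.tpl already emits [settings.kubernetes] first,
  # so this key lands in that table.
  "max-pods" = 50
EOT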

aleonello avatar Jan 10 '24 18:01 aleonello

This looks like a bug in MNG to me. The max-pods value should not be included in the user data unless there is a specific need to override the defaults of the OS. It may be worth opening an AWS support ticket to engage this team requesting this setting be removed. I can try to find someone to help, but an official ticket from a customer will probably bear a little more weight. ;)

Just to pile onto this, it happens if you deploy the MNG via native CloudFormation as well. I am using a LaunchTemplate in mine, but with no explicit max-pods setting, so it appears to be coming from upstream inside the MNG magic.

redterror avatar Jun 30 '24 16:06 redterror