Bug: Max pods and allocatable RAM are incorrectly calculated (switching a t4g.small from AL2023 to Bottlerocket drops allocatable RAM from 1437MB to 288MB)
I thought this was a bug at first, but then I RTFM'd and saw it's documented behavior, though it's very unintuitive/unexpected in my opinion.
To use Karpenter you need an MNG with at least 2 baseline nodes. Originally I was using AL2023 for my baseline nodes, but since Bottlerocket is theoretically more secure and recommended, I figured I'd switch. When I did, I got pending pods, and when I found out why, it seemed odd. At first I concluded it wasn't a bug, since the behavior is documented, so I figured I'd write it up as a feature request. But after looking again, I think it's a bug.
What I'm using
- Kube 1.31
- aws-cdk to deploy (using the Layer 2 CDK constructs for EKS & MNG, not the Layer 3 EKS Blueprints construct, which I've found to be buggy and unmaintained)
- t4g.small (2 vCPU / 2GB RAM)
- Bottlerocket AMI = amazon/bottlerocket-aws-k8s-1.31-aarch64-v1.35.0-af533f46
- AL2023 AMI = amazon/amazon-eks-node-al2023-arm64-standard-1.31-v20250403
Here's what's unintuitive/unexpected: I switched my baseline nodes from AL2023 to Bottlerocket and saw I had pending pods because the nodes had insufficient memory. There's no way these nodes should be out of RAM, but kubectl describe node said exactly that.
kubectl describe node against a t4g.small (2 vCPU / 2GB RAM):
AL2023:
Allocatable:
  cpu:    1930m
  memory: 1403648Ki <-- 1437MB (74.2% of the node's RAM is allocatable to pods)
  pods:   110       <-- odd
Bottlerocket:
Allocatable:
  cpu:    1930m
  memory: 282052Ki  <-- 288MB??? (15% of the node's RAM is allocatable to pods)
  pods:   110       <-- odd
The reason the max pods value is odd: https://github.com/aws/amazon-vpc-cni-k8s/blob/master/misc/eni-max-pods.txt says a t4g.small can support a max of 11 pods.
At first I thought this was just a feature request, because the allocatable RAM behavior is documented:
https://bottlerocket.dev/en/os/1.31.x/api/settings/kubernetes/#kube-reserved
It says memory_to_reserve = max_num_pods * 11 + 255
110 * 11 + 255 = 1465Mi of RAM to reserve, which explains why I see only ~288MB of RAM available: the node's actual memory capacity is already well under 2GiB once the kernel takes its share, and the kubelet's hard-eviction threshold comes off the top as well.
But now I think it's a bug: a t4g.small can support a max of 11 pods, so the bug is that memory_to_reserve is using the wrong value of max_num_pods. It's using 110 pods when it should be using the instance-specific max_num_pods documented in this table: https://github.com/aws/amazon-vpc-cni-k8s/blob/master/misc/eni-max-pods.txt
So Bottlerocket should have calculated 11 * 11 + 255 = 376Mi of RAM to reserve, which would have led to an allocatable value much closer to AL2023's.
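For comparison, here's what that corrected reservation would look like if expressed as an explicit Bottlerocket setting (just a sketch to make the numbers concrete, not what I actually deployed; the value comes from plugging 11 into the documented formula):

# Hypothetical user-data override for a t4g.small (ENI limit: 11 pods).
# Pins kube-reserved memory to what the documented formula yields with the
# instance-specific max pods: 11 * 11 + 255 = 376Mi.
[settings.kubernetes.kube-reserved]
memory = "376Mi"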
I read this issue, then stumbled upon an old issue where someone explained that Bottlerocket defaults to 110 max pods when it doesn't know which instance type it's dealing with:
Karpenter simulates pod scheduling and provisions instances by discovering instance types from EC2 and binpacking pods onto the node. It uses a formula (14 * AttachableENICount) to compute the max pods value. It also binds the pods before the node comes online as an optimization. If Bottlerocket is unaware of a new instance type, it will default to MaxPods of 110, which is woefully short of the actual number of pods that can be scheduled using the AWS VPC CNI.
Note that the value is 11 in the docs for this instance type https://karpenter.sh/docs/reference/instance-types/#resources-791
Since you mentioned Karpenter, there's an important data point to mention (which I recently discovered after filing this issue):
- This issue doesn't show up with karpenter.sh: a Bottlerocket t4g.small provisioned by Karpenter works fine and correctly sets max pods to 11.
- It was the AWS CDK Managed Node Group provisioned Bottlerocket t4g.small that incorrectly defaulted to 110 max pods. I was able to patch around it with a userdata config override that explicitly forces max pods to 11 (see the sketch below).
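For anyone else hitting this, the override I used looked roughly like the following (a minimal sketch; how you attach the user data depends on your launch template / CDK setup):

# Bottlerocket user-data override (sketch) to force the instance-specific max pods
# on a t4g.small. With max-pods at 11, the documented kube-reserved formula should
# reserve 11 * 11 + 255 = 376Mi instead of 1465Mi.
[settings.kubernetes]
"max-pods" = 11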
Another consideration: if you enable prefix delegation on the AWS VPC CNI, it also defaults to 110 max pods. Not sure if that's in play here, but thought I would point it out:
https://docs.aws.amazon.com/eks/latest/userguide/cni-increase-ip-addresses.html#cni-increase-ip-addresses-considerations
I can confirm your observations:
Both nodes are on the same VPC CNI controller with prefixes enabled, installed with AWS EKS Blueprints addons via Terraform. One node is a t3a type, but I don't think that affects this.
Karpenter node: beta.kubernetes.io/instance-type=t3a.small
Capacity:
  cpu:                2
  ephemeral-storage:  53182Mi
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             1935376Ki
  pods:               8
Allocatable:
  cpu:                1930m
  ephemeral-storage:  49115090042
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             1481744Ki
  pods:               8
EKS Managed node group node: beta.kubernetes.io/instance-type=t3.small
Capacity:
  cpu:                2
  ephemeral-storage:  31678Mi
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             1929232Ki
  pods:               110
Allocatable:
  cpu:                1930m
  ephemeral-storage:  28821369602
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             326672Ki
  pods:               110
I will try setting max-pods-per-node in my node group.
Thought I would mention that https://github.com/bottlerocket-os/bottlerocket/issues/1721 gives a bit of background on the general issue.
Specifically this comment https://github.com/bottlerocket-os/bottlerocket/issues/1721#issuecomment-2775407893
TL;DR - Amazon AMIs use a legacy pattern based on the old ENI IP limit and Docker with cgroup v1 as the runtime.
What I ended up doing is adding extra bootstrap args to my Terraform that fine-tune the settings for my case (t3a.small):
bootstrap_extra_args = <<-EOT
[settings.kubernetes]
"max-pods" = 24
[settings.kubernetes.kube-reserved]
cpu = "100m"
memory = "500Mi"
ephemeral-storage = "1Gi"
EOT
https://github.com/bottlerocket-os/bottlerocket/issues/1721#issuecomment-2739761932 is an example of what I may try in the Karpenter EC2NodeClass to allow more than the old pod limits (8 pods on a t3.small is very low considering 3-4 of those are system pods: vpc / ebs / identity agent, etc.):
kubelet:
  maxPods: 110
  systemReserved:
    cpu: 100m
    memory: 100Mi
    ephemeral-storage: 1Gi
  kubeReserved:
    cpu: 100m
    memory: 1465Mi
    ephemeral-storage: 1Gi
The only issue is that Karpenter launches various different instance types, so finding the correct settings for the above will take some discovery.
Not sure if this is related to this thread, but we switched to Bottlerocket just yesterday and our OOM alerts suddenly spiked to an all-time high. It was running fine in dev, but things went sideways in production. The only fix that worked was allocating more resources.