
Bug: Max pods and allocatable RAM are incorrectly calculated (switching a t4g.small from AL2023 to BR drops allocatable RAM from 1437MB to 288MB)

Open neoakris opened this issue 8 months ago • 6 comments

I thought this was a bug at first, but then I RTFM'd and saw it was documented behavior, though it's very unintuitive/unexpected in my opinion.

To use Karpenter you need an MNG with at least 2 baseline nodes. Originally I was using AL2023 for my baseline nodes, but since Bottlerocket is theoretically more secure and is the recommended option, I figured I'd switch. When I did, I got pending pods, and when I found out why it seemed odd. Is this a bug? At first I guessed not, since the behavior is documented, so I figured I'd write this up as a feature request. But after looking again, I think it's a bug.

What I'm using

  • Kube 1.31
  • aws-cdk to deploy (using the Layer 2 CDK constructs for EKS & MNG, not the Layer 3 EKS Blueprints construct, as I've found it to be buggy and unmaintained)
  • t4g.small (2 vCPU, 2 GB RAM)
  • Bottlerocket AMI = amazon/bottlerocket-aws-k8s-1.31-aarch64-v1.35.0-af533f46
  • AL2023 AMI = amazon/amazon-eks-node-al2023-arm64-standard-1.31-v20250403

Here's what's unintuitive/unexpected: I switched my baseline nodes from AL2023 to Bottlerocket and saw I had pending pods because the nodes had insufficient memory. I thought there's no way these nodes are out of RAM, but kubectl describe node said just that.

kubectl describe node against a t4g.small (2 vCPU, 2 GB RAM)
AL2023:

Allocatable:
  cpu:                1930m
  memory:             1403648Ki   <-- 1437MB   (74.2% of the node's RAM is allocatable to pods)
  pods:               110  <-- odd

BottleRocket:

Allocatable:
  cpu:                1930m
  memory:             282052Ki  <-- 288MB???  (15% of the node's RAM is allocatable to pods)
  pods:               110  <-- odd

The reason the max pods value is odd: https://github.com/aws/amazon-vpc-cni-k8s/blob/master/misc/eni-max-pods.txt says a t4g.small can support a max of 11 pods.

At first I thought this was just a feature request, because the allocatable RAM behavior is documented: https://bottlerocket.dev/en/os/1.31.x/api/settings/kubernetes/#kube-reserved says memory_to_reserve = max_num_pods * 11 + 255, so 110 * 11 + 255 = 1465 MB of RAM to reserve, which explains why I see only 288MB available.

But now I think it's a bug: a t4g.small can support a max of 11 pods, so the bug is that memory_to_reserve is using the wrong value of max_num_pods. It's using 110 pods, when it should be using the instance-specific max_num_pods documented in this table: https://github.com/aws/amazon-vpc-cni-k8s/blob/master/misc/eni-max-pods.txt

So Bottlerocket should have calculated 11 * 11 + 255 = 376 MB of RAM to reserve, which would have led to an allocatable value much closer to AL2023's.

neoakris avatar Apr 12 '25 02:04 neoakris

I read this issue, then stumbled upon an old issue where someone said Bottlerocket defaults to 110 max pods when it doesn't know which instance type it's dealing with:

Karpenter simulates pod scheduling and provisions instances by discovering instance types from EC2 and binpacking pods onto the node. It uses a formula (14 * AttachableENICount) to compute the max pods value. It also binds the pods before the node comes online as an optimization. If Bottlerocket is unaware of a new instance type, it will default to MaxPods of 110, which is woefully short of the actual number of pods that can be scheduled using the AWS VPC CNI.

Note that the value is 11 in the docs for this instance type https://karpenter.sh/docs/reference/instance-types/#resources-791

awoimbee avatar Apr 17 '25 14:04 awoimbee

Since you mentioned Karpenter, there's an important data point to mention (which I recently discovered after filing this issue):

  • This issue doesn't show up for karpenter.sh: a Bottlerocket t4g.small provisioned by Karpenter works fine and correctly sets max pods to 11.
  • It was the AWS CDK Managed Node Group provisioned Bottlerocket t4g.small that incorrectly defaulted to 110 max pods. I was able to patch around it with a user data config override that explicitly forces max pods to 11 (a sketch of what that looks like is below).
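
For anyone curious, a minimal sketch of what such an override can look like, assuming the CDK managed node group passes raw Bottlerocket TOML through its launch template user data (settings.kubernetes.max-pods is the documented Bottlerocket setting; the exact CDK wiring is omitted):

[settings.kubernetes]
# force the ENI-based pod limit for t4g.small instead of the 110-pod default
max-pods = 11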

neoakris avatar Apr 17 '25 15:04 neoakris

Another consideration: if you enable prefix delegation on the AWS VPC CNI, it also defaults to 110 max pods. Not sure if that applies in this case, but I thought I would point it out.

https://docs.aws.amazon.com/eks/latest/userguide/cni-increase-ip-addresses.html#cni-increase-ip-addresses-considerations

jaredcdep avatar Apr 19 '25 19:04 jaredcdep

I can confirm your observations (Karpenter-provisioned Bottlerocket nodes get an instance-specific max pods, while MNG-provisioned ones default to 110):

Both nodes are on the same VPC CNI controller with prefix delegation enabled, installed with AWS EKS Blueprints addons via Terraform. One node is a t3a type, but I don't think that affects this.

Karpenter node: beta.kubernetes.io/instance-type=t3a.small

Capacity:
  cpu:                2
  ephemeral-storage:  53182Mi
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             1935376Ki
  pods:               8
Allocatable:
  cpu:                1930m
  ephemeral-storage:  49115090042
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             1481744Ki
  pods:               8

EKS Managed node group node: beta.kubernetes.io/instance-type=t3.small

Capacity:
  cpu:                2
  ephemeral-storage:  31678Mi
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             1929232Ki
  pods:               110
Allocatable:
  cpu:                1930m
  ephemeral-storage:  28821369602
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             326672Ki
  pods:               110

I will try setting max-pods-per-node in my node group.

jaredcdep avatar Apr 19 '25 19:04 jaredcdep

Thought I would mention that this issue https://github.com/bottlerocket-os/bottlerocket/issues/1721 gives a bit of background on the general problem.

Specifically this comment: https://github.com/bottlerocket-os/bottlerocket/issues/1721#issuecomment-2775407893. TL;DR: Amazon AMIs use a legacy pattern based on the old ENI IP limit and Docker with cgroup v1 as the runtime.

What I ended up doing is adding extra boot args to my Terraform that fine-tune the settings for my case (t3a.small):

bootstrap_extra_args = <<-EOT
  [settings.kubernetes]
  "max-pods" = 24
  [settings.kubernetes.kube-reserved]
  cpu = "100m"
  memory = "500Mi"
  ephemeral-storage = "1Gi"
EOT

https://github.com/bottlerocket-os/bottlerocket/issues/1721#issuecomment-2739761932 is an example of what I may try in the Karpenter EC2NodeClass to allow more than the old pod limits (8 pods on a t3.small is very low considering 3-4 of those are system pods and the vpc/ebs/identity agents, etc.):

kubelet:
  maxPods: 110
  systemReserved:
    cpu: 100m
    memory: 100Mi
    ephemeral-storage: 1Gi
  kubeReserved:
    cpu: 100m
    memory: 1465Mi
    ephemeral-storage: 1Gi

The only issue is that Karpenter launches various different instance types, so finding the correct settings for the above will take some discovery.
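
For reference, a minimal sketch of how that kubelet block might sit inside an EC2NodeClass, assuming Karpenter v1, where kubelet settings live on the EC2NodeClass rather than the NodePool (the name, AMI alias, and role below are placeholders, not values from this thread):

# sketch only: kubelet overrides on a Karpenter v1 EC2NodeClass
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: bottlerocket-example        # placeholder name
spec:
  amiSelectorTerms:
    - alias: bottlerocket@latest    # placeholder AMI selector
  role: KarpenterNodeRole-example   # placeholder node IAM role
  kubelet:
    maxPods: 110
    systemReserved:
      cpu: 100m
      memory: 100Mi
      ephemeral-storage: 1Gi
    kubeReserved:
      cpu: 100m
      memory: 1465Mi
      ephemeral-storage: 1Gi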

jaredcdep avatar Apr 20 '25 08:04 jaredcdep

Not sure if this is related to this thread, but we switched to Bottlerocket just yesterday and suddenly our OOM alerts spiked to an all-time high. It was running fine in dev, but things went sideways in production. The only fix that worked was allocating more resources.


adiii717 avatar May 26 '25 05:05 adiii717