
Karpenter simulates node capacity incorrectly causing pod scheduling to fail

Open · itaib-AF opened this issue on Jun 29, 2023 · 15 comments

Description

Observed Behavior: Karpenter provisions a node that doesn't fit the pending pod plus the DaemonSets. We have a pending pod with:

  requests:
    memory: 3000Mi
    cpu: 1600m

In addition, there are these DaemonSets: one (filebeat) with a 100Mi memory request, and others with no requests/limits set.

We see in the Karpenter logs that it chooses a c5.large with 3788Mi of capacity:

2023-06-29T09:13:08.744Z	INFO	controller.provisioner	launching machine with 1 pods requesting {"cpu":"1825m","memory":"3100Mi","pods":"6"} from types c6id.8xlarge, r6a.metal, r5a.2xlarge, r5.24xlarge, r5a.xlarge and 327 other(s)	{"commit": "698f22f-dirty", "provisioner": "default"}

2023-06-29T09:17:15.295Z	INFO	controller.provisioner.cloudprovider	launched instance	{"commit": "698f22f-dirty", "provisioner": "default", "id": "i-0308a361de8d92373", "hostname": "ip-172-31-77-218.ec2.internal", "instance-type": "c5.large", "zone": "us-east-1a", "capacity-type": "spot", "capacity": {"cpu":"2","ephemeral-storage":"30Gi","memory":"3788Mi","pods":"29"}}

Once the node becomes ready, we see that its allocatable capacity isn't enough for the pending pods, which together request 3100Mi. The c5.large has 3106640Ki == ~3033.8Mi allocatable, which is < 3100Mi, so the pending pod doesn't get scheduled; the filebeat DaemonSet pod does.

Allocatable:
  attachable-volumes-aws-ebs:  25
  cpu:                         1930m
  ephemeral-storage:           27905944324
  hugepages-1Gi:               0
  hugepages-2Mi:               0
  memory:                      3106640Ki
  pods:                        29
System Info:
  Machine ID:                 ec28eeb5bc1c12c3ee34e95e7f7ee23b
  System UUID:                ec28eeb5-bc1c-12c3-ee34-e95e7f7ee23b
  Boot ID:                    e6b622f1-7177-48dd-b244-05d4c89583ba
  Kernel Version:             5.4.242-156.349.amzn2.x86_64
  OS Image:                   Amazon Linux 2
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  containerd://1.6.19
  Kubelet Version:            v1.23.17-eks-0a21954
  Kube-Proxy Version:         v1.23.17-eks-0a21954
ProviderID:                   aws:///us-east-1a/i-0308a361de8d92373
Non-terminated Pods:          (4 in total)
  Namespace                   Name                       CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                   ----                       ------------  ----------  ---------------  -------------  ---
  kube-system                 aws-node-p7kvl             25m (1%)      0 (0%)      0 (0%)           0 (0%)         19m
  kube-system                 ebs-csi-node-t4p5d         0 (0%)        0 (0%)      0 (0%)           0 (0%)         19m
  kube-system                 filebeat-filebeat-97btk    100m (5%)     1 (51%)     100Mi (3%)       200Mi (6%)     19m
  kube-system                 kube-proxy-pzvjl           100m (5%)     0 (0%)      0 (0%)           0 (0%)         19m
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)

The pending pod's scheduling event:

  Warning  FailedScheduling  83s   default-scheduler  0/14 nodes are available: 1 node(s) had volume node affinity conflict, 2 Insufficient memory, 2 node(s) were unschedulable, 3 node(s) didn't match Pod's node affinity/selector, 6 Insufficient cpu.

Expected Behavior: Karpenter provisions a node with at least 3100Mi of allocatable memory (not ~3033Mi), so the pending pod can be scheduled on it.

Reproduction Steps (Please include YAML):

Pending pod YAML:

apiVersion: v1
kind: Pod
metadata:
  name: nightly-backend-kafka-kafka-0
  namespace: kafka
spec:
  containers:
    - name: test
      image: alpine
      command:
        - sleep
        - 99d
      resources:
        limits:
          cpu: 1600m
          memory: 3200Mi
        requests:
          cpu: 1600m
          memory: 3000Mi
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: karpenter.k8s.aws/instance-category
                operator: NotIn
                values:
                  - t

DaemonSet YAML:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: filebeat-filebeat
  namespace: kube-system
  labels:
    app: filebeat-filebeat
status:
  currentNumberScheduled: 11
  numberMisscheduled: 0
  desiredNumberScheduled: 11
  numberReady: 11
  observedGeneration: 25
  updatedNumberScheduled: 11
  numberAvailable: 11
spec:
  selector:
    matchLabels:
      app: filebeat-filebeat
      release: filebeat
  template:
    metadata:
      name: filebeat-filebeat
      labels:
        app: filebeat-filebeat
        release: filebeat
    spec:
      containers:
        - name: filebeat
          image: docker.elastic.co/beats/filebeat:7.12.1
          args:
            - '-e'
            - '-E'
            - http.enabled=true
          env:
            - name: POD_NAMESPACE
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: metadata.namespace
            - name: NODE_NAME
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: spec.nodeName
          resources:
            limits:
              cpu: '1'
              memory: 200Mi
            requests:
              cpu: 100m
              memory: 100Mi
      priorityClassName: system-node-critical

Versions:

  • Chart Version: v0.27.5
  • Kubernetes Version (kubectl version): 1.23

itaib-AF avatar Jun 29 '23 10:06 itaib-AF

What do your Provisioner and karpenter-global-settings look like? Karpenter uses a concept called vmMemoryOverheadPercent, since all EC2 instances come with some overhead that is consumed by the OS/fabric layer and can't be known through the API, so we skim some capacity off the top to better estimate what the actual capacity of the instance will be.
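
To put rough numbers on it using what's in this issue (back-of-the-envelope; a c5.large advertises 4GiB = 4096Mi):

c5.large advertised memory:                  4096Mi
Karpenter's estimate: 4096Mi x (1 - 0.075)   ≈ 3788Mi  (the "capacity" in the launch log above)
requests Karpenter tried to place:           3000Mi (pod) + 100Mi (filebeat DaemonSet) = 3100Mi
allocatable the kubelet actually reported:   3106640Ki ≈ 3033Mi

So by Karpenter's model the 3100Mi of requests just barely fit on a c5.large, but the real node comes up with roughly 3033Mi allocatable, so they don't.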

jonathan-innis avatar Jun 29 '23 18:06 jonathan-innis

Hey @jonathan-innis, this is the Provisioner:

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  labels:
    usage: apps-spot
  requirements:
    - key: karpenter.k8s.aws/instance-category
      operator: In
      values: [t, c, m, r]
    - key: karpenter.k8s.aws/instance-generation
      operator: Gt
      values: ["2"]
    - key: "kubernetes.io/arch"
      operator: In
      values: ["amd64"]
    - key: "karpenter.sh/capacity-type"
      operator: In
      values: ["spot", "on-demand"]
    - key: kubernetes.io/os
      operator: In
      values: ["linux"]
  consolidation:
    enabled: true
  providerRef:
    name: default

And the karpenter-global-settings:

data:
  aws.clusterEndpoint: ''
  aws.clusterName: nightly-backend
  aws.defaultInstanceProfile: KarpenterNodeInstanceProfile-nightly-backend
  aws.enableENILimitedPodDensity: 'true'
  aws.enablePodENI: 'false'
  aws.interruptionQueueName: nightly-backend-karpenter
  aws.isolatedVPC: 'false'
  aws.nodeNameConvention: ip-name
  aws.vmMemoryOverheadPercent: '0.075'
  batchIdleDuration: 1s
  batchMaxDuration: 10s
  featureGates.driftEnabled: 'false'

Would you suggest increasing aws.vmMemoryOverheadPercent? Shouldn't Karpenter be aware of the overhead, which is somewhat fixed per instance type in the default AMI?

itaib-AF avatar Jul 03 '23 07:07 itaib-AF

Would you suggest increasing aws.vmMemoryOverheadPercent

I was able to repro this, and I'd recommend bumping this up to a higher value (0.08) as a workaround.
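
For reference, that's this key in the karpenter-global-settings ConfigMap you posted above (a sketch; 0.08 as a starting point):

data:
  aws.vmMemoryOverheadPercent: '0.08'

It's a cluster-wide setting, so it shaves the same percentage off every instance type Karpenter considers.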

Shouldn't Karpenter be aware of the overhead, which is somewhat fixed per instance type in the default AMI

We are working on making this rough estimate more accurate, since you're correct that it does seem to be somewhat fixed per instance. There's an issue currently tracking the work to move away from a rough percentage: aws/karpenter-core#716

jonathan-innis avatar Jul 03 '23 18:07 jonathan-innis

Thank you very much! 🙏🏻

itaib-AF avatar Jul 03 '23 18:07 itaib-AF

0.08 was not enough for us. We had to bump it to 0.1.

mamoit avatar Jul 13 '23 10:07 mamoit

was not enough for us

Which instance types are you using that required you to bump it up to 0.1?

jonathan-innis avatar Jul 18 '23 18:07 jonathan-innis

A little bit of everything, to be honest. I'm not sure which instance type was causing the flapping, but it stopped. Our provisioner spec's requirements are quite broad:

requirements:
- key: "karpenter.k8s.aws/instance-category"
  operator: In
  values: ["c", "m", "r"]

mamoit avatar Jul 25 '23 13:07 mamoit

This can happen if you enable hugepages on the worker node. In that case Karpenter cannot properly estimate the available RAM on the newly created node, and the scheduler fails to place the pod on the new node.

project-administrator avatar Nov 10 '23 07:11 project-administrator

@project-administrator Linking https://github.com/aws/karpenter-core/issues/751 since it has the details of extended resource support for a bunch of different things, including hugepages.

I'm not as familiar with hugepages and how they affect memory, so can you provide an example of how one affects the other in this case?

jonathan-innis avatar Nov 10 '23 15:11 jonathan-innis

@jonathan-innis hugepages is a Linux kernel feature that is recommended for some memory-hungry products like DBs, Java apps, etc., and enabling it (where recommended) usually improves performance. After enabling transparent huge pages (THP) with sysctl there can be undesired effects as well, such as suboptimal memory usage: THP may promote smaller pages to huge pages even when it is not beneficial, which can lead to increased memory usage.

I believe this is what happens: ordinary OS processes suddenly start using much more RAM, and Karpenter can no longer estimate how much RAM a newly created EC2 node will actually have available after startup. For example, in our case Karpenter spins up a Bottlerocket-based node with 8GB of RAM, and the scheduler is no longer able to fit a workload with requests: memory: 3Gi. It's obviously not a great idea to use such a small node with transparent huge pages, but I don't know of any way to give Karpenter a hint that nodes have less available RAM in the OS after hugepages are enabled. A node with THP enabled might have significantly less available RAM than Karpenter expects it to have.
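
For completeness, the statically pre-allocated case (which the extended-resource issue linked above covers) is even easier to see. As far as I understand it, memory reserved at boot via a sysctl like the one below is carved out of the node's memory resource and surfaced to the kubelet as a separate hugepages-2Mi / hugepages-1Gi quantity, so the node reports correspondingly less allocatable memory than an otherwise identical node, on top of the usual OS/kubelet reservations, and Karpenter's percentage-based estimate has no way to know about it. Example only, with made-up numbers:

# pre-allocate 512 x 2Mi pages = 1Gi at boot; this 1Gi no longer shows up in the node's "memory" allocatable
vm.nr_hugepages = 512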

project-administrator avatar Nov 11 '23 11:11 project-administrator

This issue has been inactive for 14 days. StaleBot will close this stale issue after 14 more days of inactivity.

github-actions[bot] avatar Dec 14 '23 12:12 github-actions[bot]

/unassign @jonathan-innis

billrayburn avatar Jan 23 '24 19:01 billrayburn

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Apr 22 '24 19:04 k8s-triage-robot

/remove-lifecycle stale

druchoo avatar May 14 '24 15:05 druchoo

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Aug 12 '24 15:08 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Sep 11 '24 15:09 k8s-triage-robot