Karpenter simulates node capacity incorrectly causing pod scheduling to fail
Description
Observed Behavior: Karpenter provisions a node that doesn't fit the pending pod plus the daemonsets. We have a pending pod with
requests:
  memory: 3000Mi
  cpu: 1600m
In addition, there is one daemonset (filebeat) with a 100Mi memory request, and other daemonsets that have no requests/limits set.
We see in the Karpenter logs that it's choosing a c5.large with 3788Mi capacity:
2023-06-29T09:13:08.744Z INFO controller.provisioner launching machine with 1 pods requesting {"cpu":"1825m","memory":"3100Mi","pods":"6"} from types c6id.8xlarge, r6a.metal, r5a.2xlarge, r5.24xlarge, r5a.xlarge and 327 other(s) {"commit": "698f22f-dirty", "provisioner": "default"}
2023-06-29T09:17:15.295Z INFO controller.provisioner.cloudprovider launched instance {"commit": "698f22f-dirty", "provisioner": "default", "id": "i-0308a361de8d92373", "hostname": "ip-172-31-77-218.ec2.internal", "instance-type": "c5.large", "zone": "us-east-1a", "capacity-type": "spot", "capacity": {"cpu":"2","ephemeral-storage":"30Gi","memory":"3788Mi","pods":"29"}}
Once the node becomes ready, we see that its allocatable capacity isn't enough for the pending pods, which need a sum of 3100Mi for their requests. The c5.large has allocatable capacity of 3106640Ki == 3033.828125Mi, which is < 3100Mi, so the pending pod doesn't get scheduled; the filebeat daemonset pod does.
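To spell out the arithmetic (a quick check in Python, using only the numbers reported below):

# Quick sanity check with the values reported by the node.
allocatable_ki = 3106640                 # node allocatable memory from `kubectl describe node`
allocatable_mi = allocatable_ki / 1024   # ~= 3033.8 Mi

pending_requests_mi = 3000 + 100         # pending pod (3000Mi) + filebeat daemonset (100Mi)

print(f"allocatable ~= {allocatable_mi:.1f}Mi, requested = {pending_requests_mi}Mi, "
      f"shortfall ~= {pending_requests_mi - allocatable_mi:.1f}Mi")
# allocatable ~= 3033.8Mi, requested = 3100Mi, shortfall ~= 66.2Mi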
Allocatable:
attachable-volumes-aws-ebs: 25
cpu: 1930m
ephemeral-storage: 27905944324
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 3106640Ki
pods: 29
System Info:
Machine ID: ec28eeb5bc1c12c3ee34e95e7f7ee23b
System UUID: ec28eeb5-bc1c-12c3-ee34-e95e7f7ee23b
Boot ID: e6b622f1-7177-48dd-b244-05d4c89583ba
Kernel Version: 5.4.242-156.349.amzn2.x86_64
OS Image: Amazon Linux 2
Operating System: linux
Architecture: amd64
Container Runtime Version: containerd://1.6.19
Kubelet Version: v1.23.17-eks-0a21954
Kube-Proxy Version: v1.23.17-eks-0a21954
ProviderID: aws:///us-east-1a/i-0308a361de8d92373
Non-terminated Pods: (4 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits Age
--------- ---- ------------ ---------- --------------- ------------- ---
kube-system aws-node-p7kvl 25m (1%) 0 (0%) 0 (0%) 0 (0%) 19m
kube-system ebs-csi-node-t4p5d 0 (0%) 0 (0%) 0 (0%) 0 (0%) 19m
kube-system filebeat-filebeat-97btk 100m (5%) 1 (51%) 100Mi (3%) 200Mi (6%) 19m
kube-system kube-proxy-pzvjl 100m (5%) 0 (0%) 0 (0%) 0 (0%) 19m
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Warning FailedScheduling 83s default-scheduler 0/14 nodes are available: 1 node(s) had volume node affinity conflict, 2 Insufficient memory, 2 node(s) were unschedulable, 3 node(s) didn't match Pod's node affinity/selector, 6 Insufficient cpu.
Expected Behavior: For Karpenter to provision a node with at least 3100Mi of allocatable memory (not ~3033Mi), so the pending pod can be scheduled on it.
Reproduction Steps (Please include YAML): Pending pod YAML:
apiVersion: v1
kind: Pod
metadata:
  name: nightly-backend-kafka-kafka-0
  namespace: kafka
spec:
  containers:
    - name: test
      image: alpine
      command:
        - sleep
        - 99d
      resources:
        limits:
          cpu: 1600m
          memory: 3200Mi
        requests:
          cpu: 1600m
          memory: 3000Mi
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: karpenter.k8s.aws/instance-category
                operator: NotIn
                values:
                  - t
DaemonSet YAML:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: filebeat-filebeat
  namespace: kube-system
  labels:
    app: filebeat-filebeat
spec:
  selector:
    matchLabels:
      app: filebeat-filebeat
      release: filebeat
  template:
    metadata:
      name: filebeat-filebeat
      labels:
        app: filebeat-filebeat
        release: filebeat  # added so the template labels satisfy .spec.selector
    spec:
      containers:
        - name: filebeat
          image: docker.elastic.co/beats/filebeat:7.12.1
          args:
            - '-e'
            - '-E'
            - http.enabled=true
          env:
            - name: POD_NAMESPACE
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: metadata.namespace
            - name: NODE_NAME
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: spec.nodeName
          resources:
            limits:
              cpu: '1'
              memory: 200Mi
            requests:
              cpu: 100m
              memory: 100Mi
      priorityClassName: system-node-critical
status:
  currentNumberScheduled: 11
  numberMisscheduled: 0
  desiredNumberScheduled: 11
  numberReady: 11
  observedGeneration: 25
  updatedNumberScheduled: 11
  numberAvailable: 11
Versions:
- Chart Version: v0.27.5
- Kubernetes Version (kubectl version): 1.23
- Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
- Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
- If you are interested in working on this issue or have submitted a pull request, please leave a comment
What does your Provisioner look like, as well as your karpenter-global-settings? Karpenter uses a concept called vmMemoryOverheadPercent, since all EC2 instances come with some unknown overhead that is consumed by the OS/fabric layer and can't be known through the API, so we skim some capacity off the top to better estimate what the actual capacity of the instance will be.
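Roughly, the estimate works like this (a simplified sketch in Python, not Karpenter's actual code; the kube-reserved formula and eviction threshold below are the EKS-optimized AMI defaults and are assumptions used only for illustration):

# Simplified illustration of the memory estimate for a c5.large (not the real implementation).
# Assumed values: EKS AMI defaults of kube-reserved = 255Mi + 11Mi * maxPods and a
# 100Mi hard-eviction threshold.
instance_memory_mib = 4096              # c5.large advertised memory (4 GiB)
vm_memory_overhead_percent = 0.075      # aws.vmMemoryOverheadPercent

capacity_estimate = instance_memory_mib * (1 - vm_memory_overhead_percent)
# ~= 3788 Mi, which matches the "memory":"3788Mi" capacity in the launch log above

max_pods = 29
kube_reserved = 255 + 11 * max_pods     # ~= 574 Mi (assumed default)
eviction_threshold = 100                # Mi (assumed default)

allocatable_estimate = capacity_estimate - kube_reserved - eviction_threshold
print(f"estimated allocatable ~= {allocatable_estimate:.0f}Mi")   # ~= 3115 Mi, just above the 3100Mi of requests

# The node actually registered with ~3033Mi allocatable, so the 7.5% skim was slightly
# optimistic for this instance type and the pending pod no longer fit.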
Hey @jonathan-innis, this is the provisioner:
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  labels:
    usage: apps-spot
  requirements:
    - key: karpenter.k8s.aws/instance-category
      operator: In
      values: [t, c, m, r]
    - key: karpenter.k8s.aws/instance-generation
      operator: Gt
      values: ["2"]
    - key: "kubernetes.io/arch"
      operator: In
      values: ["amd64"]
    - key: "karpenter.sh/capacity-type"
      operator: In
      values: ["spot", "on-demand"]
    - key: kubernetes.io/os
      operator: In
      values: ["linux"]
  consolidation:
    enabled: true
  providerRef:
    name: default
And the karpenter-global-settings:
data:
  aws.clusterEndpoint: ''
  aws.clusterName: nightly-backend
  aws.defaultInstanceProfile: KarpenterNodeInstanceProfile-nightly-backend
  aws.enableENILimitedPodDensity: 'true'
  aws.enablePodENI: 'false'
  aws.interruptionQueueName: nightly-backend-karpenter
  aws.isolatedVPC: 'false'
  aws.nodeNameConvention: ip-name
  aws.vmMemoryOverheadPercent: '0.075'
  batchIdleDuration: 1s
  batchMaxDuration: 10s
  featureGates.driftEnabled: 'false'
Would you suggest increasing aws.vmMemoryOverheadPercent? Shouldn't Karpenter be aware of the overhead, which is somewhat fixed per instance type in the default AMI?
"Would you suggest increasing aws.vmMemoryOverheadPercent?"
I was able to repro this, and I'd recommend bumping it up to a higher value, 0.08, as a workaround.
"Shouldn't Karpenter be aware of the overhead, which is somewhat fixed per instance type in the default AMI?"
We are working on improving this rough estimate to be more accurate, since you're correct that it does seem to be somewhat fixed per instance type. There's an issue currently tracking moving this away from a rough percentage: aws/karpenter-core#716
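To see why a small bump helps in this particular case, here is the same rough estimate as above (same illustrative assumptions, which are not Karpenter's exact defaults) with different overhead percentages:

# Same simplified estimate as above, varying aws.vmMemoryOverheadPercent (illustration only).
def estimated_allocatable_mib(instance_memory_mib, overhead_percent,
                              max_pods=29, eviction_threshold=100):
    capacity = instance_memory_mib * (1 - overhead_percent)
    kube_reserved = 255 + 11 * max_pods          # assumed EKS AMI default
    return capacity - kube_reserved - eviction_threshold

for pct in (0.075, 0.08, 0.1):
    est = estimated_allocatable_mib(4096, pct)   # c5.large
    print(f"overhead={pct}: estimated allocatable ~= {est:.0f}Mi, fits 3100Mi: {est >= 3100}")
# overhead=0.075 -> ~3115Mi (c5.large looks big enough, but actually isn't)
# overhead=0.08  -> ~3094Mi (c5.large would be ruled out, so something bigger gets picked)
# overhead=0.1   -> ~3012Mi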
Thank you very much! 🙏🏻
0.08 was not enough for us. We had to bump it to 0.1.
"was not enough for us"
Which instance types are you using that required you to bump it up to 0.1?
A little bit of everything, to be honest. I'm not sure which instance type was causing the flapping, but it stopped. Our provisioner spec's requirement is quite broad:
requirements:
  - key: "karpenter.k8s.aws/instance-category"
    operator: In
    values: ["c", "m", "r"]
This can happen if you enable hugepages on the worker node. In that case Karpenter cannot properly estimate the available RAM on the newly created node, and the scheduler fails to place the pod on it.
@project-administrator Linking https://github.com/aws/karpenter-core/issues/751 since it has the details of extended resource support for a bunch of different things, including hugepages.
I'm not as familiar with hugepages and how they affect memory, so can you provide an example of how one affects the other in this case?
@jonathan-innis Hugepages is a Linux kernel feature that is recommended for some memory-hungry products like databases, Java apps, etc. Enabling hugepages for such products (where recommended) usually improves performance.
After enabling transparent huge pages (THP) with sysctl there can also be undesired effects, such as suboptimal memory usage: THP may promote smaller pages to huge pages even when it is not beneficial, which can lead to increased memory usage.
I believe this is what happens: ordinary OS processes suddenly start using much more RAM, and Karpenter can no longer estimate how much RAM the newly created EC2 node actually has available after startup.
For example, in our case Karpenter spins up a Bottlerocket-based node with 8GB of RAM, and the scheduler is no longer able to fit a workload with requests: memory: 3Gi.
It's obviously not a good idea to use such a small node with transparent huge pages, but I don't know of any way to give Karpenter a hint that nodes have less available RAM in the OS after hugepages are enabled.
A node with THP enabled might have significantly less available RAM than Karpenter expects it to have.
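For the pre-allocated hugepages case that aws/karpenter-core#751 tracks, the size of the gap is easy to quantify (a rough illustration; the node size and page count below are hypothetical):

# Rough illustration (hypothetical numbers): memory reserved for pre-allocated hugepages is
# not usable for regular allocations, but Karpenter sizes the node as if all of it were.
node_memory_mib = 8 * 1024           # 8Gi node
nr_hugepages = 1024                  # vm.nr_hugepages with 2Mi pages
hugepages_mib = nr_hugepages * 2     # 2048 Mi carved out at boot

regular_memory_mib = node_memory_mib - hugepages_mib   # 6144 Mi before kube/system reservations
print(f"memory left for regular workloads ~= {regular_memory_mib}Mi")
# THP is different: nothing is reserved up front, but page promotion can inflate the RAM that
# ordinary processes actually use, which is harder to account for ahead of time.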
This issue has been inactive for 14 days. StaleBot will close this stale issue after 14 more days of inactivity.
/unassign @jonathan-innis
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten