fix: accurately track allocatable resources for nodes
Fixes https://github.com/aws/karpenter-provider-aws/issues/5161
Description
The current method of estimating allocatable memory, which simply discards a percentage of usable memory via the `VM_MEMORY_OVERHEAD_PERCENT` global setting, is suboptimal: there is no single value that avoids both overestimating and underestimating allocatable memory.
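As a rough illustration of how that estimate behaves (a simplified sketch, not the provider's actual code; the function name and values below are mine):

```go
// Simplified sketch of the percentage-based estimate; not Karpenter's actual code.
package main

import (
	"fmt"
	"math"

	"k8s.io/apimachinery/pkg/api/resource"
)

// estimatedMemory discounts the instance type's advertised memory by a single
// global percentage, regardless of what the node's kernel actually reports.
func estimatedMemory(advertised resource.Quantity, vmMemoryOverheadPercent float64) resource.Quantity {
	overhead := int64(math.Ceil(float64(advertised.Value()) * vmMemoryOverheadPercent))
	estimate := advertised.DeepCopy()
	estimate.Sub(*resource.NewQuantity(overhead, resource.BinarySI))
	return estimate
}

func main() {
	advertised := resource.MustParse("4096Mi") // e.g. t4g.medium
	for _, pct := range []float64{0, 0.075, 0.2} {
		est := estimatedMemory(advertised, pct)
		// With 0 the estimate is too optimistic; with a large value it is too
		// pessimistic for many instance types. No single value fits all.
		fmt.Printf("overhead=%.3f -> estimated allocatable ~%.0fMi\n", pct, float64(est.Value())/(1024*1024))
	}
}
```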
Cluster-autoscaler addresses this issue by learning the true allocatable memory from actual nodes and retaining that information. In this pull request, I'm applying the same concept.
To demonstrate the issue:
- Set `VM_MEMORY_OVERHEAD_PERCENT` to 0.
- Create a NodePool with a single instance type:
```yaml
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: approaching-allocatable-nodepool-0
spec:
  limits:
    cpu: "18"
    memory: 36Gi
  template:
    metadata:
      labels:
        approaching-allocatable: nodepool-0
    spec:
      nodeClassRef:
        apiVersion: karpenter.k8s.aws/v1beta1
        kind: EC2NodeClass
        name: approaching-allocatable-nodeclass-0
      requirements:
        - key: node.kubernetes.io/instance-type
          operator: In
          values:
            - t4g.medium
      taints:
        - effect: NoExecute
          key: approaching-allocatable
          value: "nodepool-0"
      kubelet:
        systemReserved:
          memory: "1Ki"
        kubeReserved:
          memory: "1Ki"
        evictionHard:
          memory.available: "1Ki"
```
- Create a workload with a request close to the node's allocatable:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: approaching-allocatable-pod
  namespace: default
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: approaching-allocatable
                operator: In
                values:
                  - nodepool-0
  containers:
    - image: public.ecr.aws/eks-distro/kubernetes/pause@sha256:c2518f6d82392ba799d551398805aaa7af70548015263d962afe9710c0eaa1b2
      name: trigger-pod
      resources:
        requests:
          cpu: 10m
          memory: 3686Mi
  tolerations:
    - effect: NoExecute
      key: approaching-allocatable
      operator: Equal
      value: nodepool-0
```
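For context, these numbers are my own illustration rather than part of the upstream issue: a t4g.medium advertises 4Gi (4096Mi) of memory, and 3686Mi is roughly 90% of that. With `VM_MEMORY_OVERHEAD_PERCENT=0` and the 1Ki reservations above, Karpenter predicts close to 4096Mi allocatable, so the pod appears to fit, while the kernel on the real node reports noticeably less, so no t4g.medium can ever run it.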
Observed behaviors
- Resolving Resource Overestimation:
  - v0.37.0 behavior: Karpenter continuously creates and consolidates nodes without realizing that the workload can never fit.
  - Patched behavior: Karpenter tracks the actual allocatable resources, preventing the endless loop of node creation and consolidation.
- Addressing Resource Underestimation:
  - v0.37.0 behavior: Karpenter leaves pods pending indefinitely or chooses a larger instance type than necessary, and does not learn from the actual allocatable of nodes launched for other reasons.
  - Patched behavior: Karpenter remembers the true allocatable resources of any node it has launched, enabling correct node launches for previously pending pods.
- Avoiding Extra Churn:
  - v0.37.0 behavior: Incorrectly predicted allocatable resources during consolidation lead to unnecessary churn.
  - Patched behavior: Scheduling simulations benefit from knowledge of the true allocatable resources.
The above improvements are implemented using a shared cache (sketched below) that can be accessed from:
- the `lifecycle` package: to populate the cache as soon as a node is registered.
- the `scheduling` package: to use real allocatable resources from the cache, when available, in `itFits` decisions.
- the `hash` package: to flush the cache for a nodepool after an update.
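For illustration, here is a minimal sketch of what such a shared cache could look like, assuming a standalone package; the names, layout, and signatures are hypothetical and not necessarily what this PR introduces:

```go
// Hypothetical sketch of a shared allocatable cache; not the exact code in this PR.
package allocatable

import (
	"sync"

	v1 "k8s.io/api/core/v1"
)

// key identifies a NodePool / instance-type combination.
type key struct {
	nodePool     string
	instanceType string
}

// Cache remembers the allocatable resources reported by real nodes.
type Cache struct {
	mu      sync.RWMutex
	entries map[key]v1.ResourceList
}

func NewCache() *Cache {
	return &Cache{entries: map[key]v1.ResourceList{}}
}

// Set records the true allocatable once a node registers (lifecycle package).
func (c *Cache) Set(nodePool, instanceType string, allocatable v1.ResourceList) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.entries[key{nodePool, instanceType}] = allocatable
}

// Get returns the remembered allocatable, if any, so scheduling simulations
// can prefer it over the VM_MEMORY_OVERHEAD_PERCENT-based estimate.
func (c *Cache) Get(nodePool, instanceType string) (v1.ResourceList, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	allocatable, ok := c.entries[key{nodePool, instanceType}]
	return allocatable, ok
}

// Flush drops all entries for a NodePool after its spec changes (hash package),
// since kubelet reservations affect allocatable.
func (c *Cache) Flush(nodePool string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	for k := range c.entries {
		if k.nodePool == nodePool {
			delete(c.entries, k)
		}
	}
}
```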
I tried to avoid introducing a global-like package, but placing the cache in any of the above packages (or others) introduces more coupling between those packages. If there is a definitive place for such a cache, please let me know.
How was this change tested?
For overestimation:
I ran this in one of our preprod EKS clusters with `vmMemoryOverheadPercent=0`, and Karpenter correctly stops re-launching nodes for a given nodepool/instance-type combination after the first attempt fails. It also uses the correct allocatable memory for scheduling.
For underestimation, the test was to:
- Set a high `VM_MEMORY_OVERHEAD_PERCENT` value (like 0.2).
- Run the workload that fit before and observe that it stays pending.
- Add another workload for the same nodepool, but with a lower request; this launches a real node.
- Observe that another node is then launched for the pod from step 2, and that new pods with the same requests now correctly cause new nodes to be launched.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.