fix: accurately track allocatable resources for nodes
Fixes https://github.com/aws/karpenter-provider-aws/issues/5161
Description
The current method of estimating allocatable memory, which simply discards a percentage of usable memory via the `VM_MEMORY_OVERHEAD_PERCENT` global setting, is suboptimal: there is no single value that avoids both overestimating and underestimating allocatable memory.
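As a rough illustration of how that estimate behaves (a simplified sketch, not the provider's actual code; the function name and values below are mine):

```go
// Simplified sketch of the percentage-based estimate; not Karpenter's actual code.
package main

import (
	"fmt"
	"math"

	"k8s.io/apimachinery/pkg/api/resource"
)

// estimatedMemory discounts the instance type's advertised memory by a single
// global percentage, regardless of what the node's kernel actually reports.
func estimatedMemory(advertised resource.Quantity, vmMemoryOverheadPercent float64) resource.Quantity {
	overhead := int64(math.Ceil(float64(advertised.Value()) * vmMemoryOverheadPercent))
	estimate := advertised.DeepCopy()
	estimate.Sub(*resource.NewQuantity(overhead, resource.BinarySI))
	return estimate
}

func main() {
	advertised := resource.MustParse("4096Mi") // e.g. t4g.medium
	for _, pct := range []float64{0, 0.075, 0.2} {
		est := estimatedMemory(advertised, pct)
		// With 0 the estimate is too optimistic; with a large value it is too
		// pessimistic for many instance types. No single value fits all.
		fmt.Printf("overhead=%.3f -> estimated allocatable ~%.0fMi\n", pct, float64(est.Value())/(1024*1024))
	}
}
```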
Cluster-autoscaler addresses this issue by learning the true allocatable memory from actual nodes and retaining that information. In this pull request, I'm applying the same concept.
To demonstrate the issue:
- Set `VM_MEMORY_OVERHEAD_PERCENT` to 0.
- Create a NodePool with a single instance type:
```yaml
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: approaching-allocatable-nodepool-0
spec:
  limits:
    cpu: "18"
    memory: 36Gi
  template:
    metadata:
      labels:
        approaching-allocatable: nodepool-0
    spec:
      nodeClassRef:
        apiVersion: karpenter.k8s.aws/v1beta1
        kind: EC2NodeClass
        name: approaching-allocatable-nodeclass-0
      requirements:
        - key: node.kubernetes.io/instance-type
          operator: In
          values:
            - t4g.medium
      taints:
        - effect: NoExecute
          key: approaching-allocatable
          value: "nodepool-0"
      kubelet:
        systemReserved:
          memory: "1Ki"
        kubeReserved:
          memory: "1Ki"
        evictionHard:
          memory.available: "1Ki"
```
- Create a workload with a request close to the node's allocatable:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: approaching-allocatable-pod
  namespace: default
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: approaching-allocatable
                operator: In
                values:
                  - nodepool-0
  containers:
    - image: public.ecr.aws/eks-distro/kubernetes/pause@sha256:c2518f6d82392ba799d551398805aaa7af70548015263d962afe9710c0eaa1b2
      name: trigger-pod
      resources:
        requests:
          cpu: 10m
          memory: 3686Mi
  tolerations:
    - effect: NoExecute
      key: approaching-allocatable
      operator: Equal
      value: nodepool-0
```
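For context, these numbers are my own illustration rather than part of the upstream issue: a t4g.medium advertises 4Gi (4096Mi) of memory, and 3686Mi is roughly 90% of that. With `VM_MEMORY_OVERHEAD_PERCENT=0` and the 1Ki reservations above, Karpenter predicts close to 4096Mi allocatable, so the pod appears to fit, while the kernel on the real node reports noticeably less, so no t4g.medium can ever run it.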
Observed behaviors
- Resolving Resource Overestimation:
  - v0.37.0 behavior: Karpenter continuously creates and consolidates nodes without realizing that the workload can never fit.
  - Patched behavior: Karpenter tracks the actual allocatable resources, preventing the endless loop of node creation and consolidation.
- Addressing Resource Underestimation:
  - v0.37.0 behavior: Karpenter leaves pods pending indefinitely or chooses a larger instance type than necessary, and does not learn from the actual allocatable of nodes launched for other reasons.
  - Patched behavior: Karpenter remembers the true allocatable resources of any node it has launched, enabling correct node launches for previously pending pods.
- Avoiding Extra Churn:
  - v0.37.0 behavior: Incorrectly predicted allocatable resources during consolidation lead to unnecessary churn.
  - Patched behavior: Scheduling simulations benefit from knowledge of the true allocatable resources.
The above improvements are implemented using a shared cache (sketched below) that can be accessed from:
- the `lifecycle` package: to populate the cache as soon as a node is registered.
- the `scheduling` package: to use real allocatable resources from the cache, when available, in `itFits` decisions.
- the `hash` package: to flush the cache for a nodepool after an update.
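For illustration, here is a minimal sketch of what such a shared cache could look like, assuming a standalone package; the names, layout, and signatures are hypothetical and not necessarily what this PR introduces:

```go
// Hypothetical sketch of a shared allocatable cache; not the exact code in this PR.
package allocatable

import (
	"sync"

	v1 "k8s.io/api/core/v1"
)

// key identifies a NodePool / instance-type combination.
type key struct {
	nodePool     string
	instanceType string
}

// Cache remembers the allocatable resources reported by real nodes.
type Cache struct {
	mu      sync.RWMutex
	entries map[key]v1.ResourceList
}

func NewCache() *Cache {
	return &Cache{entries: map[key]v1.ResourceList{}}
}

// Set records the true allocatable once a node registers (lifecycle package).
func (c *Cache) Set(nodePool, instanceType string, allocatable v1.ResourceList) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.entries[key{nodePool, instanceType}] = allocatable
}

// Get returns the remembered allocatable, if any, so scheduling simulations
// can prefer it over the VM_MEMORY_OVERHEAD_PERCENT-based estimate.
func (c *Cache) Get(nodePool, instanceType string) (v1.ResourceList, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	allocatable, ok := c.entries[key{nodePool, instanceType}]
	return allocatable, ok
}

// Flush drops all entries for a NodePool after its spec changes (hash package),
// since kubelet reservations affect allocatable.
func (c *Cache) Flush(nodePool string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	for k := range c.entries {
		if k.nodePool == nodePool {
			delete(c.entries, k)
		}
	}
}
```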
I tried to avoid introducing a global-like package, but placing the cache in any of the above packages (or others) introduces more coupling between those packages. If there is a definitive place for such a cache, please let me know.
How was this change tested?
For overestimation:
I ran this in one of our preprod EKS clusters with `vmMemoryOverheadPercent=0`, and Karpenter correctly stops re-launching nodes for a given nodepool/instance-type combination after the first attempt fails. It also uses the correct allocatable memory for scheduling.
For underestimation, the test was to:
- Set a high `VM_MEMORY_OVERHEAD_PERCENT` value (like 0.2).
- Run the workload that fit before and observe that it stays pending.
- Add another workload for the same nodepool, but with a lower request; this launches a real node.
- Observe that another node is then launched for the pod from step 2, and that new pods with the same requests now correctly cause new nodes to be launched.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.