
Accelerated GPU instance NodePool definition yields error "no instance type satisfied resources"

csm-kb opened this issue 5 months ago · 1 comment

Description

Context:

Hey! I have Karpenter deployed very neatly to an EKS cluster using FluxCD to automatically manage Helm charts:

Helm release for Karpenter:
# including HelmRepository here, even though it is in a separate file
apiVersion: source.toolkit.fluxcd.io/v1beta2
kind: HelmRepository
metadata:
  name: karpenter
  namespace: flux-system
spec:
  type: "oci"
  url: oci://public.ecr.aws/karpenter
  interval: 30m
---
apiVersion: v1
kind: Namespace
metadata:
  name: karpenter
---
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: karpenter-crd
  namespace: karpenter
spec:
  interval: 5m
  chart:
    spec:
      chart: karpenter-crd
      version: ">=1.0.0 <2.0.0"
      sourceRef:
        kind: HelmRepository
        name: karpenter
        namespace: flux-system
  install:
    remediation:
      retries: 3
  values:
    webhook:
      enabled: true
      serviceName: karpenter
      serviceNamespace: karpenter
      port: 8443
---
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: karpenter
  namespace: karpenter
spec:
  interval: 5m
  chart:
    spec:
      chart: karpenter
      version: ">=1.0.0 <2.0.0"
      sourceRef:
        kind: HelmRepository
        name: karpenter
        namespace: flux-system
  install:
    remediation:
      retries: 3
  values:
    webhook:
      enabled: true
      port: 8443
    replicas: 2
    logLevel: debug
    controller:
      resources:
        requests:
          cpu: 1
          memory: 1Gi
        limits:
          cpu: 1
          memory: 1Gi
    settings:
      clusterName: "bench-cluster"
      interruptionQueue: "Karpenter-bench-cluster"
    serviceAccount:
      create: true
      annotations:
        eks.amazonaws.com/role-arn: "arn:aws:iam::<redacted>:role/KarpenterController-20240815204005347400000005"

I then have three NodePools (and associated EC2NodeClasses) that take different workloads; pods are routed to them by the affinities and tolerations they are launched with. The two NodePools that rely on normal compute instance types like C/M/R work very well, and Karpenter scales them flawlessly to serve those pods!
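
For context on the routing: each workload pod carries a toleration for the target pool's taint and a node affinity on the matching label, roughly like this (an illustrative sketch; the exact pod specs for the working pools are not reproduced in this issue, and the tc-base key is taken from the error output further down):

# illustrative sketch of how a pod targets a specific pool (here tc-base)
spec:
  tolerations:
  - key: example.com/taint-tc-base         # tolerate the pool's NoSchedule taint
    operator: Exists
    effect: NoSchedule
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: example.com/taint-tc-base # match the label the NodePool template applies
            operator: Exists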

However...

Observed Behavior:

The third NodePool is for workloads that require a G instance with NVIDIA compute to run.

Simple enough, right? YAML:

Karpenter resource definition YAML:
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: ep-nodeclass
spec:
  amiFamily: AL2
  role: "bench-main-ng-eks-node-group-20240620210345707900000001"
  subnetSelectorTerms:
    - tags:
        "karpenter.sh/discovery-bench-cluster": "true"
  securityGroupSelectorTerms:
    - tags:
        "karpenter.sh/discovery-bench-cluster": "true"
  amiSelectorTerms:
    # acquired from https://github.com/awslabs/amazon-eks-ami/releases
    - name: "amazon-eks-gpu-node-1.30-v*"
  kubelet:
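    # note: maxPods caps the total number of pods the kubelet will run on the node, daemonset pods included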
    maxPods: 1
---
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: ep-base
spec:
  template:
    metadata:
      labels:
        example.com/taint-ep-base: "true"
      annotations:
        Env: "staging"
        Project: "autotest"
    spec:
      taints:
      - key: example.com/taint-ep-base
        effect: NoSchedule
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
        - key: kubernetes.io/os
          operator: In
          values: ["linux"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        # - key: node.kubernetes.io/instance-type
        #   operator: In
        #   values: ["g5.2xlarge", "g6.2xlarge"]
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: ["g5", "g6"]
        - key: karpenter.k8s.aws/instance-gpu-count
          operator: In
          values: ["1"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: ep-nodeclass
      expireAfter: 168h # 7 * 24h = 168h
  limits:
    cpu: 64
    memory: 256Gi
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 1m

The CPU and memory limits are set just as they are for the other NodePools and, based on the docs, leave plenty of room for the G instance specs.
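
For concreteness: per the published EC2 specs, a g5.2xlarge or g6.2xlarge provides 8 vCPUs and 32 GiB of memory, so the cpu: 64 / memory: 256Gi limits above allow roughly eight such nodes before the pool would breach its limits.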

This is defined identically to the other functional NodePools, except for the G instance family specifications (particularly the newer card offerings).

When Karpenter takes this in and I launch a pod with the necessary Kubernetes specs:

# an Argo Workflow launches this pod
      resources: # not required; I have tried this without this subset
        requests:
          nvidia.com/gpu: 1
        limits:
          nvidia.com/gpu: 1
    affinity: # standard; works flawlessly to route pods to the tc-base and tc-heavy NodePools
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
          - matchExpressions:
            - key: "{{workflow.parameters.ep_pod_tolerance}}"
              operator: "Exists"

Karpenter validates it successfully and attempts to spin up a node to serve it... only to yield the following:

kubectl logs output (JSON, formatted):
{
    "level": "DEBUG",
    "time": "2024-08-26T21:51:35.612Z",
    "logger": "controller",
    "caller": "scheduling/scheduler.go:220",
    "message": "226 out of 801 instance types were excluded because they would breach limits",
    "commit": "62a726c",
    "controller": "provisioner",
    "namespace": "",
    "name": "",
    "reconcileID": "0c544d27-9a71-4c1d-9c72-839aea9c9238",
    "NodePool": {
        "name": "ep-base"
    }
}
{
    "level": "ERROR",
    "time": "2024-08-26T21:51:35.618Z",
    "logger": "controller",
    "caller": "provisioning/provisioner.go:355",
    "message": "could not schedule pod",
    "commit": "62a726c",
    "controller": "provisioner",
    "namespace": "",
    "name": "",
    "reconcileID": "0c544d27-9a71-4c1d-9c72-839aea9c9238",
    "Pod": {
        "name": "e2e-test-stage-kane-p7wck-edge-pipeline-pickle-2973982407",
        "namespace": "argo"
    },
    "error": "incompatible with nodepool \"tc-heavy\", daemonset overhead={\"cpu\":\"380m\",\"memory\":\"376Mi\",\"pods\":\"4\"}, did not tolerate example.com/taint-tc-heavy=:NoSchedule; incompatible with nodepool \"tc-base\", daemonset overhead={\"cpu\":\"380m\",\"memory\":\"376Mi\",\"pods\":\"4\"}, did not tolerate example.com/taint-tc-base=:NoSchedule; incompatible with nodepool \"ep-base\", daemonset overhead={\"cpu\":\"380m\",\"memory\":\"376Mi\",\"pods\":\"4\"}, no instance type satisfied resources {\"cpu\":\"380m\",\"memory\":\"376Mi\",\"pods\":\"5\"} and requirements karpenter.sh/capacity-type In [on-demand spot], karpenter.sh/nodepool In [ep-base], kubernetes.io/arch In [amd64], kubernetes.io/os In [linux], node.kubernetes.io/instance-type In [g5.2xlarge g6.2xlarge], example.com/taint-ep-base In [true] (no instance type has enough resources)",
    "errorCauses": [{
            "error": "incompatible with nodepool \"tc-heavy\", daemonset overhead={\"cpu\":\"380m\",\"memory\":\"376Mi\",\"pods\":\"4\"}, did not tolerate example.com/taint-tc-heavy=:NoSchedule"
        }, {
            "error": "incompatible with nodepool \"tc-base\", daemonset overhead={\"cpu\":\"380m\",\"memory\":\"376Mi\",\"pods\":\"4\"}, did not tolerate example.com/taint-tc-base=:NoSchedule"
        }, {
            "error": "incompatible with nodepool \"ep-base\", daemonset overhead={\"cpu\":\"380m\",\"memory\":\"376Mi\",\"pods\":\"4\"}, no instance type satisfied resources {\"cpu\":\"380m\",\"memory\":\"376Mi\",\"pods\":\"5\"} and requirements karpenter.sh/capacity-type In [on-demand spot], karpenter.sh/nodepool In [ep-base], kubernetes.io/arch In [amd64], kubernetes.io/os In [linux], node.kubernetes.io/instance-family In [g5 g6], example.com/taint-ep-base In [true] (no instance type has enough resources)"
        }
    ]
}

The scheduler filters instance types against the remaining limit headroom, but no matter what set of configs I try, the provisioner simply refuses to provision, without ever being more explicit about which resources are missing from the instance types it can see (even though the desired instance types very much support the small resource requirements it is reporting).

Notes and things I have tirelessly tried to get around this:

  • The EKS cluster and all associated infrastructure (managed by Terraform) are running in us-east-1.
  • I made sure there was plenty of spare G-instance quota in the AWS account to support this operation.
  • The run above had the workload pod not specify the nvidia.com/gpu resource requirement, to try to rule that out, to no avail.
  • I loosened the requirements to any G instance type and removed `instance-gpu-count`; observed the same issue.
  • I increased the resource limits to `cpu: 1000` and `memory: 1024Gi` (and removed the `nvidia.com/gpu` limit I had at one point) and watched the number of instance types filtered out for breaching limits decrease; observed the same issue. (A rough reconstruction of this loosened variant is sketched after this list.)
  • I modified/generified the EC2NodeClass AMI selection to use the latest version of each of the three supported families (AL2, AL2023, Bottlerocket) one after another; observed the same issue.
  • I drilled into the Karpenter source to understand where exactly this error is emitted, but did not modify it to obtain visibility into which of the 801 instance types the controller had discarded from consideration.
  • I made sure the nvidia-k8s-device-plugin was provisioned and running at the latest version in the kube-system namespace (there is a manually managed EC2 G-instance node group in this cluster that is also used for active workloads, as a live workaround).
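
For reference, the loosened variant mentioned in the bullets above looked roughly like this (reconstructed from the description, not the exact YAML that was applied; `karpenter.k8s.aws/instance-category` is a stand-in for "any G instance type"):

# rough reconstruction of the loosened attempt -- not the exact YAML used
# (spec.template.spec.requirements)
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
        - key: kubernetes.io/os
          operator: In
          values: ["linux"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        - key: karpenter.k8s.aws/instance-category   # stand-in for "any G instance type"
          operator: In
          values: ["g"]
        # instance-gpu-count requirement removed
# (spec.limits)
  limits:
    cpu: 1000
    memory: 1024Gi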

Expected Behavior:

One of two things:

  1. The Karpenter controller provides observability into which instance types it filtered out during resource selection and why, so that the error above can be disambiguated as a configuration error, an internal bug, or a cluster issue.
  2. The Karpenter controller selects a G instance type (like a nice g4dn/g5/g6.2xlarge), spawns it, assigns the pod to the node, and lets it work its magic.

Reproduction Steps (Please include YAML):

  1. Deploy the Karpenter resources YAML below to a clean, Karpenter-enabled EKS cluster (latest: v1.0.1):
Karpenter resources YAML:
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: ep-nodeclass
spec:
  amiFamily: AL2
  role: "bench-main-ng-eks-node-group-20240620210345707900000001"
  subnetSelectorTerms:
    - tags:
        "karpenter.sh/discovery-bench-cluster": "true"
  securityGroupSelectorTerms:
    - tags:
        "karpenter.sh/discovery-bench-cluster": "true"
  amiSelectorTerms:
    # acquired from https://github.com/awslabs/amazon-eks-ami/releases
    - name: "amazon-eks-gpu-node-1.30-v*"
  kubelet:
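    # note: maxPods caps the total number of pods the kubelet will run on the node, daemonset pods included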
    maxPods: 1
---
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: ep-base
spec:
  template:
    metadata:
      labels:
        example.com/taint-ep-base: "true"
      annotations:
        Env: "staging"
        Project: "autotest"
    spec:
      taints:
      - key: example.com/taint-ep-base
        effect: NoSchedule
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
        - key: kubernetes.io/os
          operator: In
          values: ["linux"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        # - key: node.kubernetes.io/instance-type
        #   operator: In
        #   values: ["g5.2xlarge", "g6.2xlarge"]
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: ["g5", "g6"]
        - key: karpenter.k8s.aws/instance-gpu-count
          operator: In
          values: ["1"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: ep-nodeclass
      expireAfter: 168h # 7 * 24h = 168h
  limits:
    cpu: 64
    memory: 256Gi
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 1m
  2. Attempt to run any test pod with the following Kubernetes requirements (a complete standalone Pod sketch follows this list):
YAML subset for requirements:
      resources:
        requests:
          nvidia.com/gpu: 1
        limits:
          nvidia.com/gpu: 1
    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
          - matchExpressions:
            - key: example.com/taint-ep-base
              operator: "Exists"
  3. Observe the Karpenter controller fail to provision a GPU node.
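
For a standalone repro outside Argo, a complete test Pod along the following lines exercises the same scheduling path (a sketch: the name, image, and the explicit toleration are illustrative additions, not taken from the original snippets):

# hypothetical standalone test pod -- name, image, and toleration are illustrative
apiVersion: v1
kind: Pod
metadata:
  name: gpu-scheduling-test
  namespace: default
spec:
  restartPolicy: Never
  tolerations:
  - key: example.com/taint-ep-base           # tolerate the ep-base NoSchedule taint
    operator: Exists
    effect: NoSchedule
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: example.com/taint-ep-base   # target nodes labeled by the ep-base NodePool
            operator: Exists
  containers:
  - name: cuda-test
    image: nvidia/cuda:12.4.1-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      requests:
        nvidia.com/gpu: 1
      limits:
        nvidia.com/gpu: 1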

Versions:

  • Chart Version: latest (">=1.0.0 <2.0.0")
  • Kubernetes Version (kubectl version):
Client Version: v1.29.2
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.30.2-eks-db838b0

csm-kb · Aug 28 '24