
Accelerated GPU instance NodePool definition yields error "no instance type satisfied resources"

Open csm-kb opened this issue 1 year ago • 10 comments

Description

Context:

Hey! I have Karpenter deployed very neatly to an EKS cluster using FluxCD to automatically manage Helm charts:

Helm release for Karpenter:
# including HelmRepository here, even though it is in a separate file
apiVersion: source.toolkit.fluxcd.io/v1beta2
kind: HelmRepository
metadata:
  name: karpenter
  namespace: flux-system
spec:
  type: "oci"
  url: oci://public.ecr.aws/karpenter
  interval: 30m
---
apiVersion: v1
kind: Namespace
metadata:
  name: karpenter
---
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: karpenter-crd
  namespace: karpenter
spec:
  interval: 5m
  chart:
    spec:
      chart: karpenter-crd
      version: ">=1.0.0 <2.0.0"
      sourceRef:
        kind: HelmRepository
        name: karpenter
        namespace: flux-system
  install:
    remediation:
      retries: 3
  values:
    webhook:
      enabled: true
      serviceName: karpenter
      serviceNamespace: karpenter
      port: 8443
---
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: karpenter
  namespace: karpenter
spec:
  interval: 5m
  chart:
    spec:
      chart: karpenter
      version: ">=1.0.0 <2.0.0"
      sourceRef:
        kind: HelmRepository
        name: karpenter
        namespace: flux-system
  install:
    remediation:
      retries: 3
  values:
    webhook:
      enabled: true
      port: 8443
    replicas: 2
    logLevel: debug
    controller:
      resources:
        requests:
          cpu: 1
          memory: 1Gi
        limits:
          cpu: 1
          memory: 1Gi
    settings:
      clusterName: "bench-cluster"
      interruptionQueue: "Karpenter-bench-cluster"
    serviceAccount:
      create: true
      annotations:
        eks.amazonaws.com/role-arn: "arn:aws:iam::<redacted>:role/KarpenterController-20240815204005347400000005"

I then have three NodePools (and associated EC2NodeClasses) that take different workloads, with pods routed to them via the affinities and tolerations they are launched with. The two NodePools that cover ordinary compute instance types (C/M/R) work very well, and Karpenter scales them and serves those pods flawlessly!

However...

Observed Behavior:

The third NodePool is for workloads that require a G instance with NVIDIA compute to run.

Simple enough, right? YAML:

Karpenter resource definition YAML:
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: ep-nodeclass
spec:
  amiFamily: AL2
  role: "bench-main-ng-eks-node-group-20240620210345707900000001"
  subnetSelectorTerms:
    - tags:
        "karpenter.sh/discovery-bench-cluster": "true"
  securityGroupSelectorTerms:
    - tags:
        "karpenter.sh/discovery-bench-cluster": "true"
  amiSelectorTerms:
    # acquired from https://github.com/awslabs/amazon-eks-ami/releases
    - name: "amazon-eks-gpu-node-1.30-v*"
  kubelet:
    maxPods: 1
---
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: ep-base
spec:
  template:
    metadata:
      labels:
        example.com/taint-ep-base: "true"
      annotations:
        Env: "staging"
        Project: "autotest"
    spec:
      taints:
      - key: example.com/taint-ep-base
        effect: NoSchedule
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
        - key: kubernetes.io/os
          operator: In
          values: ["linux"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        # - key: node.kubernetes.io/instance-type
        #   operator: In
        #   values: ["g5.2xlarge", "g6.2xlarge"]
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: ["g5", "g6"]
        - key: karpenter.k8s.aws/instance-gpu-count
          operator: In
          values: ["1"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: ep-nodeclass
      expireAfter: 168h # 7 * 24h = 168h
  limits:
    cpu: 64
    memory: 256Gi
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 1m

The CPU and memory limits are set just as they are for the other NodePools and, based on the docs, leave plenty of room for the G-instance specs.

This NodePool is defined identically to the other, working NodePools, except for the G instance-family requirements (particularly the newer card offerings).

When Karpenter takes this in and I launch a pod with the necessary Kubernetes specs:

# an Argo Workflow launches this pod
      resources: # not required; I have tried this without this subset
        requests:
          nvidia.com/gpu: 1
        limits:
          nvidia.com/gpu: 1
    affinity: # standard; works flawlessly to route pods to the tc-base and tc-heavy NodePools
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
          - matchExpressions:
            - key: "{{workflow.parameters.ep_pod_tolerance}}"
              operator: "Exists"

Karpenter validates it successfully and attempts to spin up a node to serve it... only to yield the following:

kubectl logs output (JSON, formatted):
{
    "level": "DEBUG",
    "time": "2024-08-26T21:51:35.612Z",
    "logger": "controller",
    "caller": "scheduling/scheduler.go:220",
    "message": "226 out of 801 instance types were excluded because they would breach limits",
    "commit": "62a726c",
    "controller": "provisioner",
    "namespace": "",
    "name": "",
    "reconcileID": "0c544d27-9a71-4c1d-9c72-839aea9c9238",
    "NodePool": {
        "name": "ep-base"
    }
}
{
    "level": "ERROR",
    "time": "2024-08-26T21:51:35.618Z",
    "logger": "controller",
    "caller": "provisioning/provisioner.go:355",
    "message": "could not schedule pod",
    "commit": "62a726c",
    "controller": "provisioner",
    "namespace": "",
    "name": "",
    "reconcileID": "0c544d27-9a71-4c1d-9c72-839aea9c9238",
    "Pod": {
        "name": "e2e-test-stage-kane-p7wck-edge-pipeline-pickle-2973982407",
        "namespace": "argo"
    },
    "error": "incompatible with nodepool \"tc-heavy\", daemonset overhead={\"cpu\":\"380m\",\"memory\":\"376Mi\",\"pods\":\"4\"}, did not tolerate example.com/taint-tc-heavy=:NoSchedule; incompatible with nodepool \"tc-base\", daemonset overhead={\"cpu\":\"380m\",\"memory\":\"376Mi\",\"pods\":\"4\"}, did not tolerate example.com/taint-tc-base=:NoSchedule; incompatible with nodepool \"ep-base\", daemonset overhead={\"cpu\":\"380m\",\"memory\":\"376Mi\",\"pods\":\"4\"}, no instance type satisfied resources {\"cpu\":\"380m\",\"memory\":\"376Mi\",\"pods\":\"5\"} and requirements karpenter.sh/capacity-type In [on-demand spot], karpenter.sh/nodepool In [ep-base], kubernetes.io/arch In [amd64], kubernetes.io/os In [linux], node.kubernetes.io/instance-type In [g5.2xlarge g6.2xlarge], example.com/taint-ep-base In [true] (no instance type has enough resources)",
    "errorCauses": [{
            "error": "incompatible with nodepool \"tc-heavy\", daemonset overhead={\"cpu\":\"380m\",\"memory\":\"376Mi\",\"pods\":\"4\"}, did not tolerate example.com/taint-tc-heavy=:NoSchedule"
        }, {
            "error": "incompatible with nodepool \"tc-base\", daemonset overhead={\"cpu\":\"380m\",\"memory\":\"376Mi\",\"pods\":\"4\"}, did not tolerate example.com/taint-tc-base=:NoSchedule"
        }, {
            "error": "incompatible with nodepool \"ep-base\", daemonset overhead={\"cpu\":\"380m\",\"memory\":\"376Mi\",\"pods\":\"4\"}, no instance type satisfied resources {\"cpu\":\"380m\",\"memory\":\"376Mi\",\"pods\":\"5\"} and requirements karpenter.sh/capacity-type In [on-demand spot], karpenter.sh/nodepool In [ep-base], kubernetes.io/arch In [amd64], kubernetes.io/os In [linux], node.kubernetes.io/instance-family In [g5 g6], example.com/taint-ep-base In [true] (no instance type has enough resources)"
        }
    ]
}

The scheduler runs checks to filter instance types against the available limit headroom -- but no matter what set of configs I try, the provisioner simply refuses to provision, without being any more explicit about which resources are missing from the instance types it can see (even though the desired instance types easily cover the small resource requirements it reports).
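
For scale, a single g5.2xlarge or g6.2xlarge is 8 vCPU / 32 GiB, which sits comfortably inside the cpu: 64 / memory: 256Gi limits above. The limit check is made against the NodePool's aggregate usage, which Karpenter reports in the resource's status; a hypothetical excerpt of what that looks like (field names per the v1 API as I understand them, values purely illustrative):

# kubectl get nodepool ep-base -o yaml   (status excerpt; values illustrative)
status:
  resources:
    cpu: "0"        # aggregate CPU of the nodes this NodePool currently owns
    memory: "0"     # spec.limits is enforced against these totals plus the candidate node
    pods: "0"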

Notes and things I have tirelessly tried to get around this:

  • The EKS cluster and all associated infrastructure (managed by Terraform) are running in us-east-1.
  • I made sure there was plenty of spare G-instance quota in the AWS account to support this operation.
  • In the run above, the workload pod did not specify the nvidia.com/gpu resource requirement, to try to rule that out; no change.
  • I loosened the requirements to any G instance type and removed instance-gpu-count; observed the same issue.
  • I increased the resource limits to cpu: 1000 and memory: 1024Gi (and removed the nvidia.com/gpu limit I had at one point) and watched the filtered instance type count decrease; observed the same issue.
  • I modified/generified the EC2NodeClass AMI selection to use the latest version of any of the three supported types (AL2, AL2023, Bottlerocket) one after another; observed the same issue.
  • I drilled into Karpenter source to understand where exactly this error is emitted, but did not modify it to obtain visibility into which of the 801 instance types the controller had discarded for consideration.
  • I made sure the nvidia-k8s-device-plugin was provisioned and up to date in the kube-system namespace (there is a manually managed EC2 G-instance node group in this cluster that is also used for active workloads as a live workaround); see the sketch just after this list for what a healthy GPU node advertises.
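
For reference, once the device plugin is healthy on a GPU node, the node advertises the GPU as an extended resource in its status; a rough sketch of the relevant excerpt for a single-GPU instance (node name and values are illustrative):

# kubectl get node <gpu-node-name> -o yaml   (status excerpt; values illustrative)
status:
  capacity:
    cpu: "8"
    memory: 32299400Ki
    nvidia.com/gpu: "1"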

Expected Behavior:

One of two things:

  1. The Karpenter controller provides observability into which instance types it filtered out during resource selection and why, so the error above can be disambiguated as a config error, an internal Karpenter bug, or a cluster problem.
  2. The Karpenter controller selects a G instance type (like a nice g4dn/g5/g6.2xlarge), spawns it, assigns the pod to the node, and lets it work its magic.

Reproduction Steps (Please include YAML):

  1. Deploy the Karpenter resources YAML below to a clean, Karpenter-enabled EKS cluster (latest: v1.0.1):
Karpenter resources YAML:
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: ep-nodeclass
spec:
  amiFamily: AL2
  role: "bench-main-ng-eks-node-group-20240620210345707900000001"
  subnetSelectorTerms:
    - tags:
        "karpenter.sh/discovery-bench-cluster": "true"
  securityGroupSelectorTerms:
    - tags:
        "karpenter.sh/discovery-bench-cluster": "true"
  amiSelectorTerms:
    # acquired from https://github.com/awslabs/amazon-eks-ami/releases
    - name: "amazon-eks-gpu-node-1.30-v*"
  kubelet:
    maxPods: 1
---
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: ep-base
spec:
  template:
    metadata:
      labels:
        example.com/taint-ep-base: "true"
      annotations:
        Env: "staging"
        Project: "autotest"
    spec:
      taints:
      - key: example.com/taint-ep-base
        effect: NoSchedule
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
        - key: kubernetes.io/os
          operator: In
          values: ["linux"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        # - key: node.kubernetes.io/instance-type
        #   operator: In
        #   values: ["g5.2xlarge", "g6.2xlarge"]
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: ["g5", "g6"]
        - key: karpenter.k8s.aws/instance-gpu-count
          operator: In
          values: ["1"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: ep-nodeclass
      expireAfter: 168h # 7 * 24h = 168h
  limits:
    cpu: 64
    memory: 256Gi
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 1m
  2. Attempt to run any test pod with the following Kubernetes requirements:
YAML subset for requirements:
      resources:
        requests:
          nvidia.com/gpu: 1
        limits:
          nvidia.com/gpu: 1
    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
          - matchExpressions:
            - key: example.com/taint-ep-base
              operator: "Exists"
  3. Observe the Karpenter controller fail to provision a GPU node.

Versions:

  • Chart Version: latest (">=1.0.0 <2.0.0")
  • Kubernetes Version (kubectl version):
Client Version: v1.29.2
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.30.2-eks-db838b0
  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

csm-kb avatar Aug 28 '24 02:08 csm-kb

I observed the same issue as you today.

rlindsberg avatar Sep 18 '24 21:09 rlindsberg

I am observing this bug as well. I have defined an nvidia.com/gpu resource requirement in my deployment manifest, and I have a separate gpu-nodepool that uses a Bottlerocket-AMI nodeclass. The only constraint I have placed is instance-gpu-count: 1. For some reason Karpenter is rejecting a g5g.xlarge node claim, which has 3000m+ CPU, and is not able to schedule a deployment that requires only 150m. Please help.

mDSaifZia avatar Oct 21 '24 08:10 mDSaifZia

I think I figured it out. If it's a new account, check your Karpenter pod logs: the GPU instances might fail to launch due to MaxSpotInstances exceeded. Your node claim will then be deleted, and it will be reported that there are no instances available to satisfy your requirements. You may check these docs: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-spot-limits.html

mDSaifZia avatar Oct 24 '24 06:10 mDSaifZia

the gpu instances might fail to launch due to MaxSpotInstances exceeded

Hmm, but the node pool specifies both spot and on-demand. So if a spot node cannot satisfy the request, it should switch to on-demand nodes. Right?

rlindsberg avatar Oct 24 '24 09:10 rlindsberg

You're right; as per the AWS docs, it should fall back to on-demand. My only other guess involves daemonsets. Somewhere it is mentioned that Karpenter adds up daemonset requests as well and includes them in the required resources, and if that total can't be satisfied by the g5/g6 families defined above it will not schedule. But even then, I doubt that is the actual cause here.
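
That adding-up is visible in the error log in the issue above; a rough breakdown of the ep-base line, with the numbers taken straight from the log (the workload pod requested no cpu/memory, and no nvidia.com/gpu, in that run):

daemonset overhead       : cpu 380m, memory 376Mi, pods 4
the workload pod itself  : cpu 0,    memory 0,     pods 1
required on the new node : cpu 380m, memory 376Mi, pods 5   <- what "no instance type satisfied"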

mDSaifZia avatar Oct 24 '24 09:10 mDSaifZia

I can confirm I was facing the same issue ("no instance type satisfied resources") when trying to use G6, but had no issues with G4dn.

Update: issue fixed ✅ The problem was that our NodePool was restricted to specific Availability Zones (AZs), and the configured AZs did not offer this instance type.
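
A NodePool is typically pinned to AZs with a zone requirement like the one below; if the listed AZs don't offer the instance family, every offering gets filtered out and you end up with this kind of error. Minimal sketch (the key is the standard well-known label; the zone value is only an example):

# requirement inside spec.template.spec.requirements of the NodePool
- key: topology.kubernetes.io/zone
  operator: In
  values: ["us-east-1a"]   # if g6 is not offered here, no offering survives filtering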

isaac88 avatar Dec 19 '24 16:12 isaac88

Will take a look

GnatorX avatar Feb 10 '25 23:02 GnatorX

Do you have a matching toleration for this taint?

      taints:
      - key: example.com/taint-ep-base
        effect: NoSchedule
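
A matching toleration on the pod would look roughly like this (a sketch derived from the taint above):

tolerations:
- key: example.com/taint-ep-base
  operator: Exists
  effect: NoSchedule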

Can you show the full pod spec?

GnatorX avatar Feb 11 '25 00:02 GnatorX

We faced a similar issue—G6 instance types weren’t available in the AZ where our EBS-backed StatefulSet was created. I would have expected Karpenter to surface a warning or error about this.

Michael-Noma avatar Feb 12 '25 16:02 Michael-Noma

Any updates on this issue? I'm facing it too: no matter whether I try to provision G4dn or other GPU machines, I hit the "daemonset overhead" error from Karpenter, and those instance types are available in the AZ.

ronbutbul avatar May 12 '25 11:05 ronbutbul