karpenter-provider-aws
Accelerated GPU instance NodePool definition yields error "no instance type satisfied resources"
Description
Context:
Hey! I have Karpenter deployed very neatly to an EKS cluster using FluxCD to automatically manage Helm charts:
(click to expand) Helm release for Karpenter
# including HelmRepository here, even though it is in a separate file
apiVersion: source.toolkit.fluxcd.io/v1beta2
kind: HelmRepository
metadata:
  name: karpenter
  namespace: flux-system
spec:
  type: "oci"
  url: oci://public.ecr.aws/karpenter
  interval: 30m
---
apiVersion: v1
kind: Namespace
metadata:
  name: karpenter
---
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: karpenter-crd
  namespace: karpenter
spec:
  interval: 5m
  chart:
    spec:
      chart: karpenter-crd
      version: ">=1.0.0 <2.0.0"
      sourceRef:
        kind: HelmRepository
        name: karpenter
        namespace: flux-system
  install:
    remediation:
      retries: 3
  values:
    webhook:
      enabled: true
      serviceName: karpenter
      serviceNamespace: karpenter
      port: 8443
---
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: karpenter
  namespace: karpenter
spec:
  interval: 5m
  chart:
    spec:
      chart: karpenter
      version: ">=1.0.0 <2.0.0"
      sourceRef:
        kind: HelmRepository
        name: karpenter
        namespace: flux-system
  install:
    remediation:
      retries: 3
  values:
    webhook:
      enabled: true
      port: 8443
    replicas: 2
    logLevel: debug
    controller:
      resources:
        requests:
          cpu: 1
          memory: 1Gi
        limits:
          cpu: 1
          memory: 1Gi
    settings:
      clusterName: "bench-cluster"
      interruptionQueue: "Karpenter-bench-cluster"
    serviceAccount:
      create: true
      annotations:
        eks.amazonaws.com/role-arn: "arn:aws:iam::<redacted>:role/KarpenterController-20240815204005347400000005"
I then have three NodePools (and associated EC2NodeClasses) that take different workloads, depending on which affinities and tolerations the launched pods carry to request where they go. The two NodePools that rely on normal compute instance types like C/M/R work very well, and Karpenter scales those node pools and serves their pods flawlessly!
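For context, routing a pod to one of those pools looks roughly like this (a minimal, hypothetical snippet; it assumes the tc-base pool labels and taints its nodes with example.com/taint-tc-base, mirroring the ep-base definitions below, and is not the exact workload spec):
# hypothetical pod spec fragment targeting the tc-base NodePool
tolerations:
  - key: example.com/taint-tc-base   # taint key assumed from the scheduling error output below
    operator: Exists
    effect: NoSchedule
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: example.com/taint-tc-base   # node label assumed from the NodePool template pattern
              operator: Exists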
However...
Observed Behavior:
The third NodePool is for workloads that require a G instance with NVIDIA compute to run. Simple enough, right? YAML:
(click to expand) Karpenter resource definition YAML
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: ep-nodeclass
spec:
  amiFamily: AL2
  role: "bench-main-ng-eks-node-group-20240620210345707900000001"
  subnetSelectorTerms:
    - tags:
        "karpenter.sh/discovery-bench-cluster": "true"
  securityGroupSelectorTerms:
    - tags:
        "karpenter.sh/discovery-bench-cluster": "true"
  amiSelectorTerms:
    # acquired from https://github.com/awslabs/amazon-eks-ami/releases
    - name: "amazon-eks-gpu-node-1.30-v*"
  kubelet:
    maxPods: 1
---
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: ep-base
spec:
  template:
    metadata:
      labels:
        example.com/taint-ep-base: "true"
      annotations:
        Env: "staging"
        Project: "autotest"
    spec:
      taints:
        - key: example.com/taint-ep-base
          effect: NoSchedule
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
        - key: kubernetes.io/os
          operator: In
          values: ["linux"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        # - key: node.kubernetes.io/instance-type
        #   operator: In
        #   values: ["g5.2xlarge", "g6.2xlarge"]
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: ["g5", "g6"]
        - key: karpenter.k8s.aws/instance-gpu-count
          operator: In
          values: ["1"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: ep-nodeclass
      expireAfter: 168h # 7 * 24h = 168h
  limits:
    cpu: 64
    memory: 256Gi
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 1m
The CPU and memory limits are set just as they are for the others, and leave plenty of room for the G instance specs based on the docs. This NodePool is defined identically to the other, functional NodePools, except for the G instance family requirements (notably the newer card offerings).
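As a rough headroom check (assuming the published g5.2xlarge/g6.2xlarge sizing of 8 vCPU and 32 GiB of memory each), these limits allow on the order of eight such nodes:
limits:
  cpu: 64       # 64 vCPU / 8 vCPU per g5.2xlarge or g6.2xlarge ≈ 8 nodes
  memory: 256Gi # 256 GiB / 32 GiB per g5.2xlarge or g6.2xlarge ≈ 8 nodes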
When Karpenter takes this in, and I launch a pod with the necessary Kubernetes specs:
# an Argo Workflow launches this pod
resources: # not required; I have tried this without this subset
  requests:
    nvidia.com/gpu: 1
  limits:
    nvidia.com/gpu: 1
affinity: # standard; works flawlessly to route pods to the tc-base and tc-heavy NodePools
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: "{{workflow.parameters.ep_pod_tolerance}}"
              operator: "Exists"
Karpenter validates it successfully and attempts to spin up a node to serve it... only to yield the following:
(click to expand) kubectl logs output of JSON, formatted
{
"level": "DEBUG",
"time": "2024-08-26T21:51:35.612Z",
"logger": "controller",
"caller": "scheduling/scheduler.go:220",
"message": "226 out of 801 instance types were excluded because they would breach limits",
"commit": "62a726c",
"controller": "provisioner",
"namespace": "",
"name": "",
"reconcileID": "0c544d27-9a71-4c1d-9c72-839aea9c9238",
"NodePool": {
"name": "ep-base"
}
}
{
"level": "ERROR",
"time": "2024-08-26T21:51:35.618Z",
"logger": "controller",
"caller": "provisioning/provisioner.go:355",
"message": "could not schedule pod",
"commit": "62a726c",
"controller": "provisioner",
"namespace": "",
"name": "",
"reconcileID": "0c544d27-9a71-4c1d-9c72-839aea9c9238",
"Pod": {
"name": "e2e-test-stage-kane-p7wck-edge-pipeline-pickle-2973982407",
"namespace": "argo"
},
"error": "incompatible with nodepool \"tc-heavy\", daemonset overhead={\"cpu\":\"380m\",\"memory\":\"376Mi\",\"pods\":\"4\"}, did not tolerate example.com/taint-tc-heavy=:NoSchedule; incompatible with nodepool \"tc-base\", daemonset overhead={\"cpu\":\"380m\",\"memory\":\"376Mi\",\"pods\":\"4\"}, did not tolerate example.com/taint-tc-base=:NoSchedule; incompatible with nodepool \"ep-base\", daemonset overhead={\"cpu\":\"380m\",\"memory\":\"376Mi\",\"pods\":\"4\"}, no instance type satisfied resources {\"cpu\":\"380m\",\"memory\":\"376Mi\",\"pods\":\"5\"} and requirements karpenter.sh/capacity-type In [on-demand spot], karpenter.sh/nodepool In [ep-base], kubernetes.io/arch In [amd64], kubernetes.io/os In [linux], node.kubernetes.io/instance-type In [g5.2xlarge g6.2xlarge], example.com/taint-ep-base In [true] (no instance type has enough resources)",
"errorCauses": [{
"error": "incompatible with nodepool \"tc-heavy\", daemonset overhead={\"cpu\":\"380m\",\"memory\":\"376Mi\",\"pods\":\"4\"}, did not tolerate example.com/taint-tc-heavy=:NoSchedule"
}, {
"error": "incompatible with nodepool \"tc-base\", daemonset overhead={\"cpu\":\"380m\",\"memory\":\"376Mi\",\"pods\":\"4\"}, did not tolerate example.com/taint-tc-base=:NoSchedule"
}, {
"error": "incompatible with nodepool \"ep-base\", daemonset overhead={\"cpu\":\"380m\",\"memory\":\"376Mi\",\"pods\":\"4\"}, no instance type satisfied resources {\"cpu\":\"380m\",\"memory\":\"376Mi\",\"pods\":\"5\"} and requirements karpenter.sh/capacity-type In [on-demand spot], karpenter.sh/nodepool In [ep-base], kubernetes.io/arch In [amd64], kubernetes.io/os In [linux], node.kubernetes.io/instance-family In [g5 g6], example.com/taint-ep-base In [true] (no instance type has enough resources)"
}
]
}
The scheduler does filter instance types against the available limit overhead -- but no matter what set of configs I try, the provisioner simply refuses to launch anything, without being more explicit about which resources are missing from the instance types it can see (even though the desired instance types easily satisfy the small resource requirements it reports).
Notes and things I have tirelessly tried to get around this:
- The EKS cluster and all associated infrastructure (managed by Terraform) is running in us-east-1.
- I made sure there was plenty of G instance quota headroom in the AWS account to support this operation.
- For the run above, the workload pod did not specify the nvidia.com/gpu resource requirement, to try and rule that out -- to no avail.
- I loosened the requirements to any G instance type and removed instance-gpu-count; observed the same issue (a sketch of that variant follows this list).
- I increased the resource limits to cpu: 1000 and memory: 1024Gi (and removed the nvidia.com/gpu limit I had at one point) and watched the filtered instance type count decrease; observed the same issue.
- I modified/generified the EC2NodeClass AMI selection to use the latest version of each of the three supported types (AL2, AL2023, Bottlerocket) one after another; observed the same issue.
- I drilled into the Karpenter source to understand where exactly this error is emitted, but did not modify it to gain visibility into which of the 801 instance types the controller had discarded from consideration.
- I made sure nvidia-k8s-device-plugin was provisioned and active at the latest version in the kube-system namespace (there is a manual EC2 G-instance node group in this cluster that is also used for active workloads as a live workaround).
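For reference, the loosened-requirements variant mentioned above looked roughly like this (a sketch using the well-known karpenter.k8s.aws/instance-category label; the exact values may have differed slightly):
requirements:
  - key: karpenter.k8s.aws/instance-category
    operator: In
    values: ["g"]   # any G instance type
  # karpenter.k8s.aws/instance-gpu-count requirement removed entirely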
Expected Behavior:
One of two things:
- The Karpenter controller provides observability into which instance types it filtered out during resource selection and why, so the error above can be disambiguated as a config error, an internal bug, or a cluster bug.
- The Karpenter controller selects a G instance type (like a nice g4dn/g5/g6.2xlarge), spins it up, then assigns the pod to the node and lets it work its magic.
Reproduction Steps (Please include YAML):
- Deploy the Karpenter resources YAML below to a clean Karpenter-enabled EKS cluster (latest: v1.0.1):
(click to expand) Karpenter resources YAML
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: ep-nodeclass
spec:
  amiFamily: AL2
  role: "bench-main-ng-eks-node-group-20240620210345707900000001"
  subnetSelectorTerms:
    - tags:
        "karpenter.sh/discovery-bench-cluster": "true"
  securityGroupSelectorTerms:
    - tags:
        "karpenter.sh/discovery-bench-cluster": "true"
  amiSelectorTerms:
    # acquired from https://github.com/awslabs/amazon-eks-ami/releases
    - name: "amazon-eks-gpu-node-1.30-v*"
  kubelet:
    maxPods: 1
---
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: ep-base
spec:
  template:
    metadata:
      labels:
        example.com/taint-ep-base: "true"
      annotations:
        Env: "staging"
        Project: "autotest"
    spec:
      taints:
        - key: example.com/taint-ep-base
          effect: NoSchedule
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
        - key: kubernetes.io/os
          operator: In
          values: ["linux"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        # - key: node.kubernetes.io/instance-type
        #   operator: In
        #   values: ["g5.2xlarge", "g6.2xlarge"]
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: ["g5", "g6"]
        - key: karpenter.k8s.aws/instance-gpu-count
          operator: In
          values: ["1"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: ep-nodeclass
      expireAfter: 168h # 7 * 24h = 168h
  limits:
    cpu: 64
    memory: 256Gi
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 1m
- Attempt to run any test pod with the following Kubernetes requirements (see also the toleration note after these steps):
(click to expand) YAML subset for requirements
resources:
  requests:
    nvidia.com/gpu: 1
  limits:
    nvidia.com/gpu: 1
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: example.com/taint-ep-base
              operator: "Exists"
- Observe Karpenter controller fail to provision a GPU node.
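A note on the test pod above: because the ep-base NodePool template applies the example.com/taint-ep-base:NoSchedule taint, the pod will presumably also need a matching toleration (not shown in the subset); a minimal sketch:
tolerations:
  - key: example.com/taint-ep-base
    operator: Exists
    effect: NoSchedule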
Versions:
- Chart Version: latest (">=1.0.0 <2.0.0")
- Kubernetes Version (kubectl version):
  Client Version: v1.29.2
  Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
  Server Version: v1.30.2-eks-db838b0
- Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
- Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
- If you are interested in working on this issue or have submitted a pull request, please leave a comment