Node doesn't expose GPU resource on g4dn.[n]xlarge
Image I'm using:
System Info:
- Kernel Version: 5.15.160
- OS Image: Bottlerocket OS 1.20.3 (aws-k8s-1.26-nvidia)
- Operating System: linux
- Architecture: amd64
- Container Runtime Version: containerd://1.6.31+bottlerocket
- Kubelet Version: v1.26.14-eks-b063426
- Kube-Proxy Version: v1.26.14-eks-b063426
What I expected to happen:
100% of the time, when I start a Bottlerocket OS 1.20.3 (aws-k8s-1.26-nvidia) ami-09469fd78070eaac6 node on a g4dn.[n]xlarge instance type in EKS, it should expose the GPU count for pods:
Capacity:
...
nvidia.com/gpu: 1
...
Allocatable:
...
nvidia.com/gpu: 1
...
What actually happened:
~5% of the time, when I start a Bottlerocket OS 1.20.3 (aws-k8s-1.26-nvidia) ami-09469fd78070eaac6 node on a g4dn.[n]xlarge instance type in EKS, it does not expose the GPU count for pods, causing pods that request nvidia.com/gpu: 1 to stay in Pending state waiting for a node:
Capacity:
cpu: 8
ephemeral-storage: 61904460Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 32366612Ki
pods: 29
Allocatable:
cpu: 7910m
ephemeral-storage: 55977408418
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 31676436Ki
pods: 29
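A quick way to spot affected nodes (a sketch; the nodepool label value matches the configs below):
kubectl get nodes -l karpenter.sh/nodepool=random-name -o custom-columns='NODE:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
Affected nodes show <none> in the GPU column instead of 1.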
How to reproduce the problem:
Note: This issue has existed for more than a year; you can see the slack thread here.
Current settings:
- EKS K8s v1.26
- Karpenter Autoscaler v0.33.5
- Karpenter Nodepool:
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: random-name
spec:
  disruption:
    consolidationPolicy: WhenUnderutilized
    expireAfter: 720h
  limits:
    cpu: 1000
  template:
    metadata:
      labels:
        company.ai/node: random-name
    spec:
      nodeClassRef:
        apiVersion: karpenter.k8s.aws/v1beta1
        kind: EC2NodeClass
        name: random-name
      requirements:
        - key: node.kubernetes.io/instance-type
          operator: In
          values:
            - g4dn.xlarge
            - g4dn.2xlarge
            - g4dn.4xlarge
        - key: topology.kubernetes.io/zone
          operator: In
          values:
            - eu-central-1a
            - eu-central-1b
            - eu-central-1c
        - key: kubernetes.io/arch
          operator: In
          values:
            - amd64
        - key: karpenter.sh/capacity-type
          operator: In
          values:
            - spot
            - on-demand
      taints:
        - effect: NoSchedule
          key: nvidia.com/gpu
status:
  resources:
    cpu: "8"
    ephemeral-storage: 61904460Ki
    memory: 32366612Ki
    nvidia.com/gpu: "1"
    pods: "29"
    vpc.amazonaws.com/pod-eni: "39"
- Karpenter EC2NodeClass:
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: random-name
spec:
  amiFamily: Bottlerocket
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        deleteOnTermination: true
        volumeSize: 4Gi
        volumeType: gp3
    - deviceName: /dev/xvdb
      ebs:
        deleteOnTermination: true
        iops: 3000
        snapshotID: snap-d4758cc7f5f11
        throughput: 500
        volumeSize: 60Gi
        volumeType: gp3
  metadataOptions:
    httpEndpoint: enabled
    httpProtocolIPv6: disabled
    httpPutResponseHopLimit: 2
    httpTokens: required
  role: KarpenterNodeRole-prod
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: prod
  subnetSelectorTerms:
    - tags:
        Name: '*Private*'
        karpenter.sh/discovery: prod
  tags:
    nodepool: random-name
    purpose: prod
    vendor: random-name
status:
  amis:
    - id: ami-0a3eb13c0c420309b
      name: bottlerocket-aws-k8s-1.26-nvidia-aarch64-v1.20.3-5d9ac849
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values:
            - arm64
        - key: karpenter.k8s.aws/instance-gpu-count
          operator: Exists
    - id: ami-0a3eb13c0c420309b
      name: bottlerocket-aws-k8s-1.26-nvidia-aarch64-v1.20.3-5d9ac849
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values:
            - arm64
        - key: karpenter.k8s.aws/instance-accelerator-count
          operator: Exists
    - id: ami-0e68f27f62340664d
      name: bottlerocket-aws-k8s-1.26-aarch64-v1.20.3-5d9ac849
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values:
            - arm64
        - key: karpenter.k8s.aws/instance-gpu-count
          operator: DoesNotExist
        - key: karpenter.k8s.aws/instance-accelerator-count
          operator: DoesNotExist
    - id: ami-09469fd78070eaac6
      name: bottlerocket-aws-k8s-1.26-nvidia-x86_64-v1.20.3-5d9ac849
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values:
            - amd64
        - key: karpenter.k8s.aws/instance-gpu-count
          operator: Exists
    - id: ami-09469fd78070eaac6
      name: bottlerocket-aws-k8s-1.26-nvidia-x86_64-v1.20.3-5d9ac849
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values:
            - amd64
        - key: karpenter.k8s.aws/instance-accelerator-count
          operator: Exists
    - id: ami-096d4acd33c9e9449
      name: bottlerocket-aws-k8s-1.26-x86_64-v1.20.3-5d9ac849
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values:
            - amd64
        - key: karpenter.k8s.aws/instance-gpu-count
          operator: DoesNotExist
        - key: karpenter.k8s.aws/instance-accelerator-count
          operator: DoesNotExist
  instanceProfile: prod_29098703412147
  securityGroups:
    - id: sg-71ed7b6a7c7
      name: eks-cluster-sg-prod-684080
  subnets:
    - id: subnet-aa204f6e57f07
      zone: eu-central-1a
    - id: subnet-579874f746c1b
      zone: eu-central-1c
    - id: subnet-07ce8a8349377
      zone: eu-central-1b
- Deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: deploy-gpu
spec:
  template:
    spec:
      containers:
        - ...
          resources:
            limits:
              nvidia.com/gpu: "1"
            requests:
              nvidia.com/gpu: "1"
      nodeSelector:
        company.ai/node: random-name
      tolerations:
        - effect: NoSchedule
          key: nvidia.com/gpu
          operator: Exists
...
- Karpenter managed process:
  - The Deployment is deployed in the cluster.
  - There are no nodes that can fulfill its requirements of resources, node labels and tolerations.
  - Karpenter detects it can fulfill the requirements with the described NodePool and EC2NodeClass.
  - Karpenter creates a new node of the pool karpenter.sh/nodepool=random-name.
  - The started node is healthy, but the scheduler can't schedule the pod on it because the node is not exposing the GPU (a cross-check is sketched after this list):
    '0/16 nodes are available: 1 Insufficient nvidia.com/gpu, 10 node(s) didn''t match Pod''s node affinity/selector, 5 node(s) had untolerated taint {deepc-cpu: }. preemption: 0/16 nodes are available: 1 No preemption victims found for incoming pod, 15 Preemption is not helpful for scheduling..'
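One way to cross-check what Karpenter expected against what the node actually registered (a sketch; it assumes the Karpenter v1beta1 NodeClaim resource is installed and that its status carries capacity, as the NodePool status above does):
kubectl get nodeclaims -o custom-columns='NAME:.metadata.name,NODE:.status.nodeName,GPU:.status.capacity.nvidia\.com/gpu'
kubectl get nodes -o custom-columns='NAME:.metadata.name,GPU:.status.capacity.nvidia\.com/gpu'
For an affected node, the NodeClaim side reports nvidia.com/gpu: 1 while the Node side reports <none>.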
- Node created:
kubectl describe no ip-192-168-164-242.eu-central-1.compute.internal
Name: ip-192-168-164-242.eu-central-1.compute.internal
Roles: <none>
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/instance-type=g4dn.2xlarge
beta.kubernetes.io/os=linux
company.ai/node=random-name
failure-domain.beta.kubernetes.io/region=eu-central-1
failure-domain.beta.kubernetes.io/zone=eu-central-1b
k8s.io/cloud-provider-aws=e7b76f679c563363cec5c6d5c3
karpenter.k8s.aws/instance-category=g
karpenter.k8s.aws/instance-cpu=8
karpenter.k8s.aws/instance-encryption-in-transit-supported=true
karpenter.k8s.aws/instance-family=g4dn
karpenter.k8s.aws/instance-generation=4
karpenter.k8s.aws/instance-gpu-count=1
karpenter.k8s.aws/instance-gpu-manufacturer=nvidia
karpenter.k8s.aws/instance-gpu-memory=16384
karpenter.k8s.aws/instance-gpu-name=t4
karpenter.k8s.aws/instance-hypervisor=nitro
karpenter.k8s.aws/instance-local-nvme=225
karpenter.k8s.aws/instance-memory=32768
karpenter.k8s.aws/instance-network-bandwidth=10000
karpenter.k8s.aws/instance-size=2xlarge
karpenter.sh/capacity-type=spot
karpenter.sh/nodepool=random-name
karpenter.sh/registered=true
kubernetes.io/arch=amd64
kubernetes.io/hostname=ip-192-168-164-242.eu-central-1.compute.internal
kubernetes.io/os=linux
node.kubernetes.io/instance-type=g4dn.2xlarge
topology.ebs.csi.aws.com/zone=eu-central-1b
topology.kubernetes.io/region=eu-central-1
topology.kubernetes.io/zone=eu-central-1b
Annotations: alpha.kubernetes.io/provided-node-ip: 192.168.164.242
csi.volume.kubernetes.io/nodeid:
{"csi.tigera.io":"ip-192-168-164-242.eu-central-1.compute.internal","ebs.csi.aws.com":"i-0fd5f7a7969d63c9d"}
karpenter.k8s.aws/ec2nodeclass-hash: 15616957348189460630
karpenter.k8s.aws/ec2nodeclass-hash-version: v1
karpenter.sh/nodepool-hash: 14407783392627717656
karpenter.sh/nodepool-hash-version: v1
node.alpha.kubernetes.io/ttl: 0
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Mon, 08 Jul 2024 08:57:11 -0500
Taints: nvidia.com/gpu:NoSchedule
Unschedulable: false
Lease:
HolderIdentity: ip-192-168-164-242.eu-central-1.compute.internal
AcquireTime: <unset>
RenewTime: Fri, 12 Jul 2024 13:43:42 -0500
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
MemoryPressure False Fri, 12 Jul 2024 13:39:08 -0500 Mon, 08 Jul 2024 08:57:11 -0500 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Fri, 12 Jul 2024 13:39:08 -0500 Mon, 08 Jul 2024 08:57:11 -0500 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Fri, 12 Jul 2024 13:39:08 -0500 Mon, 08 Jul 2024 08:57:11 -0500 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Fri, 12 Jul 2024 13:39:08 -0500 Mon, 08 Jul 2024 08:57:19 -0500 KubeletReady kubelet is posting ready status
Addresses:
InternalIP: 192.168.164.242
InternalDNS: ip-192-168-164-242.eu-central-1.compute.internal
Hostname: ip-192-168-164-242.eu-central-1.compute.internal
Capacity:
cpu: 8
ephemeral-storage: 61904460Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 32366612Ki
pods: 29
Allocatable:
cpu: 7910m
ephemeral-storage: 55977408418
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 31676436Ki
pods: 29
System Info:
Machine ID: ec2a5800025dbe346ecc517c4de3
System UUID: ec2a58-0002-5dbe-346e-cc517c4de3
Boot ID: c25bd0fd-67f0-4681-946a-64f8c5c57878
Kernel Version: 5.15.160
OS Image: Bottlerocket OS 1.20.3 (aws-k8s-1.26-nvidia)
Operating System: linux
Architecture: amd64
Container Runtime Version: containerd://1.6.31+bottlerocket
Kubelet Version: v1.26.14-eks-b063426
Kube-Proxy Version: v1.26.14-eks-b063426
ProviderID: aws:///eu-central-1b/i-d5f7a7969d63c
Non-terminated Pods: (8 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits Age
--------- ---- ------------ ---------- --------------- ------------- ---
calico-system calico-node-rlbvk 0 (0%) 0 (0%) 0 (0%) 0 (0%) 4d4h
calico-system csi-node-driver-x5jn7 0 (0%) 0 (0%) 0 (0%) 0 (0%) 4d4h
kube-system aws-node-txk8m 50m (0%) 0 (0%) 0 (0%) 0 (0%) 4d4h
kube-system ebs-csi-node-b8ln5 30m (0%) 0 (0%) 120Mi (0%) 768Mi (2%) 4d4h
kube-system kube-proxy-zkn9m 100m (1%) 0 (0%) 0 (0%) 0 (0%) 4d4h
loki loki-promtail-6644l 0 (0%) 0 (0%) 0 (0%) 0 (0%) 4d4h
monitoring monitoring-prometheus-node-exporter-fx749 0 (0%) 0 (0%) 0 (0%) 0 (0%) 4d4h
tigera-operator tigera-operator-7b594b484b-rkn5g 0 (0%) 0 (0%) 0 (0%) 0 (0%) 3d21h
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 180m (2%) 0 (0%)
memory 120Mi (0%) 768Mi (2%)
ephemeral-storage 0 (0%) 0 (0%)
hugepages-1Gi 0 (0%) 0 (0%)
hugepages-2Mi 0 (0%) 0 (0%)
Events: <none>
- As you can see, the created node fulfills the requirements of node labels and tolerations, but not the resources (gpu).
- Inspecting the node:
Using the session manager -> admin-container -> sheltie
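Roughly, the steps to get that host shell are (a sketch; the instance ID is a placeholder, and it assumes SSM access with the Bottlerocket control and admin containers enabled):
aws ssm start-session --target <instance-id>   # lands in the control container
enter-admin-container                          # from the control container
sudo sheltie                                   # root shell on the host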
bash-5.1# lsmod | grep nvidia
nvidia_uvm 1454080 0
nvidia_modeset 1265664 0
nvidia 56004608 2 nvidia_uvm,nvidia_modeset
drm 626688 1 nvidia
backlight 24576 2 drm,nvidia_modeset
i2c_core 102400 2 nvidia,drm
bash-5.1# systemctl list-unit-files | grep nvidia
nvidia-fabricmanager.service enabled enabled
nvidia-k8s-device-plugin.service enabled enabled
bash-5.1# journalctl -b -u nvidia-k8s-device-plugin
-- No entries --
bash-5.1# journalctl -b -u nvidia-fabricmanager.service
-- No entries --
bash-5.1# journalctl --list-boots
IDX BOOT ID FIRST ENTRY LAST ENTRY
0 c25bd0fd67f04681946a64f8c5c57878 Tue 2024-07-09 22:06:03 UTC Fri 2024-07-12 22:02:13 UTC
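From the same host shell, two more checks that could narrow this down (a sketch; standard kubelet device-plugin paths assumed):
systemctl status nvidia-k8s-device-plugin --no-pager
ls -l /var/lib/kubelet/device-plugins/
If the plugin never registered, that directory would contain only kubelet.sock and no NVIDIA socket, which would match the empty journal above.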
From the slack thread, someone suggested this:
Grasping at straws, but I wonder if this is some sort of initialization race condition where the kubelet service starts before the NVIDIA device is ready.
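If that hypothesis is right, comparing unit start times and ordering on an affected node might confirm it (a sketch; unit names taken from the systemctl listing above):
systemctl show -p ActiveEnterTimestamp kubelet.service nvidia-k8s-device-plugin.service
systemctl cat nvidia-k8s-device-plugin.service
The second command prints the unit file, including any After=/Wants= ordering constraints it declares.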