k8s-device-plugin
Pod goes OutOfnvidia.com/gpu before k8s-device-plugin is ready
I have an issue when using cluster autoscaling for GPU nodes.
I am using Karpenter as the cluster autoscaler and I'm trying to deploy NVIDIA Riva. The deployment's pod spec has
resources:
  limits:
    nvidia.com/gpu: 1
...
tolerations:
- effect: NoExecute
  key: rivaOnly
  operator: Exists
nodeSelector:
  karpenter.sh/provisioner-name: gpu-riva
When I create the deployment, Karpenter provisions a new node almost instantly with the correct taint, and Kubernetes assigns the pod to the new node in a Pending status.
The problem is that at this point the node has no nvidia.com/gpu resources, as the k8s-device-plugin hasn't started yet. Approximately 90 seconds later the new node is finally up enough that the k8s-device-plugin daemonset launches a pod and discovers the GPU; however, by this time the Riva pod has changed status to OutOfnvidia.com/gpu and is stuck.
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 3m17s (x3 over 3m20s) default-scheduler 0/9 nodes are available: 1 Insufficient nvidia.com/gpu, 1 node(s) had taint {ocrOnly: true}, that the pod didn't tolerate, 1 node(s) had taint {tritonOnly: true}, that the pod didn't tolerate, 6 node(s) didn't match Pod's node affinity/selector.
Warning OutOfnvidia.com/gpu 2m5s kubelet Node didn't have enough resource: nvidia.com/gpu, requested: 1, used: 0, capacity: 0
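One way to see that capacity: 0 window directly is to check what the new node is advertising while the plugin daemonset is still coming up. A rough sketch (the kube-system namespace is an assumption based on the static manifest install; adjust if the plugin lives elsewhere):

# GPU capacity as the kubelet has reported it to the API server;
# the GPU column stays <none> until the device plugin has registered.
kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"

# Has the device plugin pod started on the new node yet?
kubectl get pods -n kube-system -o wide | grep nvidia-device-plugin

On the affected node the GPU column should only change from <none> once the plugin pod is Running, which matches the capacity: 0 in the event above.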
Kubernetes is eventually smart enough to launch a second Riva pod, which is able to use the recently provisioned machine, but that generally means a 90-300s delay before the pod actually starts initialising (which itself takes a long time, as Riva is huge). Sometimes, however, it causes Karpenter to provision another node, and the process repeats.
riva-riva-api-768b77d764-9dg2b 0/2 OutOfnvidia.com/gpu 0 3m13s
riva-riva-api-768b77d764-ft2hf 0/2 OutOfnvidia.com/gpu 0 115s
riva-riva-api-768b77d764-njq44 0/2 Init:0/2 0 54s
Is there a way to make the Riva deployment wait longer for the nvidia.com/gpu resources to become available?
Thanks.
This seems like a question more relevant for Karpenter or Riva than the device plugin.
Seems maybe related to https://stackoverflow.com/questions/68951748/pods-getting-scheduled-irrespective-of-insufficient-resources
Are you on Kubernetes v1.22?
I'm on v1.21 (specifically v1.21.5-eks-bc4871b)
I'm not sure it's Riva- or Karpenter-specific, as the node is being provisioned fine by Karpenter... and the Riva pod doesn't even start.
It seems to be some sort of race condition: there is a gap of a few seconds between the new node becoming Ready and the k8s-device-plugin launching and making the nvidia.com/gpu resources available to the kubelet.
During that time Kubernetes seems to decide that the lack of available resources is fatal, puts the pod into its stuck OutOfnvidia.com/gpu state, and launches a new one (which can trigger Karpenter to provision yet another node).
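The churn is easy to watch by sorting pods and nodes by creation time. A sketch, using the pod name prefix and provisioner label from this deployment (adjust for your own setup):

# Replacement Riva pods, oldest first; the OutOfnvidia.com/gpu ones are left behind.
kubectl get pods -o wide --sort-by=.metadata.creationTimestamp | grep riva-riva-api

# Nodes provisioned for the gpu-riva provisioner, oldest first.
kubectl get nodes -l karpenter.sh/provisioner-name=gpu-riva --sort-by=.metadata.creationTimestamp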
I'm not aware of anything that can be set on a deployment/pod spec to make it wait longer for the nvidia.com/gpu resource to become available; is there something I'm missing?
The strange thing is that the scheduler should not be scheduling the pod to the node until it sees that nvidia.com/gpu resources are present on it. It's like it somehow schedules them to the node (because it sees nvidia.com/gpu resources appear on the node from the perspective of the API server -- which the kubelet would be the one to write there after the plugin registered with it), but then once the pod lands on the node, the kubelet says "I don't have any nvidia.com/gpu resources, not sure why you got scheduled here --> OutOfnvidia.com/gpu".
This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.