
Pod goes OutOfnvidia.com/gpu before k8s-device-plugin is ready

Open · mattcamp opened this issue 3 years ago · 5 comments

I have an issue when using cluster autoscaling for GPU nodes.

I am using Karpenter as the cluster autoscaler and I'm trying to deploy NVIDIA Riva. The deployment's pod spec has:

resources:
  limits:
    nvidia.com/gpu: 1

...

tolerations:
  - effect: NoExecute
    key: rivaOnly
    operator: Exists
nodeSelector:
  karpenter.sh/provisioner-name: gpu-riva
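
For reference, a stripped-down sketch of how those fragments fit together in the Deployment (the image and the labels below are placeholders, not the real chart values):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: riva-riva-api
spec:
  replicas: 1
  selector:
    matchLabels:
      app: riva-api              # placeholder label
  template:
    metadata:
      labels:
        app: riva-api            # placeholder label
    spec:
      nodeSelector:
        karpenter.sh/provisioner-name: gpu-riva
      tolerations:
        - effect: NoExecute
          key: rivaOnly
          operator: Exists
      containers:
        - name: riva-api
          image: <riva-speech-image>   # placeholder image
          resources:
            limits:
              nvidia.com/gpu: 1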

When I create the deployment, Karpenter provisions a new node almost instantly with the correct taint, and Kubernetes assigns the pod to the new node while it is still Pending.

The problem is that at this point the node advertises no nvidia.com/gpu resources, because the k8s-device-plugin hasn't started yet. Roughly 90 seconds later the new node is finally up enough that the k8s-device-plugin daemonset launches a pod and discovers the GPU; by then, however, the Riva pod has changed status to OutOfnvidia.com/gpu and is stuck.

Events:
  Type     Reason               Age                    From               Message
  ----     ------               ----                   ----               -------
  Warning  FailedScheduling     3m17s (x3 over 3m20s)  default-scheduler  0/9 nodes are available: 1 Insufficient nvidia.com/gpu, 1 node(s) had taint {ocrOnly: true}, that the pod didn't tolerate, 1 node(s) had taint {tritonOnly: true}, that the pod didn't tolerate, 6 node(s) didn't match Pod's node affinity/selector.
  Warning  OutOfnvidia.com/gpu  2m5s                   kubelet            Node didn't have enough resource: nvidia.com/gpu, requested: 1, used: 0, capacity: 0

Kubernetes eventually launches a replacement Riva pod that is able to use the recently provisioned machine, but this generally means a 90-300 s delay before the pod actually starts initialising (which itself takes a long time, as Riva is huge). Sometimes, however, the replacement pod triggers Karpenter to provision yet another node, and the process repeats.

riva-riva-api-768b77d764-9dg2b   0/2     OutOfnvidia.com/gpu   0          3m13s
riva-riva-api-768b77d764-ft2hf   0/2     OutOfnvidia.com/gpu   0          115s
riva-riva-api-768b77d764-njq44   0/2     Init:0/2              0          54s

Is there a way to make the Riva deployment wait longer for the nvidia.com/gpu resources to become available?

Thanks.

mattcamp commented Feb 16 '22 09:02

This seems like a question more relevant for Karpenter or Riva than the device plugin.

klueska commented Feb 16 '22 10:02

Seems maybe related to https://stackoverflow.com/questions/68951748/pods-getting-scheduled-irrespective-of-insufficient-resources

Are you on Kubernetes v1.22?

klueska commented Feb 16 '22 10:02

> Seems maybe related to https://stackoverflow.com/questions/68951748/pods-getting-scheduled-irrespective-of-insufficient-resources
>
> Are you on Kubernetes v1.22?

I'm on v1.21 (specifically v1.21.5-eks-bc4871b)

I'm not sure it's Riva- or Karpenter-specific, as the node is being provisioned fine by Karpenter... and the Riva pod doesn't even start.

It seems to be some sort of race condition: there is a gap of a few seconds between the new node becoming Ready and the k8s-device-plugin launching and making the nvidia.com/gpu resources available to the kubelet.

During that window Kubernetes seems to decide that the lack of available resources is fatal, puts the pod into its stuck OutOfnvidia.com/gpu state, and launches a new one (which can trigger Karpenter to provision yet another node).

Is there anything that can be set on a deployment/pod spec to make it wait longer for the nvidia.com/gpu resource to become available? I'm not aware of one.

mattcamp commented Feb 16 '22 12:02

The strange thing is that the scheduler should not be scheduling the pod onto the node until it sees that nvidia.com/gpu resources are present on it. It's as if the scheduler places the pod there because it sees nvidia.com/gpu resources appear on the node from the perspective of the API server (which the kubelet would be the one to write, after the plugin registers with it), but then once the pod lands on the node, the kubelet says "I don't have any nvidia.com/gpu resources, not sure why you got scheduled here" --> OutOfnvidia.com/gpu.
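
To illustrate (the values below are made up), the extended resource only shows up in the Node object after the kubelet writes it, i.e. after the plugin has registered; before that the node looks like any non-GPU node to the scheduler:

status:
  capacity:
    cpu: "8"
    memory: 32Gi
    nvidia.com/gpu: "1"     # written by the kubelet only after the device plugin registers
  allocatable:
    cpu: 7910m
    memory: 31Gi
    nvidia.com/gpu: "1"

It might be worth watching the node with kubectl describe node or kubectl get node -o yaml to see exactly when the nvidia.com/gpu capacity appears relative to when the pod gets rejected.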

klueska commented Feb 16 '22 12:02

This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.

github-actions[bot] commented Feb 28 '24 04:02