k8s-device-plugin not deployed

Open j0hnL opened this issue 1 year ago • 2 comments

Describe the bug: When a k8s-manager does not have a GPU, Omnia will not deploy the k8s-device-plugin. We need to inspect the entire inventory for GPUs before deploying the plugin. I suggest we also taint or label any compute nodes that do not have GPUs, because NVIDIA's plugin does not check for GPU presence. The AMD plugin seems to deploy just fine whether there are AMD accelerators or not.
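A minimal sketch of the "inspect the entire inventory" idea, assuming a hypothetical inventory that maps each hostname to the accelerator vendors detected on it. The data structure, host names, and function names are illustrative only and are not taken from Omnia's code:

```python
# Hypothetical inventory: hostname -> accelerator vendors found on that host.
# The structure and the host names are assumptions made for illustration.
inventory = {
    "manager-0": [],           # the manager has no GPU
    "compute-0": ["nvidia"],
    "compute-1": [],
}


def hosts_with_vendor(inventory, vendor):
    """Return the sorted hostnames that report an accelerator from `vendor`."""
    return sorted(host for host, vendors in inventory.items() if vendor in vendors)


def should_deploy_plugin(inventory, vendor):
    """Deploy when any host in the inventory has the accelerator,
    rather than looking at the manager node alone."""
    return bool(hosts_with_vendor(inventory, vendor))


if __name__ == "__main__":
    for vendor in ("nvidia", "amd"):
        print(vendor, should_deploy_plugin(inventory, vendor))
```

With this shape, the decision no longer depends on which host happens to be the manager.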

j0hnL, Sep 06 '23 17:09

This is what I think:

Identify Nodes without GPUs: You need a mechanism to determine which compute nodes in your Kubernetes cluster do not have GPUs available. This can be done through manual inspection or automated scripts that query node specifications.
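One way to automate the identification step is to parse the output of kubectl get nodes -o json and look at each node's reported capacity. The sketch below assumes the extended resource name nvidia.com/gpu, which a node only advertises once a device plugin has registered on it, so on a cluster where the plugin was never deployed every node would show up as GPU-less and a hardware-level check would be needed instead. The sample document and node names are made up:

```python
import json

# Extended resource name advertised by NVIDIA's device plugin.
GPU_RESOURCE = "nvidia.com/gpu"


def nodes_without_gpu(node_list_json, resource=GPU_RESOURCE):
    """Return names of nodes whose capacity reports zero units of `resource`.

    `node_list_json` is the text produced by `kubectl get nodes -o json`.
    """
    missing = []
    for node in json.loads(node_list_json)["items"]:
        capacity = node.get("status", {}).get("capacity", {})
        # Capacity values are strings; a missing key means the resource is absent.
        if int(capacity.get(resource, "0")) == 0:
            missing.append(node["metadata"]["name"])
    return missing


# Made-up sample mimicking the relevant part of the kubectl output.
sample = json.dumps({
    "items": [
        {"metadata": {"name": "compute-0"},
         "status": {"capacity": {"cpu": "32", "nvidia.com/gpu": "2"}}},
        {"metadata": {"name": "compute-1"},
         "status": {"capacity": {"cpu": "32"}}},
    ]
})

if __name__ == "__main__":
    print(nodes_without_gpu(sample))
```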

Node Labeling: Once you identify nodes without GPUs, apply labels to them using kubectl label nodes <node-name> <label-key>=<label-value>. For example, you can label nodes without GPUs as gpu-enabled=false.
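The labeling step could be scripted by generating one kubectl label command per node. This sketch only builds the argument lists and does not run them; the node name is a placeholder, and --overwrite is added on the assumption that the script may be re-run against nodes that already carry the label:

```python
import shlex


def label_command(node, key="gpu-enabled", value="false"):
    """Build the kubectl argument list that sets `key=value` on `node`."""
    return ["kubectl", "label", "nodes", node, f"{key}={value}", "--overwrite"]


if __name__ == "__main__":
    # "compute-1" is a placeholder node name.
    print(shlex.join(label_command("compute-1")))
```

Returning an argument list rather than a string lets the caller pass it straight to subprocess.run without shell quoting concerns.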

Node Tainting: Apply taints to nodes without GPUs to repel workloads that require GPUs. A taint keeps any pod that lacks a matching toleration from being scheduled on the node. Use kubectl taint nodes <node-name> <key>=<value>:<effect> to apply taints. For instance, you can use a taint like gpu-accelerator=false:NoSchedule.
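A similar sketch for the tainting step, using the gpu-accelerator=false:NoSchedule example from this comment. It builds the command to add the taint and, with remove=True, the command to take it off again (kubectl removes a taint when the key and effect are followed by a trailing minus sign). The node name is a placeholder and nothing is executed:

```python
# Key, value, and effect follow the example taint named in the comment.
TAINT_KEY = "gpu-accelerator"
TAINT_VALUE = "false"
TAINT_EFFECT = "NoSchedule"


def taint_command(node, remove=False):
    """Build the kubectl argument list that adds or removes the taint on `node`."""
    if remove:
        # A trailing "-" after key:effect asks kubectl to remove the taint.
        spec = f"{TAINT_KEY}:{TAINT_EFFECT}-"
    else:
        spec = f"{TAINT_KEY}={TAINT_VALUE}:{TAINT_EFFECT}"
    return ["kubectl", "taint", "nodes", node, spec]


if __name__ == "__main__":
    # "compute-1" is a placeholder node name.
    print(" ".join(taint_command("compute-1")))
    print(" ".join(taint_command("compute-1", remove=True)))
```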

Configure Workloads: Ensure that GPU-dependent workloads have node selectors that consider GPU availability, and that workloads meant to run on nodes without GPUs tolerate the taints. For example, in the Pod specification of a workload that does not need a GPU, you might add tolerations for the taints applied to nodes without GPUs.
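To make the workload side concrete, here is a hedged sketch that builds Pod manifests as plain dictionaries. A pod without a matching toleration is never placed on a tainted node, so in this sketch the toleration goes on the workload that does not need a GPU, while the GPU workload relies on a node selector and a GPU resource limit. The gpu-enabled=true label for GPU nodes, the pod names, and the image are assumptions introduced for illustration:

```python
import json


def pod_manifest(name, image, needs_gpu):
    """Build a Pod manifest dict for a GPU or a non-GPU workload."""
    container = {"name": name, "image": image}
    spec = {"containers": [container]}
    if needs_gpu:
        # Assumes GPU nodes carry the hypothetical label gpu-enabled=true.
        spec["nodeSelector"] = {"gpu-enabled": "true"}
        container["resources"] = {"limits": {"nvidia.com/gpu": 1}}
    else:
        # Tolerate the example taint so the pod may land on nodes without GPUs.
        spec["tolerations"] = [{
            "key": "gpu-accelerator",
            "operator": "Equal",
            "value": "false",
            "effect": "NoSchedule",
        }]
    return {"apiVersion": "v1", "kind": "Pod",
            "metadata": {"name": name}, "spec": spec}


if __name__ == "__main__":
    # kubectl accepts JSON manifests, so the output could be piped to
    # `kubectl apply -f -`. The image name is a placeholder.
    manifest = pod_manifest("cpu-job", "example.com/app:latest", needs_gpu=False)
    print(json.dumps(manifest, indent=2))
```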

naresh3774, Jan 29 '24 02:01

This issue is fixed by PR #2238.

@sujit-jadhav @j0hnL can we close this issue?

abhishek-sa1, May 08 '24 11:05