kind-with-gpus-examples
Unable to start pods for the device plugin
Hi Kevin,
I am new to K8s, so please bear with me if my question is basic. I have successfully created a cluster and installed the device plugin, but the DaemonSet is not being scheduled. The details are below; could you help me understand why? Thank you!
$ kubectl --context=kind-${KIND_CLUSTER_NAME} get pod -n nvidia
No resources found in nvidia namespace.
$ kubectl describe daemonset nvidia-device-plugin -n nvidia --context=kind-${KIND_CLUSTER_NAME}
Name:           nvidia-device-plugin
Selector:       app.kubernetes.io/instance=nvidia-device-plugin,app.kubernetes.io/name=nvidia-device-plugin
Node-Selector:  <none>
Labels:         app.kubernetes.io/instance=nvidia-device-plugin
                app.kubernetes.io/managed-by=Helm
                app.kubernetes.io/name=nvidia-device-plugin
                app.kubernetes.io/version=0.15.0
                helm.sh/chart=nvidia-device-plugin-0.15.0
Annotations:    deprecated.daemonset.template.generation: 1
                meta.helm.sh/release-name: nvidia-device-plugin
                meta.helm.sh/release-namespace: nvidia
Desired Number of Nodes Scheduled: 0
Current Number of Nodes Scheduled: 0
Number of Nodes Scheduled with Up-to-date Pods: 0
Number of Nodes Scheduled with Available Pods: 0
Number of Nodes Misscheduled: 0
Pods Status:  0 Running / 0 Waiting / 0 Succeeded / 0 Failed
Pod Template:
  Labels:  app.kubernetes.io/instance=nvidia-device-plugin
           app.kubernetes.io/name=nvidia-device-plugin
  Containers:
   nvidia-device-plugin-ctr:
    Image:      nvcr.io/nvidia/k8s-device-plugin:v0.15.0
    Port:       <none>
    Host Port:  <none>
    Command:
      nvidia-device-plugin
    Environment:
      MPS_ROOT:                    /run/nvidia/mps
      NVIDIA_MIG_MONITOR_DEVICES:  all
      NVIDIA_VISIBLE_DEVICES:      all
      NVIDIA_DRIVER_CAPABILITIES:  compute,utility
    Mounts:
      /dev/shm from mps-shm (rw)
      /mps from mps-root (rw)
      /var/lib/kubelet/device-plugins from device-plugin (rw)
      /var/run/cdi from cdi-root (rw)
  Volumes:
   device-plugin:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/kubelet/device-plugins
    HostPathType:
   mps-root:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia/mps
    HostPathType:  DirectoryOrCreate
   mps-shm:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia/mps/shm
    HostPathType:
   cdi-root:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run/cdi
    HostPathType:  DirectoryOrCreate
  Priority Class Name:  system-node-critical
  Node-Selectors:       <none>
  Tolerations:          CriticalAddonsOnly op=Exists
                        nvidia.com/gpu:NoSchedule op=Exists
Events:                 <none>
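A desired count of 0 usually means no node satisfies the pods' scheduling constraints. Even though Node-Selectors is <none>, the chart may set a nodeAffinity requirement on the pod template. A diagnostic sketch (reusing the release name and namespace from the output above) to dump that affinity and compare it against the labels actually present on the nodes:

```shell
# Show the node-affinity rules the device plugin's pods require,
# which do not appear in the Node-Selectors field above
kubectl --context=kind-${KIND_CLUSTER_NAME} -n nvidia \
  get daemonset nvidia-device-plugin \
  -o jsonpath='{.spec.template.spec.affinity.nodeAffinity}'

# Compare against the labels actually present on the worker nodes
kubectl --context=kind-${KIND_CLUSTER_NAME} get nodes --show-labels
```

If the affinity requires a label (e.g. something like nvidia.com/gpu.present) that no node carries, the DaemonSet controller schedules zero pods without emitting any events.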
I have the same issue. Did you manage to solve the issue? If so, could you share how you did it?
EDIT:
I tried running
./nvkind cluster create --name test
and
./nvkind cluster create \
--config-template=examples/one-worker-per-gpu.yaml --name test
In both cases, I run into the same problem as @zhewenhu. Specifically, the first step from the README that fails is:
$ kubectl --context=kind-${KIND_CLUSTER_NAME} get pod -n nvidia
No resources found in nvidia namespace.
and:
$ kubectl --context=kind-${KIND_CLUSTER_NAME} get nodes -o json | jq -r '.items[] | select(.metadata.name | test("-worker[0-9]*$")) | {name: .metadata.name, "nvidia.com/gpu": .status.allocatable["nvidia.com/gpu"]}'
{
"name": "test-worker",
"nvidia.com/gpu": null
}
However, nvkind does see the GPU:
$ ./nvkind cluster print-gpus
[
{
"node": "test-worker",
"gpus": [
{
"Index": "0",
"Name": "Tesla T4",
"UUID": "GPU-74dd6868-3150-a556-2530-3cf451fa7603"
}
]
}
]
Hi @jdonkervliet, unfortunately I never figured it out, so I switched to minikube instead.
I've updated the examples so things should work now. The nodes needed an nvidia.com/gpu.present label for the plugin to be able to deploy to them.
https://github.com/klueska/nvkind/pull/8
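For reference, the fix amounts to attaching the label to the GPU worker nodes in the cluster config. A minimal sketch of what the labeled config looks like, using kind's v1alpha4 Cluster format (node roles here are illustrative, not taken from the actual PR):

```yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
  - role: worker
    labels:
      nvidia.com/gpu.present: "true"
```

With the label in place, the device plugin DaemonSet's node affinity matches the worker and the pod gets scheduled.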
I've updated the examples so things should work now. The nodes needed a nvidia.com/gpu.present label for the plugin to be able to deploy to them.
Should this change also affect https://github.com/NVIDIA/nvkind/blob/main/pkg/nvkind/default-config-template.yaml, then? The label is not applied if no explicit config is provided during cluster create.
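Until the default template carries the label, a possible workaround for an already-created cluster (untested sketch; the node name test-worker is taken from the output earlier in this thread) is to apply the label by hand:

```shell
# Add the label the device plugin's node affinity expects;
# the DaemonSet controller then schedules a pod onto the node
kubectl --context=kind-${KIND_CLUSTER_NAME} \
  label node test-worker nvidia.com/gpu.present=true
```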