kind-with-gpus-examples
Unable to start pods for the device plugin
Hi Kevin,
I am new to K8s, so please bear with me if my question is basic. I have successfully created a cluster and installed the device plugin, but the DaemonSet is not being scheduled. The details are below; could you help me understand why? Thank you!
$ kubectl --context=kind-${KIND_CLUSTER_NAME} get pod -n nvidia
No resources found in nvidia namespace.
$ kubectl describe daemonset nvidia-device-plugin -n nvidia --context=kind-${KIND_CLUSTER_NAME}
Name:           nvidia-device-plugin
Selector:       app.kubernetes.io/instance=nvidia-device-plugin,app.kubernetes.io/name=nvidia-device-plugin
Node-Selector:  <none>
Labels:         app.kubernetes.io/instance=nvidia-device-plugin
                app.kubernetes.io/managed-by=Helm
                app.kubernetes.io/name=nvidia-device-plugin
                app.kubernetes.io/version=0.15.0
                helm.sh/chart=nvidia-device-plugin-0.15.0
Annotations:    deprecated.daemonset.template.generation: 1
                meta.helm.sh/release-name: nvidia-device-plugin
                meta.helm.sh/release-namespace: nvidia
Desired Number of Nodes Scheduled: 0
Current Number of Nodes Scheduled: 0
Number of Nodes Scheduled with Up-to-date Pods: 0
Number of Nodes Scheduled with Available Pods: 0
Number of Nodes Misscheduled: 0
Pods Status:  0 Running / 0 Waiting / 0 Succeeded / 0 Failed
Pod Template:
  Labels:  app.kubernetes.io/instance=nvidia-device-plugin
           app.kubernetes.io/name=nvidia-device-plugin
  Containers:
   nvidia-device-plugin-ctr:
    Image:      nvcr.io/nvidia/k8s-device-plugin:v0.15.0
    Port:       <none>
    Host Port:  <none>
    Command:
      nvidia-device-plugin
    Environment:
      MPS_ROOT:                    /run/nvidia/mps
      NVIDIA_MIG_MONITOR_DEVICES:  all
      NVIDIA_VISIBLE_DEVICES:      all
      NVIDIA_DRIVER_CAPABILITIES:  compute,utility
    Mounts:
      /dev/shm from mps-shm (rw)
      /mps from mps-root (rw)
      /var/lib/kubelet/device-plugins from device-plugin (rw)
      /var/run/cdi from cdi-root (rw)
  Volumes:
   device-plugin:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/kubelet/device-plugins
    HostPathType:
   mps-root:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia/mps
    HostPathType:  DirectoryOrCreate
   mps-shm:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia/mps/shm
    HostPathType:
   cdi-root:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run/cdi
    HostPathType:  DirectoryOrCreate
  Priority Class Name:  system-node-critical
  Node-Selectors:       <none>
  Tolerations:          CriticalAddonsOnly op=Exists
                        nvidia.com/gpu:NoSchedule op=Exists
Events:                 <none>
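A desired count of 0 usually means no node satisfies the pods' scheduling constraints. Even though Node-Selectors is <none>, the chart may set a nodeAffinity requirement on the pod template. A diagnostic sketch (reusing the release name and namespace from the output above) to dump that affinity and compare it against the labels actually present on the nodes:

```shell
# Show the node-affinity rules the device plugin's pods require,
# which do not appear in the Node-Selectors field above
kubectl --context=kind-${KIND_CLUSTER_NAME} -n nvidia \
  get daemonset nvidia-device-plugin \
  -o jsonpath='{.spec.template.spec.affinity.nodeAffinity}'

# Compare against the labels actually present on the worker nodes
kubectl --context=kind-${KIND_CLUSTER_NAME} get nodes --show-labels
```

If the affinity requires a label (e.g. something like nvidia.com/gpu.present) that no node carries, the DaemonSet controller schedules zero pods without emitting any events.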
I have the same issue. Did you manage to solve the issue? If so, could you share how you did it?
EDIT:
I tried running
./nvkind cluster create --name test
and
./nvkind cluster create \
--config-template=examples/one-worker-per-gpu.yaml --name test
In both cases, I run into the same problem as @zhewenhu. Specifically, the first step from the README that fails is:
$ kubectl --context=kind-${KIND_CLUSTER_NAME} get pod -n nvidia
No resources found in nvidia namespace.
and:
$ kubectl --context=kind-${KIND_CLUSTER_NAME} get nodes -o json | jq -r '.items[] | select(.metadata.name | test("-worker[0-9]*$")) | {name: .metadata.name, "nvidia.com/gpu": .status.allocatable["nvidia.com/gpu"]}'
{
"name": "test-worker",
"nvidia.com/gpu": null
}
However, nvkind does see the GPU:
$ ./nvkind cluster print-gpus
[
{
"node": "test-worker",
"gpus": [
{
"Index": "0",
"Name": "Tesla T4",
"UUID": "GPU-74dd6868-3150-a556-2530-3cf451fa7603"
}
]
}
]
Hi @jdonkervliet, unfortunately I never figured it out, so I switched to minikube instead.
I've updated the examples so things should work now. The nodes needed an nvidia.com/gpu.present label for the plugin to be able to deploy to them.
https://github.com/klueska/nvkind/pull/8
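For reference, the fix amounts to attaching the label to the GPU worker nodes in the cluster config. A minimal sketch of what the labeled config looks like, using kind's v1alpha4 Cluster format (node roles here are illustrative, not taken from the actual PR):

```yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
  - role: worker
    labels:
      nvidia.com/gpu.present: "true"
```

With the label in place, the device plugin DaemonSet's node affinity matches the worker and the pod gets scheduled.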
I've updated the examples so things should work now. The nodes needed a nvidia.com/gpu.present label for the plugin to be able to deploy to them.
Should this change also affect https://github.com/NVIDIA/nvkind/blob/main/pkg/nvkind/default-config-template.yaml, then? The label is not applied if no explicit config is provided during cluster create.
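Until the default template carries the label, a possible workaround for an already-created cluster (untested sketch; the node name test-worker is taken from the output earlier in this thread) is to apply the label by hand:

```shell
# Add the label the device plugin's node affinity expects;
# the DaemonSet controller then schedules a pod onto the node
kubectl --context=kind-${KIND_CLUSTER_NAME} \
  label node test-worker nvidia.com/gpu.present=true
```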