
Can't get nvidia-smi to work in a pod

abstract-entity opened this issue 1 year ago · 4 comments

Summary

Hello, I'm trying to run microk8s with the GPU operator on a Dell G7-7790 laptop with an RTX 2060, running a fresh install of Ubuntu 22.04. I'm unable to access my GPU, and I don't see any errors.

I've tried many reinstalls of Ubuntu / NVIDIA drivers, container toolkit, CUDA / microk8s (with both the operator and auto drivers) / the GPU operator, without success.

What Should Happen Instead?

I expect to get nvidia-smi output inside the pod; maybe I'm missing something.

Reproduction Steps

I installed a fresh Ubuntu 22.04 with NVIDIA driver 545, the NVIDIA container toolkit and NVIDIA CUDA. Then I installed microk8s following this guide:

sudo snap install microk8s --classic --channel=1.29

Then I installed the GPU operator:

microk8s enable nvidia --driver operator (same result with auto)
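
To check that the device plugin actually advertised the GPU to the node, something like this should show a nvidia.com/gpu entry (the node name is the one from the describe output below):

$ microk8s kubectl describe node chaya-g7-7790 | grep -i "nvidia.com/gpu"
# expect a non-zero count under Capacity and Allocatable once the device plugin is up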

I try to run nvidia-smi with this pod:

microk8s kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: nvidia-smi
spec:
  restartPolicy: OnFailure
  containers:
    - name: nvidia-smi
      image: nvidia/cuda:12.2.2-devel-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
EOF

And I get this result:

chaya@chaya-G7-7790:~$ k describe po nvidia-smi
Name:             nvidia-smi
Namespace:        default
Priority:         0
Service Account:  default
Node:             chaya-g7-7790/192.168.50.20
Start Time:       Sun, 11 Feb 2024 23:14:08 +0100
Labels:           <none>
Annotations:      cni.projectcalico.org/containerID: c025c1310767322f48e48b0a36b4a31db2e80a3e821577866ab4805906d57cde
                  cni.projectcalico.org/podIP: 10.1.57.185/32
                  cni.projectcalico.org/podIPs: 10.1.57.185/32
Status:           Running
IP:               10.1.57.185
IPs:
  IP:  10.1.57.185
Containers:
  nvidia-smi:
    Container ID:  containerd://36d1dbb68c663529debb797bb4aaac1eb5d1ee113a6c346f5b62dd046ab8bb10
    Image:         nvidia/cuda:12.2.2-devel-ubuntu22.04
    Image ID:      docker.io/nvidia/cuda@sha256:ae8a022c02aec945c4f8c52f65deaf535de7abb58e840350d19391ec683f4980
    Port:          <none>
    Host Port:     <none>
    Command:
      nvidia-smi
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       StartError
      Message:      failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: exec: "nvidia-smi": executable file not found in $PATH: unknown
      Exit Code:    128
      Started:      Thu, 01 Jan 1970 01:00:00 +0100
      Finished:     Sun, 11 Feb 2024 23:17:16 +0100
    Ready:          False
    Restart Count:  5
    Limits:
      nvidia.com/gpu:  1
    Requests:
      nvidia.com/gpu:  1
    Environment:       <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-v2vct (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   True
  Initialized                 True
  Ready                       False
  ContainersReady             False
  PodScheduled                True
Volumes:
  kube-api-access-v2vct:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason     Age                    From               Message
  ----     ------     ----                   ----               -------
  Normal   Scheduled  3m49s                  default-scheduler  Successfully assigned default/nvidia-smi to chaya-g7-7790
  Normal   Pulled     2m15s (x5 over 3m47s)  kubelet            Container image "nvidia/cuda:12.2.2-devel-ubuntu22.04" already present on machine
  Normal   Created    2m15s (x5 over 3m47s)  kubelet            Created container nvidia-smi
  Warning  Failed     2m15s (x5 over 3m47s)  kubelet            Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: exec: "nvidia-smi": executable file not found in $PATH: unknown
  Warning  BackOff    2m2s (x10 over 3m46s)  kubelet            Back-off restarting failed container nvidia-smi in pod nvidia-smi_default(9bd20bd4-edda-4e5a-91d6-aced3878acea)

The PATH inside the pod is:

root@nvidia-smi:/# echo $PATH
/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin

The NVIDIA directories in the PATH don't exist in the pod:

root@nvidia-smi:/# ls -la /usr/local/
bin/       cuda/      cuda-12/   cuda-12.2/ etc/       games/     include/   lib/       man/       sbin/      share/     src/
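
One way to see which runtime handler the pod actually requested (just a quick check; an empty result means the pod fell back to the default runc handler rather than the nvidia one):

$ microk8s kubectl get pod nvidia-smi -o jsonpath='{.spec.runtimeClassName}'
# empty output: no runtimeClassName was set on the pod spec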

State of my GPU operator pods:

chaya@chaya-G7-7790:~$ k -n gpu-operator-resources get po
NAME                                                          READY   STATUS      RESTARTS   AGE
gpu-operator-node-feature-discovery-worker-zpsj8              1/1     Running     0          47m
nvidia-container-toolkit-daemonset-rvxbw                      1/1     Running     0          47m
gpu-operator-559f7cd69b-dhq22                                 1/1     Running     0          47m
gpu-operator-node-feature-discovery-master-5bfbc54c8d-mlzpt   1/1     Running     0          47m
nvidia-cuda-validator-4bkzt                                   0/1     Completed   0          47m
nvidia-dcgm-exporter-9jlvx                                    1/1     Running     0          47m
nvidia-device-plugin-daemonset-hnbng                          1/1     Running     0          47m
gpu-feature-discovery-6vkcw                                   1/1     Running     0          47m
nvidia-device-plugin-validator-q7kdh                          0/1     Completed   0          47m
nvidia-operator-validator-l9xmc                               1/1     Running     0          47m

When I run nvidia-smi on the host, I get this:

chaya@chaya-G7-7790:~$ nvidia-smi
Sun Feb 11 23:19:04 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.29.06              Driver Version: 545.29.06    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 2060        Off | 00000000:01:00.0 Off |                  N/A |
| N/A   59C    P0              16W /  80W |      6MiB /  6144MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      1386      G   /usr/lib/xorg/Xorg                            4MiB |
+---------------------------------------------------------------------------------------+

Introspection Report

chaya@chaya-G7-7790:~$ microk8s inspect
[sudo] password for chaya:
Inspecting system
WARNING:  The hostname of this server is 'chaya-G7-7790'.
Having uppercase letters in the hostname may cause issues with RBAC.
Consider changing the hostname to only have lowercase letters with:

    hostnamectl set-hostname chaya-g7-7790
Inspecting Certificates
Inspecting services
  Service snap.microk8s.daemon-cluster-agent is running
  Service snap.microk8s.daemon-containerd is running
  Service snap.microk8s.daemon-kubelite is running
  Service snap.microk8s.daemon-k8s-dqlite is running
  Service snap.microk8s.daemon-apiserver-kicker is running
  Copy service arguments to the final report tarball
Inspecting AppArmor configuration
Gathering system information
  Copy processes list to the final report tarball
  Copy disk usage information to the final report tarball
  Copy memory usage information to the final report tarball
  Copy server uptime to the final report tarball
  Copy openSSL information to the final report tarball
  Copy snap list to the final report tarball
  Copy VM name (or none) to the final report tarball
  Copy current linux distribution to the final report tarball
  Copy asnycio usage and limits to the final report tarball
  Copy inotify max_user_instances and max_user_watches to the final report tarball
  Copy network configuration to the final report tarball
Inspecting kubernetes cluster
  Inspect kubernetes cluster
Inspecting dqlite
  Inspect dqlite
cp: cannot stat '/var/snap/microk8s/6364/var/kubernetes/backend/localnode.yaml': No such file or directory

WARNING:  This machine's hostname contains capital letters and/or underscores.
          This is not a valid name for a Kubernetes node, causing node registration to fail.
          Please change the machine's hostname or refer to the documentation for more details:
          https://microk8s.io/docs/troubleshooting#heading--common-issues
WARNING:  Maximum number of inotify user watches is less than the recommended value of 1048576.
          Increase the limit with:
                 echo fs.inotify.max_user_watches=1048576 | sudo tee -a /etc/sysctl.conf
                 sudo sysctl --system
Building the report tarball
  Report tarball is at /var/snap/microk8s/6364/inspection-report-20240211_232750.tar.gz

Can you suggest a fix?

Nope.

Are you interested in contributing with a fix?

Ask me anything, I'll be glad to help.

abstract-entity · Feb 11 '24

Hi @abstract-entity

I ran into this exact same issue with a clean install of Ubuntu 22.04, NVIDIA driver 550 and microk8s 1.29, and I managed to get it working.

There are a few things I had to do. Firstly, it turns out I was missing runtimeClassName in my spec.

According to the NVIDIA gpu-operator documentation (the operator is what gets deployed when you run microk8s enable gpu), CONTAINERD_RUNTIME_CLASS defaults to nvidia; you also need to specify runtimeClassName in your pod spec so that the nvidia-container-runtime is actually used:

https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html

apiVersion: v1
kind: Pod
metadata:
  name: nvidia-smi
spec:
  runtimeClassName: nvidia
  restartPolicy: OnFailure
  containers:
    - name: nvidia-smi
      image: nvidia/cuda:12.2.2-devel-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
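
As a quick sanity check, assuming the operator was deployed with its defaults, you can confirm the nvidia RuntimeClass actually exists before pointing pods at it:

$ microk8s kubectl get runtimeclass
# expect an entry named "nvidia" backed by the "nvidia" handler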

Because I wanted to use the host drivers and runtime, I also had to update the containerd template TOML with the correct nvidia-container-runtime path:

$ microk8s disable gpu
$ vi /var/snap/microk8s/current/args/containerd-template.toml

Update these plugins:

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
  privileged_without_host_devices = false
  runtime_engine = ""
  runtime_root = ""
  runtime_type = "io.containerd.runc.v2"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
  BinaryName = "/usr/bin/nvidia-container-runtime"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia-container-runtime]
  runtime_type = "io.containerd.runc.v2"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia-container-runtime.options]
  BinaryName = "nvidia-container-runtime"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia-experimental]
  runtime_type = "io.containerd.runc.v2"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia-experimental.options]
  BinaryName = "/usr/bin/nvidia-container-runtime-experimental"
$ sudo snap restart microk8s
$ microk8s enable gpu --set toolkit.enabled=false

It's important to disable the toolkit with --set toolkit.enabled=false, because you want to use the host drivers and runtime, as explained here: https://microk8s.io/docs/addon-gpu
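
To double-check it worked on your side, something like this should do (assuming you saved the pod manifest above as nvidia-smi.yaml):

$ microk8s kubectl delete pod nvidia-smi --ignore-not-found
$ microk8s kubectl apply -f nvidia-smi.yaml    # the spec above, with runtimeClassName: nvidia
$ microk8s kubectl logs nvidia-smi
# should print the same driver/CUDA table as nvidia-smi on the host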

I hope this also works for you :)

Tiaanjw · Feb 27 '24

@Tiaanjw I used your template with runtimeClassName, which you had specified at the container level, and it didn't work for me. However, once I set it at the pod level, it worked. - ref

apiVersion: v1
kind: Pod
metadata:
  name: cuda-vector-add
  namespace: default
spec:
  runtimeClassName: nvidia
  restartPolicy: OnFailure
  containers:
    - name: cuda-vector-add
      image: "k8s.gcr.io/cuda-vector-add:v0.1"
      resources:
        limits:
          nvidia.com/gpu: 1
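
A quick way to confirm the sample really ran on the GPU (assuming the image behaves like the upstream cuda-vector-add example) is to read the pod logs:

$ microk8s kubectl logs cuda-vector-add
# the sample prints "Test PASSED" when the vector addition ran on the GPU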

pishangujeniya · Apr 04 '24

@pishangujeniya you are right! thanks, I have updated my solution!

Tiaanjw · Apr 04 '24

@abstract-entity Did the proposed solution work? If so, can you close this issue?

codespearhead · Apr 14 '24