Can't get nvidia-smi to work in a pod
Summary
Hello, I'm trying to run microk8s with the GPU operator on a Dell G7-7790 laptop with an RTX 2060, running a fresh install of Ubuntu 22.04. I'm unable to access my GPU from a pod, and I don't see any errors.
I've tried many reinstalls of Ubuntu / NVIDIA drivers, container toolkit, CUDA / microk8s (with both the operator and auto driver options) / GPU operator, without success.
What Should Happen Instead?
I expect to get nvidia-smi output inside the pod; maybe I'm missing something.
Reproduction Steps
I've installed a fresh Ubuntu 22.04 with NVIDIA driver 545, the NVIDIA container toolkit and NVIDIA CUDA. Then I've installed microk8s following this guide:
sudo snap install microk8s --classic --channel=1.29
After that I installed the GPU operator:
microk8s enable nvidia --driver operator (same result with auto)
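To watch the operator components come up before testing, the pods in the gpu-operator-resources namespace can be listed, for example:
$ microk8s kubectl -n gpu-operator-resources get pods -w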
I try to run nvidia-smi with this pod:
microk8s kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: nvidia-smi
spec:
  restartPolicy: OnFailure
  containers:
    - name: nvidia-smi
      image: nvidia/cuda:12.2.2-devel-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
EOF
And I get this result:
chaya@chaya-G7-7790:~$ k describe po nvidia-smi
Name:             nvidia-smi
Namespace:        default
Priority:         0
Service Account:  default
Node:             chaya-g7-7790/192.168.50.20
Start Time:       Sun, 11 Feb 2024 23:14:08 +0100
Labels:           <none>
Annotations:      cni.projectcalico.org/containerID: c025c1310767322f48e48b0a36b4a31db2e80a3e821577866ab4805906d57cde
                  cni.projectcalico.org/podIP: 10.1.57.185/32
                  cni.projectcalico.org/podIPs: 10.1.57.185/32
Status:           Running
IP:               10.1.57.185
IPs:
  IP:  10.1.57.185
Containers:
  nvidia-smi:
    Container ID:  containerd://36d1dbb68c663529debb797bb4aaac1eb5d1ee113a6c346f5b62dd046ab8bb10
    Image:         nvidia/cuda:12.2.2-devel-ubuntu22.04
    Image ID:      docker.io/nvidia/cuda@sha256:ae8a022c02aec945c4f8c52f65deaf535de7abb58e840350d19391ec683f4980
    Port:          <none>
    Host Port:     <none>
    Command:
      nvidia-smi
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       StartError
      Message:      failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: exec: "nvidia-smi": executable file not found in $PATH: unknown
      Exit Code:    128
      Started:      Thu, 01 Jan 1970 01:00:00 +0100
      Finished:     Sun, 11 Feb 2024 23:17:16 +0100
    Ready:          False
    Restart Count:  5
    Limits:
      nvidia.com/gpu:  1
    Requests:
      nvidia.com/gpu:  1
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-v2vct (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   True
  Initialized                 True
  Ready                       False
  ContainersReady             False
  PodScheduled                True
Volumes:
  kube-api-access-v2vct:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason     Age                    From               Message
  ----     ------     ----                   ----               -------
  Normal   Scheduled  3m49s                  default-scheduler  Successfully assigned default/nvidia-smi to chaya-g7-7790
  Normal   Pulled     2m15s (x5 over 3m47s)  kubelet            Container image "nvidia/cuda:12.2.2-devel-ubuntu22.04" already present on machine
  Normal   Created    2m15s (x5 over 3m47s)  kubelet            Created container nvidia-smi
  Warning  Failed     2m15s (x5 over 3m47s)  kubelet            Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: exec: "nvidia-smi": executable file not found in $PATH: unknown
  Warning  BackOff    2m2s (x10 over 3m46s)  kubelet            Back-off restarting failed container nvidia-smi in pod nvidia-smi_default(9bd20bd4-edda-4e5a-91d6-aced3878acea)
The PATH in the pod is:
root@nvidia-smi:/# echo $PATH
/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
The /usr/local/nvidia directory referenced in the PATH doesn't exist in the pod:
root@nvidia-smi:/# ls -la /usr/local/
bin/ cuda/ cuda-12/ cuda-12.2/ etc/ games/ include/ lib/ man/ sbin/ share/ src/
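To get a shell like this while the nvidia-smi pod is crash-looping, one option is a throwaway pod that just sleeps (the nvidia-debug name below is only an example):
$ microk8s kubectl run nvidia-debug --image=nvidia/cuda:12.2.2-devel-ubuntu22.04 --restart=Never -- sleep infinity
$ microk8s kubectl exec -it nvidia-debug -- bash -c 'echo $PATH; ls /usr/local/'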
State of my GPU operator pods:
chaya@chaya-G7-7790:~$ k -n gpu-operator-resources get po
NAME READY STATUS RESTARTS AGE
gpu-operator-node-feature-discovery-worker-zpsj8 1/1 Running 0 47m
nvidia-container-toolkit-daemonset-rvxbw 1/1 Running 0 47m
gpu-operator-559f7cd69b-dhq22 1/1 Running 0 47m
gpu-operator-node-feature-discovery-master-5bfbc54c8d-mlzpt 1/1 Running 0 47m
nvidia-cuda-validator-4bkzt 0/1 Completed 0 47m
nvidia-dcgm-exporter-9jlvx 1/1 Running 0 47m
nvidia-device-plugin-daemonset-hnbng 1/1 Running 0 47m
gpu-feature-discovery-6vkcw 1/1 Running 0 47m
nvidia-device-plugin-validator-q7kdh 0/1 Completed 0 47m
nvidia-operator-validator-l9xmc 1/1 Running 0 47m
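Since the pod did get scheduled with an nvidia.com/gpu limit, the device plugin is presumably advertising the GPU on the node; that can be double-checked with, for example:
$ microk8s kubectl describe node chaya-g7-7790 | grep -i nvidia.com/gpu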
When I run nvidia-smi on the host I get this output:
chaya@chaya-G7-7790:~$ nvidia-smi
Sun Feb 11 23:19:04 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.29.06 Driver Version: 545.29.06 CUDA Version: 12.3 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 2060 Off | 00000000:01:00.0 Off | N/A |
| N/A 59C P0 16W / 80W | 6MiB / 6144MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 1386 G /usr/lib/xorg/Xorg 4MiB |
+---------------------------------------------------------------------------------------+
Introspection Report
chaya@chaya-G7-7790:~$ microk8s inspect
[sudo] password for chaya:
Inspecting system
WARNING: The hostname of this server is 'chaya-G7-7790'.
Having uppercase letters in the hostname may cause issues with RBAC.
Consider changing the hostname to only have lowercase letters with:
hostnamectl set-hostname chaya-g7-7790
Inspecting Certificates
Inspecting services
Service snap.microk8s.daemon-cluster-agent is running
Service snap.microk8s.daemon-containerd is running
Service snap.microk8s.daemon-kubelite is running
Service snap.microk8s.daemon-k8s-dqlite is running
Service snap.microk8s.daemon-apiserver-kicker is running
Copy service arguments to the final report tarball
Inspecting AppArmor configuration
Gathering system information
Copy processes list to the final report tarball
Copy disk usage information to the final report tarball
Copy memory usage information to the final report tarball
Copy server uptime to the final report tarball
Copy openSSL information to the final report tarball
Copy snap list to the final report tarball
Copy VM name (or none) to the final report tarball
Copy current linux distribution to the final report tarball
Copy asnycio usage and limits to the final report tarball
Copy inotify max_user_instances and max_user_watches to the final report tarball
Copy network configuration to the final report tarball
Inspecting kubernetes cluster
Inspect kubernetes cluster
Inspecting dqlite
Inspect dqlite
cp: cannot stat '/var/snap/microk8s/6364/var/kubernetes/backend/localnode.yaml': No such file or directory
WARNING: This machine's hostname contains capital letters and/or underscores.
This is not a valid name for a Kubernetes node, causing node registration to fail.
Please change the machine's hostname or refer to the documentation for more details:
https://microk8s.io/docs/troubleshooting#heading--common-issues
WARNING: Maximum number of inotify user watches is less than the recommended value of 1048576.
Increase the limit with:
echo fs.inotify.max_user_watches=1048576 | sudo tee -a /etc/sysctl.conf
sudo sysctl --system
Building the report tarball
Report tarball is at /var/snap/microk8s/6364/inspection-report-20240211_232750.tar.gz
Can you suggest a fix?
Nope.
Are you interested in contributing with a fix?
Ask me anything, I'll be glad to help.
Hi @abstract-entity
I managed to get it working. I also ran into this exact same issue with a clean install of Ubuntu 22.04, NVIDIA driver 550, microk8s 1.29.
There are a few things I had to do. Firstly, it turns out I was missing runtimeClassName in my spec. According to the NVIDIA gpu-operator documentation (the operator gets deployed when you run microk8s enable gpu), CONTAINERD_RUNTIME_CLASS defaults to nvidia, so you also need to specify runtimeClassName: nvidia in your pod spec for the nvidia-container-runtime to be injected:
https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html
apiVersion: v1
kind: Pod
metadata:
  name: nvidia-smi
spec:
  runtimeClassName: nvidia
  restartPolicy: OnFailure
  containers:
    - name: nvidia-smi
      image: nvidia/cuda:12.2.2-devel-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
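Assuming the spec above is saved as nvidia-smi.yaml (just an example filename), applying it and checking the logs should now show the same nvidia-smi table as on the host:
$ microk8s kubectl apply -f nvidia-smi.yaml
$ microk8s kubectl logs nvidia-smi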
Because I wanted to use the host drivers and runtime, I also had to update the containerd template with the correct nvidia-container-runtime path:
$ microk8s disable gpu
$ vi /var/snap/microk8s/current/args/containerd-template.toml
Update these plugins:
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
privileged_without_host_devices = false
runtime_engine = ""
runtime_root = ""
runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
BinaryName = "/usr/bin/nvidia-container-runtime"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia-container-runtime]
runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia-container-runtime.options]
BinaryName = "nvidia-container-runtime"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia-experimental]
runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia-experimental.options]
BinaryName = "/usr/bin/nvidia-container-runtime-experimental"
$ sudo snap restart microk8s
$ microk8s enable gpu --set toolkit.enabled=false
It's important to disable the toolkit with --set toolkit.enabled=false, because you want to use the host drivers and runtime, as explained here: https://microk8s.io/docs/addon-gpu
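The nvidia RuntimeClass referenced by runtimeClassName above should also exist in the cluster after re-enabling the addon; that can be verified with, for example:
$ microk8s kubectl get runtimeclass nvidia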
I hope this also works for you :)
@Tiaanjw I used your template with runtimeClassName, which you had specified at the container level; it didn't work for me. However, once I set it at the pod level, it worked for me. - ref
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vector-add
  namespace: default
spec:
  runtimeClassName: nvidia
  restartPolicy: OnFailure
  containers:
    - name: cuda-vector-add
      image: "k8s.gcr.io/cuda-vector-add:v0.1"
      resources:
        limits:
          nvidia.com/gpu: 1
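Once the pod completes, its logs should show the vector-add sample finishing successfully, for example:
$ microk8s kubectl logs cuda-vector-add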
@pishangujeniya you are right! thanks, I have updated my solution!
@abstract-entity Did the proposed solution work? If so, can you close this issue?