`nvidia-smi` shows no process information, but the process is running on the GPU
I have deployed the gpu-operator on an RKE2 cluster with the following values.yaml:
nfd:
  enabled: true
mig:
  strategy: single
psp:
  enabled: false
driver:
  enabled: true
  repository: nvcr.io/nvidia
  version: "525.60.13"
  rdma:
    enabled: false
    useHostMofed: false
operator:
  defaultRuntime: containerd
toolkit:
  enabled: true
  env:
    - name: CONTAINERD_CONFIG
      value: /var/lib/rancher/rke2/agent/etc/containerd/config.toml
    - name: CONTAINERD_SOCKET
      value: /run/k3s/containerd/containerd.sock
    - name: CONTAINERD_RUNTIME_CLASS
      value: nvidia
    - name: CONTAINERD_SET_AS_DEFAULT
      value: "true"
and installed it using the command below:
helm upgrade --cleanup-on-fail --install nvidia-gpu-operator nvidia/gpu-operator -n gpu-operator --create-namespace -f values.yaml
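As a quick sanity check after the install, the operator components can be listed like this (pod name suffixes are cluster-specific; the driver, container-toolkit, device-plugin and validator pods should all be Running or Completed):

kubectl get pods -n gpu-operator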
It was installed successfully, and I have checked that a GPU can be allocated to containers; an example pod manifest looks like below:
apiVersion: v1
kind: Pod
metadata:
  name: mybusybox
spec:
  runtimeClassName: nvidia # which is necessary
  containers:
    - name: mybusybox
      image: busybox:latest
      command:
        - sleep
        - "3600"
      imagePullPolicy: IfNotPresent
      resources:
        limits:
          nvidia.com/gpu: 1
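To confirm that the GPU is actually visible inside such a pod, nvidia-smi can be run from within it (this assumes the container toolkit injects the nvidia-smi binary into the container, which it does by default):

kubectl exec mybusybox -- nvidia-smi -L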
I wonder if there is some configuration error, because I could not see any process listed by nvidia-smi when I set up a PyTorch job on the cluster.
The output of nvidia-smi looks like below:
# nvidia-smi
Tue Jan 30 19:53:47 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12 Driver Version: 525.85.12 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA RTX A6000 On | 00000000:04:00.0 Off | Off |
| 30% 45C P2 131W / 300W | 22557MiB / 49140MiB | 99% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA RTX A6000 On | 00000000:0C:00.0 Off | Off |
| 30% 28C P8 29W / 300W | 47701MiB / 49140MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA RTX A6000 On | 00000000:13:00.0 Off | Off |
| 30% 48C P2 152W / 300W | 33643MiB / 49140MiB | 98% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 NVIDIA RTX A6000 On | 00000000:1B:00.0 Off | Off |
| 30% 44C P2 155W / 300W | 27981MiB / 49140MiB | 99% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
As we can see, GPU memory is allocated, but I cannot see any processes.
Is this the correct behavior, and how could I debug it if it is abnormal? Many thanks for your help.
@zeddit this is a known limitation currently. More details on this limitation are discussed in https://github.com/NVIDIA/nvidia-docker/issues/179#issuecomment-242150861. You could run the command from the nvidia-driver-daemonset pod, which shows all active processes using the GPU.
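For example, a sketch of that check (the daemonset pod name below is a placeholder; substitute the driver pod running on the node in question):

# find the driver pod for the relevant node
kubectl get pods -n gpu-operator -o wide | grep nvidia-driver-daemonset

# run nvidia-smi inside it; the driver pod shares the host PID namespace,
# so processes from all pods on that node should be listed
kubectl exec -n gpu-operator nvidia-driver-daemonset-xxxxx -- nvidia-smi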