
`nvidia-smi` shows no process information but the process is running on the GPU


I have deployed gpu-operator on an RKE2 cluster with the following values.yaml:

nfd:
  enabled: true
mig:
  strategy: single
psp:
  enabled: false
driver:
  enabled: true
  repository: nvcr.io/nvidia
  version: "525.60.13"
  rdma:
    enabled: false
    useHostMofed: false
operator:
  defaultRuntime: containerd
toolkit:
  enabled: true
  env:
  - name: CONTAINERD_CONFIG
    value: /var/lib/rancher/rke2/agent/etc/containerd/config.toml
  - name: CONTAINERD_SOCKET
    value: /run/k3s/containerd/containerd.sock
  - name: CONTAINERD_RUNTIME_CLASS
    value: nvidia
  - name: CONTAINERD_SET_AS_DEFAULT
    value: "true"

and installed it with the command below:

helm upgrade --cleanup-on-fail --install nvidia-gpu-operator nvidia/gpu-operator -n gpu-operator --create-namespace -f values.yaml
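
A quick way to verify the deployment (a sketch, assuming the gpu-operator namespace used above; pod names will vary per cluster):

# All operator, driver, toolkit and device-plugin pods should be Running or Completed
kubectl get pods -n gpu-operator

# The nvidia RuntimeClass referenced by GPU workloads should exist
kubectl get runtimeclass nvidia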

It was installed successfully, and I have checked that GPUs can be allocated to containers. An example pod YAML looks like this:

apiVersion: v1 
kind: Pod
metadata:
  name: mybusybox
spec:
  runtimeClassName: nvidia # which is necessary
  containers:
  - name: mybusybox
    image: busybox:latest
    command:
      - sleep
      - "3600"
    imagePullPolicy: IfNotPresent
    resources:
      limits:
        nvidia.com/gpu: 1
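
To confirm the device is reachable from inside such a pod, something like the following can be used (a sketch; it assumes the toolkit's default driver capabilities, which mount the nvidia-smi utility into GPU-requesting containers):

# List the GPU(s) visible to the container
kubectl exec mybusybox -- nvidia-smi -L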

I wonder if there is a configuration error, because I cannot see any processes in nvidia-smi when I run a PyTorch job on the cluster. The output of nvidia-smi looks like this:

# nvidia-smi
Tue Jan 30 19:53:47 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX A6000    On   | 00000000:04:00.0 Off |                  Off |
| 30%   45C    P2   131W / 300W |  22557MiB / 49140MiB |     99%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A6000    On   | 00000000:0C:00.0 Off |                  Off |
| 30%   28C    P8    29W / 300W |  47701MiB / 49140MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA RTX A6000    On   | 00000000:13:00.0 Off |                  Off |
| 30%   48C    P2   152W / 300W |  33643MiB / 49140MiB |     98%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA RTX A6000    On   | 00000000:1B:00.0 Off |                  Off |
| 30%   44C    P2   155W / 300W |  27981MiB / 49140MiB |     99%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

As you can see, GPU memory is allocated, but no processes are listed.

Is this the correct behavior, and how could I debug it if it is abnormal? Many thanks for your help.

zeddit · Jan 30 '24

@zeddit this is currently a known limitation. More details on this limitation are discussed in https://github.com/NVIDIA/nvidia-docker/issues/179#issuecomment-242150861. You can run the command from the nvidia-driver-daemonset pod, which shows all active processes using the GPUs.
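
For example (a sketch; the label selector assumes the operator's default app=nvidia-driver-daemonset label, and <driver-pod-name> is a placeholder for the pod on the node running your workload):

# Find the driver daemonset pod scheduled on the node with the GPU workload
kubectl get pods -n gpu-operator -l app=nvidia-driver-daemonset -o wide

# nvidia-smi run from the driver pod lists all active GPU processes (per the note above)
kubectl exec -n gpu-operator <driver-pod-name> -- nvidia-smi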

shivamerla · Feb 18 '24