nvtop
nvtop not detecting GPUs when used with Slurm
Hi, I'm facing a problem using nvtop in a Slurm interactive session. I get an interactive session on a machine with two GPUs. Slurm controls access to them, and in this case I'm requesting just one. I have verified that I can use the GPU for computation, and nvidia-smi detects it (it shows only one GPU, because that is all Slurm gives me access to). But as you can see below, nvtop says that there is no GPU to monitor. I have no idea what could be going on here. Any ideas on how to debug this, or things I could try?
Thanks,
$ nvidia-smi
Tue Jan 9 07:20:01 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.06 Driver Version: 545.23.06 CUDA Version: 12.3 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 Tesla P100-PCIE-12GB On | 00000000:02:00.0 Off | 0 |
| N/A 39C P0 25W / 250W | 0MiB / 12288MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
$ nvtop
No GPU to monitor.
Hello,
I don't know how Slurm allocates the GPUs. Could you check whether the library libnvidia-ml.so is available inside the session?
nvtop uses that library to get the GPU information. nvidia-smi either queries the driver directly or is statically linked against this library, so it will work even when the shared library cannot be found.
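One way to run that check from inside the Slurm session is to ask the dynamic loader to open the library. The snippet below is only a rough sketch: the candidate file names and the helper `find_nvml` are assumptions made for illustration, not something taken from nvtop itself. It also only reflects what the Python process can see, so a sandboxed package may still see a different set of libraries.

```python
import ctypes

# Candidate library names; which one exists on a given system is an assumption.
CANDIDATES = ("libnvidia-ml.so.1", "libnvidia-ml.so")


def find_nvml(candidates=CANDIDATES):
    """Return the first candidate the dynamic loader can open, or None."""
    for name in candidates:
        try:
            ctypes.CDLL(name)  # raises OSError if the loader cannot find or open it
        except OSError:
            continue
        return name
    return None


if __name__ == "__main__":
    found = find_nvml()
    if found:
        print(f"loadable: {found}")
    else:
        print("libnvidia-ml.so could not be loaded from this environment")
```

If this reports a loadable library while nvtop still shows no GPU, the way nvtop was packaged or installed is worth a closer look.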
It turned out that the problem seems to come from the version I had installed (the Snap package). The AppImage version works without issues.