gpustat
Is it possible to query just one GPU?
nvidia-smi has a flag to query and monitor just one GPU: nvidia-smi --id=<id> (short form: -i <id>).
I don't see any such flag or option for gpustat.
This is usually helpful when, say, one out of 4 or 5 GPUs has an issue with its device driver: querying nvidia-smi for all GPUs fails, but querying the GPUs individually produces regular results for the ones that are healthy.
It would be even better if gpustat could do this by default, as in:
- Calls nvidia-smi
- Checks if there's an error
- Then sequentially checks all the available GPUs individually and produces a result
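The steps above can be sketched roughly as follows. This is only an illustration of the per-device fallback idea, not gpustat's actual implementation: `query_each` and `query_one_with_nvml` are hypothetical names, and the second helper assumes pynvml is installed and `pynvml.nvmlInit()` has already been called.

```python
from typing import Callable, Dict, Iterable, Union


def query_each(indices: Iterable[int],
               query_one: Callable[[int], dict]) -> Dict[int, Union[dict, Exception]]:
    """Query every GPU index separately so one failing device cannot abort the rest."""
    results: Dict[int, Union[dict, Exception]] = {}
    for index in indices:
        try:
            results[index] = query_one(index)
        except Exception as exc:  # e.g. pynvml.NVMLError raised for a broken device
            # Keep the error in place of the stats so the caller can still report it.
            results[index] = exc
    return results


def query_one_with_nvml(index: int) -> dict:
    """Hypothetical per-device query; needs the NVIDIA driver and a prior nvmlInit()."""
    import pynvml  # imported lazily so query_each stays usable without pynvml

    handle = pynvml.nvmlDeviceGetHandleByIndex(index)
    memory = pynvml.nvmlDeviceGetMemoryInfo(handle)
    return {
        "name": pynvml.nvmlDeviceGetName(handle),
        "memory.used": memory.used // (1024 ** 2),    # bytes -> MiB
        "memory.total": memory.total // (1024 ** 2),  # bytes -> MiB
    }
```

On a machine with the driver installed, this would be called as `query_each(range(pynvml.nvmlDeviceGetCount()), query_one_with_nvml)` between `pynvml.nvmlInit()` and `pynvml.nvmlShutdown()`. Healthy GPUs map to a stats dict and the failing one maps to the exception it raised.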
Have you tried this with the latest development version? I guess that when one device is failing, the latest version of gpustat would still report the results for the other devices. If not, we can ignore such errors and have a fallback mode (rather than querying the GPUs individually and merging the results).
I'm on version 0.6.0 and I get this output from gpustat:
Error on querying NVIDIA devices. Use --debug flag for details
Running nvidia-smi -L shows that one of the GPUs has an issue:
Unable to determine the device handle for gpu 0000:08:00.0: Unknown Error
I didn't exactly understand what fallback mode means here.
@wookayin I tried the latest version from the master branch as well. It still fails: one of the 5 GPUs on the server has the error I described above. Any quick workarounds?
@skat00sh Can you please provide the full output of gpustat --debug (please install the dev version, or more conveniently 1.0.0.rc1)? I'd like to see which particular exception/error has been raised in your specific case.
Sure! Here's the output for the suggested version:
(handcrafted-dp-opt) vyas@fe-computenode-2:/opt/sperl/students/devendra/projects/dp-adversarial (dev)
$ gpustat --version
gpustat 1.0.0rc1
(handcrafted-dp-opt) vyas@fe-computenode-2:/opt/sperl/students/devendra/projects/dp-adversarial (dev)
$ gpustat --debug
Error on querying NVIDIA devices. Use --debug flag for details
Traceback (most recent call last):
  File "/opt/sperl/students/devendra/miniconda3/envs/handcrafted-dp-opt/lib/python3.7/site-packages/gpustat/cli.py", line 20, in print_gpustat
    gpu_stats = GPUStatCollection.new_query(debug=debug)
  File "/opt/sperl/students/devendra/miniconda3/envs/handcrafted-dp-opt/lib/python3.7/site-packages/gpustat/core.py", line 537, in new_query
    handle = N.nvmlDeviceGetHandleByIndex(index)
  File "/opt/sperl/students/devendra/miniconda3/envs/handcrafted-dp-opt/lib/python3.7/site-packages/pynvml.py", line 1655, in nvmlDeviceGetHandleByIndex
    _nvmlCheckReturn(ret)
  File "/opt/sperl/students/devendra/miniconda3/envs/handcrafted-dp-opt/lib/python3.7/site-packages/pynvml.py", line 765, in _nvmlCheckReturn
    raise NVMLError(ret)
pynvml.NVMLError_Unknown: Unknown Error
@skat00sh Thanks for the information. It is strange that pynvml throws Unknown Error.
This is a special case of #81, so I rewrote #81 so that when one GPU is failing, gpustat displays an error for that GPU instead of raising an exception. Example:
[1] GeForce GTX TITAN 1 | 36°C, 0 % | 9000 / 12189 MB | user1(3000M) user3(6000M)
[2] ((Unknown Error)) | ?°C, ? % | ? / ? MB | (Not Supported)
The (Not Supported) case was fixed by #81. We may want to add an -id option nonetheless.
Added a new option --id, e.g.
gpustat --id 0
gpustat --id 0,1,2
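For readers curious how a comma-separated value such as `0,1,2` can be turned into a list of GPU indices, here is a minimal sketch using argparse. It is an assumption about one possible approach, not a copy of gpustat's code: `parse_gpu_ids`, `build_parser`, the `gpu_ids` destination and the program name are all hypothetical.

```python
import argparse
from typing import List


def parse_gpu_ids(value: str) -> List[int]:
    """Turn a string like "0,1,2" into [0, 1, 2], rejecting non-integer parts."""
    try:
        return [int(part) for part in value.split(",")]
    except ValueError:
        raise argparse.ArgumentTypeError(
            f"expected comma-separated integers, got {value!r}")


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical stand-in for the real CLI; only the --id option is modelled.
    parser = argparse.ArgumentParser(prog="gpustat-id-sketch")
    parser.add_argument("--id", dest="gpu_ids", type=parse_gpu_ids, default=None,
                        help="comma-separated GPU indices to show (default: all)")
    return parser
```

With this sketch, `build_parser().parse_args(["--id", "0,1,2"]).gpu_ids` is a list of three integers, and omitting `--id` leaves `gpu_ids` as `None`, which a caller could treat as "show every GPU".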
Thanks! It'd be really helpful!