
Is it possible to query just one GPU?

Open skat00sh opened this issue 2 years ago • 6 comments

nvidia-smi has a flag to query and monitor just one GPU: nvidia-smi --id=<id>. I don't see any such flag or option for gpustat.

It is usually helpful when, say, one out of 4 or 5 GPUs has a device-driver issue: querying with nvidia-smi fails, but if we query the GPUs individually, we get regular results for the ones that are healthy.

It would be even better if gpustat did this by default, i.e.:

  1. Calls nvidia-smi
  2. Checks if there's an error
  3. Then sequentially checks all the available GPUs individually and produces a result (something like the sketch after this list)
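Roughly like this (just a sketch on my side using plain nvidia-smi calls; the 8-GPU upper bound is an assumption, not something gpustat knows about):

# Hypothetical sketch of the proposed fallback: if the full nvidia-smi query
# fails, query each GPU individually and report the healthy ones.
import subprocess

def smi(*args):
    # run nvidia-smi with the given arguments and capture its output
    return subprocess.run(["nvidia-smi", *args], capture_output=True, text=True)

full = smi()
if full.returncode == 0:
    print(full.stdout)
else:
    # full query failed; fall back to querying each GPU individually
    for index in range(8):  # assumed upper bound on the number of GPUs
        single = smi("-i", str(index))
        if single.returncode == 0:
            print(single.stdout)
        else:
            print(f"GPU {index}: query failed ({single.stderr.strip()})")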

skat00sh avatar Jun 10 '22 08:06 skat00sh

Have you tried this with the latest development version? I guess when one device is failing, the latest version of gpustat would still report the results for the other devices. If not, we can ignore such errors and have a fallback mode (rather than querying GPUs individually and merging the results).

wookayin avatar Jun 14 '22 14:06 wookayin

I'm on version 0.6.0 and I get this output from gpustat:
Error on querying NVIDIA devices. Use --debug flag for details

Using nvidia-smi -L, I can see that one of the GPUs has a driver issue: Unable to determine the device handle for gpu 0000:08:00.0: Unknown Error

Also, I didn't exactly understand what "fallback mode" means here.

skat00sh avatar Jun 15 '22 12:06 skat00sh

@wookayin Tried the latest version from the master branch as well. It still fails: one out of the 5 GPUs on the server has the error I described above. Any quick workarounds?

skat00sh avatar Jun 27 '22 12:06 skat00sh

@skat00sh Can you please provide the full output of gpustat --debug (please install the dev version, or more conveniently, 1.0.0rc1)? I'd like to see which particular exception/error is raised in your specific case.

wookayin avatar Jul 06 '22 12:07 wookayin

Sure! Here's the output with the suggested version:

(handcrafted-dp-opt) vyas@fe-computenode-2:/opt/sperl/students/devendra/projects/dp-adversarial (dev) 
$ gpustat --version
gpustat 1.0.0rc1
(handcrafted-dp-opt) vyas@fe-computenode-2:/opt/sperl/students/devendra/projects/dp-adversarial (dev) 
$ gpustat --debug
Error on querying NVIDIA devices. Use --debug flag for details
Traceback (most recent call last):
  File "/opt/sperl/students/devendra/miniconda3/envs/handcrafted-dp-opt/lib/python3.7/site-packages/gpustat/cli.py", line 20, in print_gpustat
    gpu_stats = GPUStatCollection.new_query(debug=debug)
  File "/opt/sperl/students/devendra/miniconda3/envs/handcrafted-dp-opt/lib/python3.7/site-packages/gpustat/core.py", line 537, in new_query
    handle = N.nvmlDeviceGetHandleByIndex(index)
  File "/opt/sperl/students/devendra/miniconda3/envs/handcrafted-dp-opt/lib/python3.7/site-packages/pynvml.py", line 1655, in nvmlDeviceGetHandleByIndex
    _nvmlCheckReturn(ret)
  File "/opt/sperl/students/devendra/miniconda3/envs/handcrafted-dp-opt/lib/python3.7/site-packages/pynvml.py", line 765, in _nvmlCheckReturn
    raise NVMLError(ret)
pynvml.NVMLError_Unknown: Unknown Error

skat00sh avatar Jul 06 '22 20:07 skat00sh

@skat00sh Thanks for the information. It is strange that pynvml throws Unknown Error.

This is a special case of #81, so I reworked #81 so that when one GPU is failing, gpustat displays an error for that device instead of throwing. Example:

[1] GeForce GTX TITAN 1 | 36°C,   0 % |  9000 / 12189 MB | user1(3000M) user3(6000M)
[2] ((Unknown Error))   |  ?°C,   ? % |     ? /     ? MB | (Not Supported)
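Roughly, the per-device handling is now tolerant of such failures. A minimal sketch of the idea with pynvml (simplified and hypothetical, not the actual code from #81):

# Minimal sketch: a failing device becomes a placeholder row instead of
# aborting the whole query.
import pynvml
pynvml.nvmlInit()
for index in range(pynvml.nvmlDeviceGetCount()):
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(index)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        print(f"[{index}] ok | {mem.used // 2**20} / {mem.total // 2**20} MB")
    except pynvml.NVMLError as e:
        # the broken GPU (e.g. NVMLError_Unknown) gets an error row like above
        print(f"[{index}] (({e})) |  ?°C,   ? % |     ? /     ? MB")
pynvml.nvmlShutdown()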

wookayin avatar Sep 14 '22 00:09 wookayin

The (Not Supported) case was fixed by #81. We may want to add an --id option nonetheless.

wookayin avatar Oct 12 '22 03:10 wookayin

Added a new option --id.

e.g.

gpustat --id 0
gpustat --id 0,1,2

wookayin avatar Mar 02 '23 14:03 wookayin

Thanks! It'd be really helpful!

skat00sh avatar Mar 06 '23 00:03 skat00sh