NVML_ERROR_NOT_SUPPORTED exception
🐛 Describe the bug
Sometimes NVML does not support monitoring queries for specific devices. Currently this causes TorchServe to fail during the startup phase.
Error logs
2022-07-04T12:33:15,023 [ERROR] Thread-20 org.pytorch.serve.metrics.MetricCollector - Traceback (most recent call last):
  File "ts/metrics/metric_collector.py", line 27, in <module>
    system_metrics.collect_all(sys.modules['ts.metrics.system_metrics'], arguments.gpu)
  File "/usr/local/lib/python3.6/dist-packages/ts/metrics/system_metrics.py", line 91, in collect_all
    value(num_of_gpu)
  File "/usr/local/lib/python3.6/dist-packages/ts/metrics/system_metrics.py", line 72, in gpu_utilization
    statuses = list_gpus.device_statuses()
  File "/usr/local/lib/python3.6/dist-packages/nvgpu/list_gpus.py", line 67, in device_statuses
    return [device_status(device_index) for device_index in range(device_count)]
  File "/usr/local/lib/python3.6/dist-packages/nvgpu/list_gpus.py", line 67, in <listcomp>
    return [device_status(device_index) for device_index in range(device_count)]
  File "/usr/local/lib/python3.6/dist-packages/nvgpu/list_gpus.py", line 26, in device_status
    temperature = nv.nvmlDeviceGetTemperature(handle, nv.NVML_TEMPERATURE_GPU)
  File "/usr/local/lib/python3.6/dist-packages/pynvml/nvml.py", line 1956, in nvmlDeviceGetTemperature
    _nvmlCheckReturn(ret)
  File "/usr/local/lib/python3.6/dist-packages/pynvml/nvml.py", line 765, in _nvmlCheckReturn
    raise NVMLError(ret)
pynvml.nvml.NVMLError_NotSupported: Not Supported
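The failing call can be reproduced outside TorchServe. A minimal sketch, assuming the same nvgpu and pynvml packages shown in the traceback above:

import pynvml
from nvgpu import list_gpus

try:
    # Same call TorchServe makes in ts/metrics/system_metrics.py
    print(list_gpus.device_statuses())
except pynvml.nvml.NVMLError_NotSupported as err:
    # device_statuses() queries nvmlDeviceGetTemperature per device; NVML
    # rejects that query on devices without exposed temperature sensors.
    print("NVML query not supported:", err)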
Installation instructions
pytorch/torchserve:latest-gpu
Model Packaging
N/A
config.properties
No response
Versions
------------------------------------------------------------------------------------------
Environment headers
------------------------------------------------------------------------------------------
Torchserve branch:
torchserve==0.6.0
torch-model-archiver==0.6.0
Python version: 3.6 (64-bit runtime)
Python executable: /usr/bin/python3
Versions of relevant python libraries:
future==0.18.2
numpy==1.19.5
nvgpu==0.9.0
psutil==5.9.1
requests==2.27.1
torch-model-archiver==0.6.0
torch-workflow-archiver==0.2.4
torchserve==0.6.0
wheel==0.30.0
**Warning: torch not present ..
**Warning: torchtext not present ..
**Warning: torchvision not present ..
**Warning: torchaudio not present ..
Java Version:
OS: N/A
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: N/A
CMake version: N/A
Repro instructions
run:
torchserve --start --foreground --model-store model-store/
Possible Solution
Handle these exceptions gracefully so that an unsupported NVML query does not abort startup.
Thanks for opening this, which specific devices are you referring to? Is it an older NVIDIA GPU? An AMD GPU? Something else? EDIT: This seems to be a somewhat known issue: https://forums.developer.nvidia.com/t/bug-nvml-incorrectly-detects-certain-gpus-as-unsupported/30165. We can produce a better workaround.
nvidia-smi:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03 Driver Version: 510.47.03 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GRID A100DX-40C On | 00000000:00:05.0 Off | 0 |
| N/A N/A P0 N/A / N/A | 0MiB / 40960MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
This is a virtual GPU. It seems that some features like temperature monitoring are not supported for these virtual devices. See for instance page 118 of https://docs.nvidia.com/grid/latest/pdf/grid-vgpu-user-guide.pdf.
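A quick way to see which devices reject the query is to probe them one by one with pynvml. A rough sketch (not an official vGPU-detection method; it only reports which devices raise NVMLError_NotSupported for the temperature query):

import pynvml

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        name = pynvml.nvmlDeviceGetName(handle)
        if isinstance(name, bytes):  # some pynvml versions return bytes
            name = name.decode()
        try:
            pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
            print(name, "- temperature query supported")
        except pynvml.nvml.NVMLError_NotSupported:
            print(name, "- temperature query not supported (likely a vGPU)")
finally:
    pynvml.nvmlShutdown()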
@msaroufim If you approve for an upstream bug-fix I'd be happy to help.
@msaroufim any update on this?
Hi @lromor I'm not sure what the right fix is yet. It does seem like this is a problem introduced by NVIDIA (pynvml.nvml.NVMLError_NotSupported: Not Supported), so I believe your best bet is commenting on https://forums.developer.nvidia.com/t/bug-nvml-incorrectly-detects-certain-gpus-as-unsupported/30165, which will give someone on the team some buffer to take a look.
Hi @msaroufim , I've opened an issue here: https://forums.developer.nvidia.com/t/nvml-issue-with-virtual-a100/220718?u=lromor
In case anyone runs into a similar issue and would like a quick fix, I patched the code with:
diff --git a/ts/metrics/system_metrics.py b/ts/metrics/system_metrics.py
index c7aaf6a..9915c9e 100644
--- a/ts/metrics/system_metrics.py
+++ b/ts/metrics/system_metrics.py
@@ -7,6 +7,7 @@ from builtins import str
 import psutil
 from ts.metrics.dimension import Dimension
 from ts.metrics.metric import Metric
+import pynvml
 system_metrics = []
 dimension = [Dimension('Level', 'Host')]
@@ -69,7 +70,11 @@ def gpu_utilization(num_of_gpu):
         system_metrics.append(Metric('GPUMemoryUtilization', value['mem_used_percent'], 'percent', dimension_gpu))
         system_metrics.append(Metric('GPUMemoryUsed', value['mem_used'], 'MB', dimension_gpu))
 
-    statuses = list_gpus.device_statuses()
+    try:
+        statuses = list_gpus.device_statuses()
+    except pynvml.nvml.NVMLError_NotSupported:
+        statuses = []
+
     for idx, status in enumerate(statuses):
         dimension_gpu = [Dimension('Level', 'Host'), Dimension("device_id", idx)]
         system_metrics.append(Metric('GPUUtilization', status['utilization'], 'percent', dimension_gpu))
I think this is the right solution. Wanna make a PR for it? May just need to add a logging warning as well
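For the warning, something along these lines could work (a sketch only; the logger setup is an assumption and the final PR may hook into TorchServe's existing logging instead):

import logging

import pynvml
from nvgpu import list_gpus

logger = logging.getLogger(__name__)

try:
    statuses = list_gpus.device_statuses()
except pynvml.nvml.NVMLError_NotSupported:
    # e.g. virtual GPUs that do not expose temperature sensors
    logger.warning("NVML does not support device status queries on this GPU; skipping GPU status metrics")
    statuses = []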