
Metrics collector crashes when NVIDIA MIGs are present

Open UrkoAT opened this issue 2 years ago • 2 comments

🐛 Describe the bug

I was configuring the pytorch/torchserve:0.10.0-gpu Docker image to deploy a model to production and I've encountered the following issue: the nvgpu package used by the metrics collector fails to work with NVIDIA MIG technology and crashes the collector thread.

After a bit of investigation, the culprit is the nvgpu.gpu_info() function, which tries to parse the nvidia-smi output. On a normal GPU it works fine, since it grabs the Memory-Usage field (around the fifth line, second column):

urko@port-urkoa:~$ nvidia-smi 
Tue Apr 16 08:35:29 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.171.04             Driver Version: 535.171.04   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 2060        Off | 00000000:01:00.0  On |                  N/A |
| N/A   68C    P0              38W /  80W |   3091MiB /  6144MiB |     18%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      3317      G   /usr/lib/xorg/Xorg                         1955MiB |
|    0   N/A  N/A      3632      G   /usr/bin/gnome-shell                        279MiB |
|    0   N/A  N/A      4915      G   ...seed-version=20240414-180149.278000      327MiB |
|    0   N/A  N/A      5778      G   ...erProcess --variations-seed-version      477MiB |
+---------------------------------------------------------------------------------------+

However, MIG technology changes the nvidia-smi output, which then looks like this:

root@torchserve-depl-6479499d9f-8p8j7:/home/model-server# nvidia-smi 
Tue Apr 16 06:27:22 2024       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.147.05   Driver Version: 525.147.05   CUDA Version: 12.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-PCI...  Off  | 00000000:03:00.0 Off |                   On |
| N/A   74C    P0    63W / 250W |                  N/A |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| MIG devices:                                                                |
+------------------+----------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |         Memory-Usage |        Vol|         Shared        |
|      ID  ID  Dev |           BAR1-Usage | SM     Unc| CE  ENC  DEC  OFA  JPG|
|                  |                      |        ECC|                       |
|==================+======================+===========+=======================|
|  0    4   0   0  |   1141MiB /  9856MiB | 28      0 |  2   0    1    0    0 |
|                  |      2MiB / 16383MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

So when nvgpu tries to parse the Memory-Usage field, it gets N/A, tries to convert it to an integer, and that's the error I get.
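The failure can be reproduced without a GPU. A minimal sketch of the conversion nvgpu performs (illustrative, not nvgpu's exact code; `parse_memory_usage` is a made-up name for this sketch):

```python
# Sketch of nvgpu's Memory-Usage conversion: split the table cell on '/'
# and int()-convert each half after stripping the 'MiB' suffix.
def parse_memory_usage(cell):
    return [int(m.strip().replace('MiB', '')) for m in cell.split('/')]

# Normal GPU: the cell looks like '3091MiB /  6144MiB'
print(parse_memory_usage('3091MiB /  6144MiB'))  # [3091, 6144]

# MIG-enabled GPU: the cell is literally 'N/A', so split('/') yields
# ['N', 'A'] and int('N') raises the ValueError seen in the logs.
try:
    parse_memory_usage('N/A')
except ValueError as e:
    print(e)  # invalid literal for int() with base 10: 'N'
```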

Error logs

The error I get in the main logs:

2024-04-16T06:25:41,915 [ERROR] Thread-14 org.pytorch.serve.metrics.MetricCollector - Traceback (most recent call last):
  File "/home/venv/lib/python3.9/site-packages/ts/metrics/metric_collector.py", line 27, in <module>
    system_metrics.collect_all(sys.modules['ts.metrics.system_metrics'], arguments.gpu)
  File "/home/venv/lib/python3.9/site-packages/ts/metrics/system_metrics.py", line 119, in collect_all
    value(num_of_gpu)
  File "/home/venv/lib/python3.9/site-packages/ts/metrics/system_metrics.py", line 71, in gpu_utilization
    info = nvgpu.gpu_info()
  File "/home/venv/lib/python3.9/site-packages/nvgpu/__init__.py", line 22, in gpu_info
    mem_used, mem_total = [int(m.strip().replace('MiB', '')) for m in
  File "/home/venv/lib/python3.9/site-packages/nvgpu/__init__.py", line 22, in <listcomp>
    mem_used, mem_total = [int(m.strip().replace('MiB', '')) for m in
ValueError: invalid literal for int() with base 10: 'N'

Installation instructions

I used a Dockerfile with torchserve as the base image:

FROM pytorch/torchserve:0.10.0-gpu
ENV DEBIAN_FRONTEND=noninteractive
USER 0
RUN apt update && apt install -y python3-opencv python3-pip git build-essential
ENV PYTHONUNBUFFERED=1
RUN pip install opencv-python torchvision torch torchaudio timm numpy scikit-learn matplotlib seaborn pandas
RUN pip install 'git+https://github.com/facebookresearch/detectron2.git'

Model Packaging

Standard .mar file. Doesn't apply.

config.properties

default_workers_per_model=1

Versions

Standard Docker image pytorch/torchserve:0.10.0-gpu.

Repro instructions

The steps to reproduce it (it is MANDATORY to have MIGs enabled):

root@torchserve-depl-6479499d9f-8p8j7:/home/model-server# python3   
Python 3.9.18 (main, Aug 25 2023, 13:20:04) 
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import nvgpu
>>> nvgpu.gpu_info()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/venv/lib/python3.9/site-packages/nvgpu/__init__.py", line 22, in gpu_info
    mem_used, mem_total = [int(m.strip().replace('MiB', '')) for m in
  File "/home/venv/lib/python3.9/site-packages/nvgpu/__init__.py", line 22, in <listcomp>
    mem_used, mem_total = [int(m.strip().replace('MiB', '')) for m in
ValueError: invalid literal for int() with base 10: 'N'
>>> 

Possible Solution

I know it's a bug in nvgpu, not in torchserve, but AFAIK nvgpu is no longer being maintained, so it might be a good chance to change the package or the way it works. Just my suggestion. Thanks
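If the package does get swapped out, one option would be to query NVML directly through the nvidia-ml-py bindings instead of parsing nvidia-smi text. A hedged sketch of such an alternative (not anything TorchServe currently ships; `gpu_info_nvml` is a made-up name, and the `pynvml` module plus an NVIDIA driver must be present at call time):

```python
def gpu_info_nvml():
    """Per-GPU memory via NVML instead of scraping nvidia-smi output.

    Sketch only: requires the nvidia-ml-py package (imported lazily so
    this module loads without it) and an NVIDIA driver on the host.
    """
    import pynvml  # pip install nvidia-ml-py

    pynvml.nvmlInit()
    try:
        infos = []
        for i in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)  # bytes
            name = pynvml.nvmlDeviceGetName(handle)
            if isinstance(name, bytes):  # older pynvml returns bytes
                name = name.decode('utf-8')
            infos.append({
                'index': str(i),
                'type': name,
                'mem_used': mem.used // (1024 * 1024),
                'mem_total': mem.total // (1024 * 1024),
                'mem_used_percent': 100.0 * mem.used / mem.total,
            })
        return infos
    finally:
        pynvml.nvmlShutdown()
```

Because NVML reports structured values rather than a formatted table, there is no 'N/A' string to mis-parse when MIG is enabled.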

UrkoAT avatar Apr 16 '24 06:04 UrkoAT

@UrkoAT Thank you for investigating the root cause. We are aware that there are some bugs in nvgpu, which is also in maintenance mode. TS v0.10.0 provides a feature that allows customizing the system metrics (see PR)

lxning avatar Apr 16 '24 16:04 lxning

I also faced this problem. To solve it, I made my own gpu_info function, based on nvgpu's version...

import re
import subprocess


def _run_cmd(cmd):
    # nvidia-smi must be on PATH; check_output raises on a non-zero exit
    return subprocess.check_output(cmd).decode('utf-8')


def gpu_info():
    # 'nvidia-smi -L' lists devices; MIG slices show up as extra 'MIG ... Device' lines
    nvsmi = _run_cmd(['nvidia-smi', '-L'])
    pieces = nvsmi.split(' ')
    mig_mode = 'MIG' in pieces and 'Device' in pieces

    gpus = [line for line in nvsmi.split('\n') if line and line.startswith('GPU')]
    gpu_infos = [re.match(r'GPU ([0-9]+): ([^(]+) \(UUID: ([^)]+)\)', gpu).groups() for gpu in gpus]
    gpu_infos = [dict(zip(['index', 'type', 'uuid'], info)) for info in gpu_infos]
    gpu_count = len(gpus)

    lines = _run_cmd(['nvidia-smi'])
    cuda_version = float(lines.split('CUDA Version:')[1].strip().split(' ')[0])

    if not mig_mode:
        # Non-MIG: the Memory-Usage cell sits at a fixed line offset per GPU;
        # the offset differs between the pre- and post-CUDA-11 table layouts
        lines = lines.split('\n')
        if cuda_version < 11:
            line_distance = 3
            selected_lines = lines[7:7 + line_distance * gpu_count]
        else:
            line_distance = 4
            selected_lines = lines[8:8 + line_distance * gpu_count]
        for i in range(gpu_count):
            mem_used, mem_total = [int(m.strip().replace('MiB', '')) for m in
                                   selected_lines[line_distance * i + 1].split('|')[2].strip().split('/')]
            gpu_infos[i]['mem_used'] = mem_used
            gpu_infos[i]['mem_total'] = mem_total
            gpu_infos[i]['mem_used_percent'] = 100. * mem_used / mem_total
    else:
        # MIG: the per-GPU Memory-Usage cell is 'N/A', so sum the per-slice
        # rows from the 'MIG devices:' table instead. Each slice prints two
        # consecutive MiB lines (framebuffer, then BAR1); only the first of
        # each pair is wanted, hence the lines[i]/lines[i+1] check below.
        lines = lines.replace('|', ' ')
        while '  ' in lines:
            lines = lines.replace('  ', ' ')
        lines = lines.split('\n')
        for i in range(len(lines) - 1):
            if 'MiB' in lines[i] and 'MiB' in lines[i + 1]:
                pieces = lines[i].strip().split('MiB')
                pieces1 = pieces[0].split(' ')
                pieces2 = pieces[1].split(' ')
                gpuid = int(pieces1[0])
                mem_used = int(pieces1[-1])
                mem_total = int(pieces2[-1])
                if 'mem_used' in gpu_infos[gpuid]:
                    gpu_infos[gpuid]['mem_used'] += mem_used
                    gpu_infos[gpuid]['mem_total'] += mem_total
                else:
                    gpu_infos[gpuid]['mem_used'] = mem_used
                    gpu_infos[gpuid]['mem_total'] = mem_total
                gpu_infos[gpuid]['mem_used_percent'] = (
                    100. * gpu_infos[gpuid]['mem_used'] / gpu_infos[gpuid]['mem_total'])
    return gpu_infos

alesoumac avatar Jul 23 '24 12:07 alesoumac