
a100-MIG multiple mig devices not supported with torchserve

Open LydiaXiaohongLi opened this issue 4 years ago • 8 comments

I have created multiple MIG devices on a machine with one NVIDIA A100 GPU. When I start TorchServe with no limit on number_of_gpu, I expect it to use all of the MIG devices created (6 in total); however, only MIG Dev 0 is used.

xiaohong_li@semantic-torchserve-a100-mig:~/title_semantic_relevance_api_torchserve$ nvidia-smi
Tue Sep  7 06:25:02 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.73.01    Driver Version: 460.73.01    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  A100-SXM4-40GB      Off  | 00000000:00:04.0 Off |                   On |
| N/A   34C    P0    47W / 400W |   2084MiB / 40536MiB |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| MIG devices:                                                                 |
+------------------+----------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |         Memory-Usage |        Vol|         Shared        |
|      ID  ID  Dev |           BAR1-Usage | SM     Unc| CE  ENC  DEC  OFA  JPG|
|                  |                      |        ECC|                       |
|==================+======================+===========+=======================|
|  0    3   0   0  |   2074MiB /  9984MiB | 28      0 |  2   0    1    0    0 |
|                  |      4MiB / 16383MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0    9   0   1  |      1MiB /  4864MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0   10   0   2  |      1MiB /  4864MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0   11   0   3  |      1MiB /  4864MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0   12   0   4  |      1MiB /  4864MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0   13   0   5  |      1MiB /  4864MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                   |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0    3    0      17401      C   /opt/conda/bin/python3.7         2067MiB |
+-----------------------------------------------------------------------------+

Context

  • torchserve version: 0.4.2
  • torch-model-archiver version: 0.4.2
  • torch version: 1.9.0+cu111
  • java version: 11.0.12
  • Operating System and version: Debian

Your Environment

  • Installed using source? [yes/no]: no
  • Is it a CPU or GPU environment?: GPU
  • Using a default/custom handler?: custom handler
  • What kind of model is it e.g. vision, text, audio?: transformers, distilled bert
  • Are you planning to use local models from model-store or public url being used e.g. from S3 bucket etc.? local model
  • Provide config.properties, logs [ts.log] and parameters used for model registration/update APIs:

Expected Behavior

TorchServe should use all MIG devices, i.e. load models on all MIG devices and run inference across all MIG devices.

Current Behavior

TorchServe only uses MIG device 0 (out of 6).

Steps to Reproduce

  1. Create an A100 VM with MIG enabled and multiple MIG devices created, e.g. following https://cloud.google.com/compute/docs/gpus/create-vm-with-gpus#mig-gpu
  2. Run any sample TorchServe model.

Thanks!

LydiaXiaohongLi avatar Sep 07 '21 07:09 LydiaXiaohongLi

Thanks @LydiaXiaohongLi for opening this ticket. I believe this is happening because devices are assigned using GPU physical IDs; as indicated in the link you shared, all the partitioned GPUs are on GPU:0. You might be able to modify the device assignment in a custom handler to support the partitioned GPUs; an example of a custom handler can be found here.
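For context, here is a simplified sketch (not a verified fix) of how that per-worker device assignment looks in a handler's initialize, mirroring the base handler's behavior; the class name is illustrative. Because TorchServe hands each worker a numeric gpu_id and all six MIG partitions sit on physical GPU 0, every worker resolves to the same device:

import torch
from ts.torch_handler.base_handler import BaseHandler

class MyHandler(BaseHandler):
    def initialize(self, context):
        properties = context.system_properties
        gpu_id = properties.get("gpu_id")  # physical GPU index assigned by TorchServe
        if torch.cuda.is_available() and gpu_id is not None:
            # With a single A100, every MIG partition lives on physical GPU 0,
            # so this always resolves to "cuda:0" regardless of partitioning.
            self.device = torch.device("cuda:{}".format(gpu_id))
        else:
            self.device = torch.device("cpu")
        # ...load the model onto self.device here, e.g. self.model.to(self.device)

This is the assignment a custom handler would have to override to target individual MIG partitions.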

HamidShojanazeri avatar Sep 08 '21 03:09 HamidShojanazeri

Hi, thanks a lot @HamidShojanazeri for your prompt reply!

Sorry, may I understand better how to do device assignment with partitioned GPUs, when all partitioned GPUs are assigned the same GPU physical ID?

I understand that A100 MIG assigns a unique UUID to each partitioned GPU, as shown below.

xxxx@xxxx:~$ nvidia-smi -L
GPU 0: A100-SXM4-40GB (UUID: GPU-6482a92e-d06b-dc68-c272-e3d8f7ecabbf)
  MIG 1g.5gb Device 0: (UUID: MIG-GPU-6482a92e-d06b-dc68-c272-e3d8f7ecabbf/7/0)
  MIG 1g.5gb Device 1: (UUID: MIG-GPU-6482a92e-d06b-dc68-c272-e3d8f7ecabbf/8/0)
  MIG 1g.5gb Device 2: (UUID: MIG-GPU-6482a92e-d06b-dc68-c272-e3d8f7ecabbf/9/0)
  MIG 1g.5gb Device 3: (UUID: MIG-GPU-6482a92e-d06b-dc68-c272-e3d8f7ecabbf/11/0)
  MIG 1g.5gb Device 4: (UUID: MIG-GPU-6482a92e-d06b-dc68-c272-e3d8f7ecabbf/12/0)
  MIG 1g.5gb Device 5: (UUID: MIG-GPU-6482a92e-d06b-dc68-c272-e3d8f7ecabbf/13/0)
  MIG 1g.5gb Device 6: (UUID: MIG-GPU-6482a92e-d06b-dc68-c272-e3d8f7ecabbf/14/0)

So far I am only able to use CUDA_VISIBLE_DEVICES on the command line to bring up processes that run on different MIG partitions, such as the commands below. However, I am not sure how to integrate this with TorchServe. It would be much appreciated if you could share some sample code for A100 MIG partitioned GPUs. Thanks a lot!

CUDA_VISIBLE_DEVICES=MIG-GPU-6482a92e-d06b-dc68-c272-e3d8f7ecabbf/7/0 python test.py
CUDA_VISIBLE_DEVICES=MIG-GPU-6482a92e-d06b-dc68-c272-e3d8f7ecabbf/8/0 python test.py
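For reference, a minimal sketch of what those per-process commands rely on (the UUID is just the example one from above): when CUDA_VISIBLE_DEVICES names a single MIG partition, that partition is the only device the process sees, so it can be addressed as plain cuda:0. The variable has to be set before CUDA is initialized:

import os

# Must be set before torch initializes CUDA; example MIG UUID from above.
os.environ["CUDA_VISIBLE_DEVICES"] = "MIG-GPU-6482a92e-d06b-dc68-c272-e3d8f7ecabbf/7/0"

import torch  # imported after setting the variable

device = torch.device("cuda:0")  # resolves to the single visible MIG partition
x = torch.ones(3, device=device)
print(torch.cuda.get_device_name(0), x.device)

The open question is how to get TorchServe itself to set a different value of this variable for each worker process.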

LydiaXiaohongLi avatar Sep 12 '21 12:09 LydiaXiaohongLi

Hi @HamidShojanazeri, any updates? Thank you!

LydiaXiaohongLi avatar Sep 21 '21 08:09 LydiaXiaohongLi

@LydiaXiaohongLi sorry for the late reply. I am looking into this issue; it seems MIG addressing is currently only available from the nvidia-smi command line, and CUDA_VISIBLE_DEVICES has been extended to support it. I am looking into a way to programmatically access this info through nvidia-smi or other packages. It is not supported through the CUDA utilities in PyTorch yet.

In the meantime, if you are blocked, one hacky way that should work (though it is not scalable) is to assign devices by iterating over the GPU instance IDs, which you already know.

It seems the structure of the device name is: MIG-<GPU-UUID>/<GPU instance ID>/<compute instance ID>

Something like this, inside the handler's initialize:

import random
import torch

GPU_UUID = "MIG-GPU-63feeb45-94c6-b9cb-78ea-98e9b7a5be6b"
GPU_instance_ID_list = [0, 1, 2, 3, 4, 5, 6, 7]

# Fall back to CPU when CUDA is not available.
self.map_location = GPU_UUID if torch.cuda.is_available() else "cpu"

# Device name structure: <GPU-UUID>/<GPU instance ID>/<compute instance ID>;
# pick a GPU instance for this worker.
self.device = torch.device(
    self.map_location + "/{}/0".format(random.choice(GPU_instance_ID_list))
)

HamidShojanazeri avatar Sep 22 '21 05:09 HamidShojanazeri

hi @HamidShojanazeri ,

I have tried your suggestion; however, the device string format GPU_UUID/GPU_instance_ID/0 is not a valid device string. I have tried some other formats for the device string, and all failed with "Invalid device string".

RuntimeError: Invalid device string: 'GPU-58a224da-d632-bb7e-bdce-3e0e117ad6e4/7/0'
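For reference, the same error reproduces outside TorchServe (a minimal check, assuming torch 1.9): torch.device only parses strings like "cpu", "cuda" or "cuda:<index>", so the MIG identifier is rejected regardless of prefix:

import torch

print(torch.device("cuda:0"))  # valid device string
try:
    torch.device("MIG-GPU-58a224da-d632-bb7e-bdce-3e0e117ad6e4/7/0")
except RuntimeError as err:
    print(err)  # RuntimeError: Invalid device string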

Are you able to create a torch device with the UUID string format on your side?

I am using torchserve 0.4.2 and torch 1.9.0+cu111.

Many thanks! Regards Xiaohong

LydiaXiaohongLi avatar Sep 22 '21 10:09 LydiaXiaohongLi

@LydiaXiaohongLi Sorry, I think a "MIG-" prefix was missing from the start of the string, so it should be "MIG-GPU-6482a92e-d06b-dc68-c272-e3d8f7ecabbf/7/0". I will also look into the NVML Python bindings; they might be helpful for listing the available MIG devices. You might want to give them a shot as well. Here is the doc on the Python bindings.
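For example, a rough sketch of listing MIG devices with the NVML Python bindings (pynvml / nvidia-ml-py); this is untested here and assumes a MIG-enabled driver that exposes these NVML calls:

import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # the physical A100
try:
    max_count = pynvml.nvmlDeviceGetMaxMigDeviceCount(handle)
    for i in range(max_count):
        try:
            mig = pynvml.nvmlDeviceGetMigDeviceHandleByIndex(handle, i)
        except pynvml.NVMLError:
            continue  # slot not populated
        print(pynvml.nvmlDeviceGetUUID(mig))
finally:
    pynvml.nvmlShutdown()

Something like this could be used to discover the available partitions programmatically.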

HamidShojanazeri avatar Sep 22 '21 19:09 HamidShojanazeri

Hi @HamidShojanazeri, thanks for the references. By the way, I have tried various formats of the device string, including with the 'MIG-' prefix, and it still fails for the same reason.

LydiaXiaohongLi avatar Sep 23 '21 14:09 LydiaXiaohongLi

@LydiaXiaohongLi @HamidShojanazeri Hi, this issue seems to have been around for a while... Is there any update on it?

nickisworking avatar Jan 23 '24 17:01 nickisworking