
a100-MIG multiple mig devices not supported with torchserve

Open LydiaXiaohongLi opened this issue 4 years ago • 8 comments

I have created multiple MIG devices on a machine with one NVIDIA A100 GPU. When I start TorchServe with no limit on number_of_gpu, I expect it to use all of the MIG devices created (6 in total); however, only MIG Dev 0 is used.

xiaohong_li@semantic-torchserve-a100-mig:~/title_semantic_relevance_api_torchserve$ nvidia-smi
Tue Sep  7 06:25:02 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.73.01    Driver Version: 460.73.01    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  A100-SXM4-40GB      Off  | 00000000:00:04.0 Off |                   On |
| N/A   34C    P0    47W / 400W |   2084MiB / 40536MiB |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| MIG devices:                                                                 |
+------------------+----------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |         Memory-Usage |        Vol|         Shared        |
|      ID  ID  Dev |           BAR1-Usage | SM     Unc| CE  ENC  DEC  OFA  JPG|
|                  |                      |        ECC|                       |
|==================+======================+===========+=======================|
|  0    3   0   0  |   2074MiB /  9984MiB | 28      0 |  2   0    1    0    0 |
|                  |      4MiB / 16383MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0    9   0   1  |      1MiB /  4864MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0   10   0   2  |      1MiB /  4864MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0   11   0   3  |      1MiB /  4864MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0   12   0   4  |      1MiB /  4864MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0   13   0   5  |      1MiB /  4864MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                   |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0    3    0      17401      C   /opt/conda/bin/python3.7         2067MiB |
+-----------------------------------------------------------------------------+

Context

  • torchserve version: 0.4.2
  • torch-model-archiver version: 0.4.2
  • torch version: 1.9.0+cu111
  • java version: 11.0.12
  • Operating System and version: Debian

Your Environment

  • Installed using source? [yes/no]: no
  • Is it a CPU or GPU environment?: GPU
  • Using a default/custom handler?: custom handler
  • What kind of model is it e.g. vision, text, audio?: transformers, distilled bert
  • Are you planning to use local models from model-store or public url being used e.g. from S3 bucket etc.? local model
  • Provide config.properties, logs [ts.log] and parameters used for model registration/update APIs:

Expected Behavior

TorchServe should use all MIG devices, i.e. load models on all MIG devices and run inference across all MIG devices.

Current Behavior

TorchServe only uses MIG device 0 (out of 6).

Steps to Reproduce

  1. Create an A100 VM with MIG enabled and multiple MIG devices created, e.g. following https://cloud.google.com/compute/docs/gpus/create-vm-with-gpus#mig-gpu
  2. Run any sample TorchServe model.

Thanks!

LydiaXiaohongLi avatar Sep 07 '21 07:09 LydiaXiaohongLi

Thanks @LydiaXiaohongLi for opening this ticket. I believe this is happening because devices are assigned using GPU physical IDs; as indicated in the link you shared, all the partitioned GPUs are on GPU:0. You might be able to modify the device assignment in a custom handler to support the partitioned GPUs; an example of a custom handler can be found here.
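For context, here is a simplified sketch (not a verified fix) of how that per-worker device assignment looks in a handler's initialize, mirroring the base handler's behavior; the class name is illustrative. Because TorchServe hands each worker a numeric gpu_id and all six MIG partitions sit on physical GPU 0, every worker resolves to the same device:

import torch
from ts.torch_handler.base_handler import BaseHandler

class MyHandler(BaseHandler):
    def initialize(self, context):
        properties = context.system_properties
        gpu_id = properties.get("gpu_id")  # physical GPU index assigned by TorchServe
        if torch.cuda.is_available() and gpu_id is not None:
            # With a single A100, every MIG partition lives on physical GPU 0,
            # so this always resolves to "cuda:0" regardless of partitioning.
            self.device = torch.device("cuda:{}".format(gpu_id))
        else:
            self.device = torch.device("cpu")
        # ...load the model onto self.device here, e.g. self.model.to(self.device)

This is the assignment a custom handler would have to override to target individual MIG partitions.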

HamidShojanazeri avatar Sep 08 '21 03:09 HamidShojanazeri

Hi, thanks a lot @HamidShojanazeri for your prompt reply!

Sorry, may I understand better how to do device assignment with partitioned GPUs, when all partitioned GPUs are assigned the same GPU physical ID?

I understand that A100 MIG assigns a unique UUID to each partitioned GPU, as shown below.

xxxx@xxxx:~$ nvidia-smi -L
GPU 0: A100-SXM4-40GB (UUID: GPU-6482a92e-d06b-dc68-c272-e3d8f7ecabbf)
  MIG 1g.5gb Device 0: (UUID: MIG-GPU-6482a92e-d06b-dc68-c272-e3d8f7ecabbf/7/0)
  MIG 1g.5gb Device 1: (UUID: MIG-GPU-6482a92e-d06b-dc68-c272-e3d8f7ecabbf/8/0)
  MIG 1g.5gb Device 2: (UUID: MIG-GPU-6482a92e-d06b-dc68-c272-e3d8f7ecabbf/9/0)
  MIG 1g.5gb Device 3: (UUID: MIG-GPU-6482a92e-d06b-dc68-c272-e3d8f7ecabbf/11/0)
  MIG 1g.5gb Device 4: (UUID: MIG-GPU-6482a92e-d06b-dc68-c272-e3d8f7ecabbf/12/0)
  MIG 1g.5gb Device 5: (UUID: MIG-GPU-6482a92e-d06b-dc68-c272-e3d8f7ecabbf/13/0)
  MIG 1g.5gb Device 6: (UUID: MIG-GPU-6482a92e-d06b-dc68-c272-e3d8f7ecabbf/14/0)

So far I am only able to use CUDA_VISIBLE_DEVICES on the command line to bring up processes that run on different MIG partitions, such as the commands below. However, I am not sure how to integrate this with TorchServe. It would be much appreciated if you could share some sample code for A100 MIG partitioned GPUs. Thanks a lot!

CUDA_VISIBLE_DEVICES=MIG-GPU-6482a92e-d06b-dc68-c272-e3d8f7ecabbf/7/0 python test.py
CUDA_VISIBLE_DEVICES=MIG-GPU-6482a92e-d06b-dc68-c272-e3d8f7ecabbf/8/0 python test.py
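For reference, a minimal sketch of what those per-process commands rely on (the UUID is just the example one from above): when CUDA_VISIBLE_DEVICES names a single MIG partition, that partition is the only device the process sees, so it can be addressed as plain cuda:0. The variable has to be set before CUDA is initialized:

import os

# Must be set before torch initializes CUDA; example MIG UUID from above.
os.environ["CUDA_VISIBLE_DEVICES"] = "MIG-GPU-6482a92e-d06b-dc68-c272-e3d8f7ecabbf/7/0"

import torch  # imported after setting the variable

device = torch.device("cuda:0")  # resolves to the single visible MIG partition
x = torch.ones(3, device=device)
print(torch.cuda.get_device_name(0), x.device)

The open question is how to get TorchServe itself to set a different value of this variable for each worker process.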

LydiaXiaohongLi avatar Sep 12 '21 12:09 LydiaXiaohongLi

Hi @HamidShojanazeri, any updates? Thank you!

LydiaXiaohongLi avatar Sep 21 '21 08:09 LydiaXiaohongLi

@LydiaXiaohongLi sorry for the late reply. I am looking into this issue; it seems MIG addressing is currently only available from the nvidia-smi command line, and CUDA_VISIBLE_DEVICES has been extended to support it. I am looking into a way to programmatically access this info through nvidia-smi or other packages. It is not supported through the CUDA utilities in PyTorch yet.

In the meantime, if you are blocked, one hacky way that should work (though it is not scalable) is to assign devices by iterating over the GPU instance IDs, which you already know.

It seems the structure of the device name is: MIG-<GPU-UUID>/<GPU instance ID>/<compute instance ID>

Something like this, inside the handler's initialize:

import random
import torch

GPU_UUID = "MIG-GPU-63feeb45-94c6-b9cb-78ea-98e9b7a5be6b"
GPU_instance_ID_list = [0, 1, 2, 3, 4, 5, 6, 7]

# Fall back to CPU when CUDA is not available.
self.map_location = GPU_UUID if torch.cuda.is_available() else "cpu"

# Device name structure: <GPU-UUID>/<GPU instance ID>/<compute instance ID>;
# pick a GPU instance for this worker.
self.device = torch.device(
    self.map_location + "/{}/0".format(random.choice(GPU_instance_ID_list))
)

HamidShojanazeri avatar Sep 22 '21 05:09 HamidShojanazeri

hi @HamidShojanazeri ,

I have tried your suggestion; however, the device string format GPU_UUID/GPU_instance_ID/0 is not a valid device string. I have tried some other formats for the device string, and all failed with "Invalid device string".

RuntimeError: Invalid device string: 'GPU-58a224da-d632-bb7e-bdce-3e0e117ad6e4/7/0'
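For reference, the same error reproduces outside TorchServe (a minimal check, assuming torch 1.9): torch.device only parses strings like "cpu", "cuda" or "cuda:<index>", so the MIG identifier is rejected regardless of prefix:

import torch

print(torch.device("cuda:0"))  # valid device string
try:
    torch.device("MIG-GPU-58a224da-d632-bb7e-bdce-3e0e117ad6e4/7/0")
except RuntimeError as err:
    print(err)  # RuntimeError: Invalid device string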

Are you able to create a torch device with the UUID string format on your side?

I am using torchserve 0.4.2 and torch 1.9.0+cu111.

Many thanks! Regards Xiaohong

LydiaXiaohongLi avatar Sep 22 '21 10:09 LydiaXiaohongLi

@LydiaXiaohongLi Sorry, I think a "MIG-" prefix was missing from the start of the string, so it should be "MIG-GPU-6482a92e-d06b-dc68-c272-e3d8f7ecabbf/7/0". I will also look into the NVML Python bindings; they might be helpful for listing the available MIG devices. You might want to give them a shot as well. Here is the doc on the Python bindings.
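For example, a rough sketch of listing MIG devices with the NVML Python bindings (pynvml / nvidia-ml-py); this is untested here and assumes a MIG-enabled driver that exposes these NVML calls:

import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # the physical A100
try:
    max_count = pynvml.nvmlDeviceGetMaxMigDeviceCount(handle)
    for i in range(max_count):
        try:
            mig = pynvml.nvmlDeviceGetMigDeviceHandleByIndex(handle, i)
        except pynvml.NVMLError:
            continue  # slot not populated
        print(pynvml.nvmlDeviceGetUUID(mig))
finally:
    pynvml.nvmlShutdown()

Something like this could be used to discover the available partitions programmatically.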

HamidShojanazeri avatar Sep 22 '21 19:09 HamidShojanazeri

Hi @HamidShojanazeri, thanks for the references. By the way, I have tried various formats of the device string, including with the 'MIG-' prefix, and it still fails for the same reason.

LydiaXiaohongLi avatar Sep 23 '21 14:09 LydiaXiaohongLi

@LydiaXiaohongLi @HamidShojanazeri Hi, this issue seems to have been around for a while... Is there any update on it?

nickisworking avatar Jan 23 '24 17:01 nickisworking