Doesn't allow huggingface transformers to shard 1 model across multiple GPUs
Description I would like to shard one large LLM across multiple GPUs, but Triton wants to load a separate copy of the model onto each GPU, which results in OOM.
Triton Information What version of Triton are you using? Using this docker image version: nvcr.io/nvidia/tritonserver:23.07-py3
To Reproduce I run the docker container with this command, which is from the huggingface triton tutorial:
docker run --gpus all -it --rm --net=host --shm-size=1G --ulimit memlock=-1 --ulimit stack=67108864 -v ${PWD}/model_repository:/opt/tritonserver/model_repository triton_transformer_server tritonserver --model-repository=model_repository
I'm trying to run 3 models.
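The model repository follows the layout from the tutorial; the model names below are just placeholders for my actual models:

model_repository/
  small_model/          # placeholder names, one directory per model
    config.pbtxt
    1/
      model.py
  second_model/
    config.pbtxt
    1/
      model.py
  large_llm/
    config.pbtxt
    1/
      model.py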
The first model is small, and it has the following instance_group configuration. This config is supposed to run just one execution instance on one GPU; however, it loads 4 separate copies of the model, one onto each of the 4 GPUs.
instance_group [
{
count: 1
kind: KIND_GPU
gpus: [0]
}
]
The second model is supposed to run with just one instance on a different GPU; however, nvidia-smi reveals that Triton is trying to load 4 copies of the model, one onto each GPU.
instance_group [
{
count: 1
kind: KIND_GPU
gpus: [1]
}
]
The LARGE model is configured as below. It is supposed to load 1 instance onto one GPU. I am hoping that the HuggingFace code will properly utilize device_map="auto" to shard the model across the 4 GPUs, but instead Triton starts filling up all the GPUs, reports that it fails to load the model, and then the container exits.
instance_group [
{
count: 1
kind: KIND_GPU
gpus: [0]
}
]
Expected behavior Allow huggingface transformers model sharding when loading with device_map="auto"
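For concreteness, a minimal sketch of the kind of load I expect to be shardable (the model id and dtype are placeholders, not my exact values):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "some-org/some-large-llm"  # placeholder, not the actual model

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.float16,  # placeholder dtype
    device_map="auto",          # accelerate should split the weights across all visible GPUs
)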
Hi @moruga123, are you referring to the https://github.com/triton-inference-server/tutorials/tree/main/HuggingFace#deploying-using-a-triton-ensemble-approach-2 tutorial? The problem, as I understand it, is that when you set the GPU ID in the instance group, the model instance is not deployed to only the specified GPU but to all GPUs on the system?
I'm using this tutorial for inspiration: https://github.com/triton-inference-server/tutorials/tree/main/Quick_Deploy/HuggingFaceTransformers
Yes, the models are getting duplicated even when instance_group has count: 1, and every model is getting deployed to every GPU (one full model per GPU).
I think the issue is that the example Python model does not explicitly specify the device when calling HuggingFace Transformers. I have filed a ticket for us to revisit the tutorial.
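Roughly, the kind of change I have in mind for the tutorial's model.py is sketched below (untested; the pipeline task and model name are placeholders):

import transformers

class TritonPythonModel:
    def initialize(self, args):
        # Triton passes the instance's assigned device to the Python backend.
        # Pinning the pipeline to that device keeps this instance on the GPU
        # selected in instance_group instead of whatever HF picks by default.
        kind = args["model_instance_kind"]             # "GPU" or "CPU"
        device_id = int(args["model_instance_device_id"])
        device = device_id if kind == "GPU" else -1    # pipeline() uses -1 for CPU
        self.pipeline = transformers.pipeline(
            "text-generation",          # placeholder task
            model="some-model",         # placeholder model id
            device=device,
        )

    # execute() / finalize() unchanged from the tutorial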
> I think the issue is that the example Python model does not explicitly specify the device when calling HuggingFace Transformers. I have filed a ticket for us to revisit the tutorial.
I don't think that would address the issue of Triton loading multiple copies of the model.
Also, I would like to be able to load a sharded model.
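To be concrete, what I'm hoping for is something like the configuration below, where a single instance owns all the GPUs and the Python model is free to place the shards itself with device_map="auto". I'm assuming KIND_MODEL is the right instance kind for letting the model manage its own device placement; please correct me if that assumption is wrong.

instance_group [
{
count: 1
kind: KIND_MODEL
}
]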