Doesn't allow huggingface transformers to shard 1 model across multiple GPUs
Description I would like to shard one large LLM across multiple GPUs, but Triton wants to load a separate copy of the model onto each GPU, which results in OOM.
Triton Information What version of Triton are you using? Using this docker image version: nvcr.io/nvidia/tritonserver:23.07-py3
To Reproduce I run the docker container with this command, which is from the huggingface triton tutorial:
docker run --gpus all -it --rm --net=host --shm-size=1G --ulimit memlock=-1 --ulimit stack=67108864 -v ${PWD}/model_repository:/opt/tritonserver/model_repository triton_transformer_server tritonserver --model-repository=model_repository
I'm trying to run 3 models.
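The model repository follows the layout from the tutorial; the model names below are just placeholders for my actual models:

model_repository/
  small_model/          # placeholder names, one directory per model
    config.pbtxt
    1/
      model.py
  second_model/
    config.pbtxt
    1/
      model.py
  large_llm/
    config.pbtxt
    1/
      model.py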
The first model is small, and it has the following instance_group configuration. This config is supposed to run just one execution instance on one GPU; however, it loads 4 separate copies of the model, one onto each of the 4 GPUs.
instance_group [
{
count: 1
kind: KIND_GPU
gpus: [0]
}
]
The second model is supposed to run with just one instance on a different GPU; however, nvidia-smi reveals that Triton is trying to load 4 copies of the model, one onto each GPU.
instance_group [
{
count: 1
kind: KIND_GPU
gpus: [1]
}
]
The LARGE model is configured as below. It is supposed to load 1 instance onto one GPU. I am hoping that the HuggingFace code will properly utilize device_map="auto" to shard the model across the 4 GPUs, but instead Triton starts filling up all the GPUs, reports that it fails to load the model, and then the container exits.
instance_group [
{
count: 1
kind: KIND_GPU
gpus: [0]
}
]
Expected behavior Allow huggingface transformers model sharding when loading with device_map="auto"
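For concreteness, a minimal sketch of the kind of load I expect to be shardable (the model id and dtype are placeholders, not my exact values):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "some-org/some-large-llm"  # placeholder, not the actual model

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.float16,  # placeholder dtype
    device_map="auto",          # accelerate should split the weights across all visible GPUs
)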
Hi @moruga123, are you referring to the https://github.com/triton-inference-server/tutorials/tree/main/HuggingFace#deploying-using-a-triton-ensemble-approach-2 tutorial? The problem, as I understand it, is that when you set the GPU ID in the instance group, the model instance is not deployed to only the specified GPU but to all GPUs on the system?
I'm using this tutorial for inspiration: https://github.com/triton-inference-server/tutorials/tree/main/Quick_Deploy/HuggingFaceTransformers
Yes, the models are getting duplicated even when instance_group has count: 1, and every model is getting deployed to every GPU (one full model per GPU).
I think the issue is that the example Python model does not explicitly specify the device when calling HuggingFace Transformers. I have filed a ticket for us to revisit the tutorial.
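Roughly, the kind of change I have in mind for the tutorial's model.py is sketched below (untested; the pipeline task and model name are placeholders):

import transformers

class TritonPythonModel:
    def initialize(self, args):
        # Triton passes the instance's assigned device to the Python backend.
        # Pinning the pipeline to that device keeps this instance on the GPU
        # selected in instance_group instead of whatever HF picks by default.
        kind = args["model_instance_kind"]             # "GPU" or "CPU"
        device_id = int(args["model_instance_device_id"])
        device = device_id if kind == "GPU" else -1    # pipeline() uses -1 for CPU
        self.pipeline = transformers.pipeline(
            "text-generation",          # placeholder task
            model="some-model",         # placeholder model id
            device=device,
        )

    # execute() / finalize() unchanged from the tutorial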
> I think the issue is that the example Python model does not explicitly specify the device when calling HuggingFace Transformers. I have filed a ticket for us to revisit the tutorial.
I don't think that would address the issue of Triton loading multiple copies of the model.
Also, I would like to be able to load a sharded model.
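To be concrete, what I'm hoping for is something like the configuration below, where a single instance owns all the GPUs and the Python model is free to place the shards itself with device_map="auto". I'm assuming KIND_MODEL is the right instance kind for letting the model manage its own device placement; please correct me if that assumption is wrong.

instance_group [
{
count: 1
kind: KIND_MODEL
}
]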