text-embeddings-inference
Multi-GPU usage in SageMaker Inference endpoints
Feature request
TEI support for multi-GPU usage on multi-GPU instances such as g5.12xlarge or p4d.24xlarge, where each GPU hosts its own copy of the embedding/reranker model.
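
For illustration, outside of SageMaker this replica-per-GPU pattern can be approximated by launching one TEI container per GPU. This is only a sketch: the image tag, host ports, and replica count are assumptions for a g5.12xlarge, and the load balancer needed in front of the replicas is left out.

import subprocess

MODEL_ID = "BAAI/bge-reranker-v2-m3"
# Tag is illustrative; pick a CUDA build matching the GPU in practice.
IMAGE = "ghcr.io/huggingface/text-embeddings-inference:latest"

for gpu in range(4):  # g5.12xlarge exposes 4 A10G GPUs
    subprocess.run([
        "docker", "run", "-d",
        "--gpus", f"device={gpu}",  # pin this replica to a single GPU
        "-p", f"{8080 + gpu}:80",   # one host port per replica
        IMAGE,
        "--model-id", MODEL_ID,
    ], check=True)

Having SageMaker manage this automatically inside the TEI DLC is what this feature request is asking for.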
Motivation
TEI is able to fully utilize the single A10G on a g5.xlarge, as well as multiple CPUs, during my inference benchmarks. However, when running TEI on a g5.12xlarge (4 A10Gs), only one of the four GPUs is utilized. Being able to use every GPU on the same instance would be cost-effective at scale, especially on instances like the p4d.24xlarge with its 8 A100s.
Your contribution
I've run several benchmarks with different values for 'SM_NUM_GPUS' (a TGI setting) and 'SAGEMAKER_MODEL_SERVER_WORKERS', trying to get multiple TEI workers on the endpoint to use multiple GPUs. But the metrics continue to show only one GPU being utilized, and I get the same latency/throughput values every time.
import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

role = sagemaker.get_execution_role()  # IAM role with SageMaker permissions

hub = {
    'HF_MODEL_ID': 'BAAI/bge-reranker-v2-m3',
    'SM_NUM_GPUS': "4",  # TGI setting; appears to have no effect on TEI
    # 'SAGEMAKER_MODEL_SERVER_WORKERS': "4",  # also tried, same result
}

# create Hugging Face Model Class pointing at the TEI DLC image
huggingface_model_four_GPU = HuggingFaceModel(
    image_uri=get_huggingface_llm_image_uri("huggingface-tei"),
    env=hub,
    role=role,
)
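
For context, this is roughly how the model above is then deployed and invoked; the instance type matches my benchmarks, and the payload follows TEI's rerank schema (the query/texts values are placeholders):

predictor = huggingface_model_four_GPU.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.12xlarge",  # 4x A10G, but only one GPU shows load
)

result = predictor.predict({
    "query": "What is Deep Learning?",
    "texts": ["Deep Learning is ...", "The weather is nice today."],
})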
I'm open to contributing to the TEI SageMaker DLC if needed.