
Multi GPU usage in SageMaker Inference endpoints

Open · kandakji opened this issue on Jul 20, 2024 · 0 comments

Feature request

TEI support for multi-GPU usage on multi-GPU instances like g5.12xlarge or p4d.24xlarge, where each GPU serves its own copy of the embedding/reranker model.
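
To illustrate the pattern I'm requesting, here is a minimal sketch (outside SageMaker) of data-parallel serving: one TEI router process per GPU, each pinned via CUDA_VISIBLE_DEVICES, with requests round-robined across the replicas. This assumes the text-embeddings-router binary is on PATH; the ports and the (absent) worker lifecycle management are illustrative only.

import itertools
import os
import subprocess

import requests

MODEL_ID = "BAAI/bge-reranker-v2-m3"
NUM_GPUS = 4
BASE_PORT = 8080  # illustrative; one port per replica

# launch one TEI router per GPU, each pinned to a single device
procs = []
for gpu in range(NUM_GPUS):
    env = {**os.environ, "CUDA_VISIBLE_DEVICES": str(gpu)}
    procs.append(subprocess.Popen(
        ["text-embeddings-router",
         "--model-id", MODEL_ID,
         "--port", str(BASE_PORT + gpu)],
        env=env,
    ))

# round-robin rerank requests across the replicas
ports = itertools.cycle(range(BASE_PORT, BASE_PORT + NUM_GPUS))

def rerank(query, texts):
    port = next(ports)
    resp = requests.post(
        f"http://localhost:{port}/rerank",
        json={"query": query, "texts": texts},
    )
    return resp.json()

Having TEI (or the SageMaker DLC) do this replication and routing internally is what this feature request is about.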

Motivation

In my inference benchmarks, TEI fully utilizes the single A10G on a g5.xlarge and drives multiple CPU cores as well. However, when running TEI on a g5.12xlarge (4 A10Gs), only one of the four GPUs is utilized. Being able to use all GPUs on the same instance would be cost-effective at scale, especially on instances like the p4d.24xlarge with its 8 A100s.

Your contribution

I've iterated over several benchmarks with different values for 'SM_NUM_GPUS' (used in TGI) and 'SAGEMAKER_MODEL_SERVER_WORKERS', trying to get multiple TEI workers on the endpoint to use multiple GPUs. But I continue to see only one GPU utilized in the metrics, and I get the same latency/throughput values. A representative configuration:

import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

role = sagemaker.get_execution_role()

# TEI container environment; SM_NUM_GPUS is honored by TGI,
# but appears to have no effect here
hub = {
    'HF_MODEL_ID': 'BAAI/bge-reranker-v2-m3',
    'SM_NUM_GPUS': "4",
}

# create Hugging Face Model Class
huggingface_model_four_GPU = HuggingFaceModel(
    image_uri=get_huggingface_llm_image_uri("huggingface-tei"),
    env=hub,
    role=role,
)
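
For completeness, a sketch of another combination I tried: setting SAGEMAKER_MODEL_SERVER_WORKERS alongside SM_NUM_GPUS and deploying to a g5.12xlarge (the values shown are illustrative). Neither variable changed which GPUs showed activity in my benchmarks.

# model-server-worker count alongside SM_NUM_GPUS (values illustrative)
hub_workers = {
    'HF_MODEL_ID': 'BAAI/bge-reranker-v2-m3',
    'SM_NUM_GPUS': "4",
    'SAGEMAKER_MODEL_SERVER_WORKERS': "4",
}

huggingface_model_workers = HuggingFaceModel(
    image_uri=get_huggingface_llm_image_uri("huggingface-tei"),
    env=hub_workers,
    role=role,
)

# deploy to a 4x A10G instance; only one GPU shows utilization
predictor = huggingface_model_workers.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.12xlarge",
)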

I'm open to contributing to the TEI SageMaker DLC if needed.

kandakji · Jul 20 '24 05:07