fastertransformer_backend
How to deploy multiple models on a node with multiple GPUs
Description
Suppose I have 5 GPT models, each with TP=2, and I want to deploy them on a machine with 8 GPUs. Is this possible? If so, how do I control the GPU allocation? I tried setting CUDA_VISIBLE_DEVICES when launching the Triton server, but it did not work. A sketch of what I was attempting follows.
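To make the intent concrete, here is a minimal sketch of the per-instance GPU pinning I expected to work: one Triton server process per model, each restricted to a pair of GPUs via CUDA_VISIBLE_DEVICES. Note that 5 models at TP=2 need 10 GPU slots, so at least two instances would have to share a pair on an 8-GPU node. The model-repository paths, port numbers, and GPU assignments below are illustrative placeholders, not a verified configuration.

```bash
# Hypothetical layout: one tritonserver per model, pinned to a GPU pair.
# Each instance needs its own HTTP/GRPC/metrics ports to avoid conflicts.
# With 8 GPUs and five TP=2 models, the fifth instance reuses GPUs 0,1.
CUDA_VISIBLE_DEVICES=0,1 tritonserver --model-repository=/models/gpt_a \
    --http-port 8000 --grpc-port 8001 --metrics-port 8002 &

CUDA_VISIBLE_DEVICES=2,3 tritonserver --model-repository=/models/gpt_b \
    --http-port 8010 --grpc-port 8011 --metrics-port 8012 &

CUDA_VISIBLE_DEVICES=4,5 tritonserver --model-repository=/models/gpt_c \
    --http-port 8020 --grpc-port 8021 --metrics-port 8022 &

CUDA_VISIBLE_DEVICES=6,7 tritonserver --model-repository=/models/gpt_d \
    --http-port 8030 --grpc-port 8031 --metrics-port 8032 &

CUDA_VISIBLE_DEVICES=0,1 tritonserver --model-repository=/models/gpt_e \
    --http-port 8040 --grpc-port 8041 --metrics-port 8042 &
```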
Reproduced Steps
Set CUDA_VISIBLE_DEVICES before launching the Triton server to restrict the visible GPUs; the restriction did not take effect.
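For concreteness, a single attempted launch looked roughly like this (the model-repository path is a placeholder):

```bash
# Attempted launch; restricting GPU visibility this way did not work.
CUDA_VISIBLE_DEVICES=0,1 tritonserver --model-repository=/models/gpt_a
```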