How to specify which GPUs the model runs inference on?
Hello, I have 4 GPUs. When I set tensor_parallel_size to 2 and run the service, it takes CUDA:0 and CUDA:1.
My question is: if I want to start two workers (i.e. two processes that each deploy the same model), how do I specify that my second process should take CUDA:2 and CUDA:3?
Right now, if I just start the second service without any config, it OOMs.
NCCL_P2P_DISABLE=1 CUDA_VISIBLE_DEVICES=2,3 python -m vllm.entrypoints.api_server --tensor-parallel-size 2 --host 127.0.0.1
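To run two workers side by side, one option is to launch the server twice with different CUDA_VISIBLE_DEVICES values and different ports so they don't collide (the port numbers below are only an example; adjust to your setup):

# first worker on GPUs 0 and 1
NCCL_P2P_DISABLE=1 CUDA_VISIBLE_DEVICES=0,1 python -m vllm.entrypoints.api_server --tensor-parallel-size 2 --host 127.0.0.1 --port 8000
# second worker on GPUs 2 and 3
NCCL_P2P_DISABLE=1 CUDA_VISIBLE_DEVICES=2,3 python -m vllm.entrypoints.api_server --tensor-parallel-size 2 --host 127.0.0.1 --port 8001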
I have a question about my servers. When cuda:0 is almost full, it still fails even though I pass CUDA_VISIBLE_DEVICES?
Oh, I find that the ray::worker processes are still taking the first two GPUs even when I specify the other two:
NCCL_P2P_DISABLE=1 CUDA_VISIBLE_DEVICES=2,3 python -m vllm.entrypoints.api_server --tensor-parallel-size 2 --host 127.0.0.1
@MasKong Can you elaborate on this a bit? This is my simple code, and I want to use GPUs 1 and 3.
from vllm import LLM

llm = LLM(model_name, max_model_len=50, tensor_parallel_size=2)
output = llm.generate(text)
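One way to pin this to GPUs 1 and 3 is to set CUDA_VISIBLE_DEVICES before vLLM (and torch/ray) initialize CUDA; the two visible devices then appear to the process as cuda:0 and cuda:1. A minimal sketch, where the model name and prompt are placeholders for illustration:

import os

# Must be set before vLLM/torch/ray touch CUDA; physical GPUs 1 and 3
# are then exposed to this process as the only two visible devices.
os.environ["CUDA_VISIBLE_DEVICES"] = "1,3"

from vllm import LLM

model_name = "facebook/opt-125m"  # placeholder model for illustration
llm = LLM(model_name, max_model_len=50, tensor_parallel_size=2)
output = llm.generate("Hello, my name is")  # sample prompt
print(output)

Setting the variable in the shell before starting the Python process is generally the more reliable option, since it guarantees nothing has initialized CUDA first.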
You can find the complete issue here.
Closing in preference to #3012