
How to specify which GPUs the model runs inference on?


Hello, I have 4 GPUs. When I set tensor_parallel_size to 2 and start the service, it takes CUDA:0 and CUDA:1.

My question is: if I want to start two workers (i.e., two processes each serving the same model), how do I make the second process use CUDA:2 and CUDA:3?

Because right now, if I just start the service without any configuration, it OOMs.

zoubaihan avatar Jul 04 '23 06:07 zoubaihan

NCCL_P2P_DISABLE=1 CUDA_VISIBLE_DEVICES=2,3 python -m vllm.entrypoints.api_server --tensor-parallel-size 2 --host 127.0.0.1
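
A sketch of the full two-worker setup implied above (the port numbers are just examples): each process only sees the GPUs listed in CUDA_VISIBLE_DEVICES, so internally both workers address their devices as cuda:0 and cuda:1.

# worker 1 on physical GPUs 0 and 1
NCCL_P2P_DISABLE=1 CUDA_VISIBLE_DEVICES=0,1 python -m vllm.entrypoints.api_server --tensor-parallel-size 2 --host 127.0.0.1 --port 8000

# worker 2 on physical GPUs 2 and 3, on a different port so the two servers don't clash
NCCL_P2P_DISABLE=1 CUDA_VISIBLE_DEVICES=2,3 python -m vllm.entrypoints.api_server --tensor-parallel-size 2 --host 127.0.0.1 --port 8001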

MasKong avatar Jul 04 '23 06:07 MasKong

I have a question about my servers. It seems that when cuda:0 is almost full, passing CUDA_VISIBLE_DEVICES still fails to keep vLLM off it?

MM-IR avatar Jul 15 '23 12:07 MM-IR

Oh, I find that the ray::worker processes are still taking the first two GPUs even when I specify the other two.

MM-IR avatar Jul 15 '23 12:07 MM-IR

NCCL_P2P_DISABLE=1 CUDA_VISIBLE_DEVICES=2,3 python -m vllm.entrypoints.api_server --tensor-parallel-size 2 --host 127.0.0.1

@MasKong Can you elaborate on this a bit? This is my simple codebase and I want to use GPUs 1 and 3.

from vllm import LLM

llm = LLM(model_name, max_model_len=50, tensor_parallel_size=2)
output = llm.generate(text)

You can find the complete issue here

humza-sami avatar Feb 23 '24 19:02 humza-sami
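
As a rough sketch of what the answer above amounts to for this snippet (the script name below is a hypothetical placeholder): restrict device visibility before the process starts, so vLLM only ever sees GPUs 1 and 3 and maps them to cuda:0 and cuda:1.

# expose only physical GPUs 1 and 3 to the process; run_llm.py is a placeholder name
CUDA_VISIBLE_DEVICES=1,3 python run_llm.py

The equivalent in-code approach is to set os.environ["CUDA_VISIBLE_DEVICES"] = "1,3" at the very top of the script, before vllm (or torch) is imported, since device visibility is fixed once the CUDA runtime initializes.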

Closing in preference to #3012

hmellor avatar Mar 20 '24 12:03 hmellor