vLLM distributed inference stuck when using multi-GPU
I am trying to run an inference server on multiple GPUs (4 × NVIDIA GeForce RTX 3090) with this command:
python -u -m vllm.entrypoints.api_server --host 0.0.0.0 --model mistralai/Mistral-7B-Instruct-v0.2 --tensor-parallel-size 4
This works fine with --tensor-parallel-size=1, but with --tensor-parallel-size > 1 it gets stuck on startup.
Thanks
This is happening to me too, on 2 × 3090.
Try these parameters (a combined example follows below):
--gpu-memory-utilization 0.7~0.9
--max-model-len 8192
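For reference, a combined invocation with those flags might look like this (0.8 is just an illustrative value from the 0.7–0.9 range; tune it to your GPUs):

python -u -m vllm.entrypoints.api_server --host 0.0.0.0 --model mistralai/Mistral-7B-Instruct-v0.2 --tensor-parallel-size 4 --gpu-memory-utilization 0.8 --max-model-len 8192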
Hello, I have tried the method you provided, but it has no effect.
No effect here either
Did you find a solution? I have the same issue.
@BilalKHA95 try this
export NCCL_P2P_DISABLE=1
This worked for me.
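For context, NCCL_P2P_DISABLE=1 turns off direct peer-to-peer transfers between GPUs, which are often problematic on consumer GeForce cards such as the 3090, so NCCL falls back to staging transfers through host memory. A minimal sketch of how it fits together (same command as above, flags illustrative):

export NCCL_P2P_DISABLE=1
python -u -m vllm.entrypoints.api_server --host 0.0.0.0 --model mistralai/Mistral-7B-Instruct-v0.2 --tensor-parallel-size 4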
Thanks!!! It's working now with this env variable + updating the CUDA toolkit to 12.3.
This also solved this issue for me.
Hi! Does this also result in higher tokens/second for you (for a small model like --model mistralai/Mistral-7B-Instruct-v0.2 with --tensor-parallel-size 4)? Thanks!
This didn't work for me:
export NCCL_P2P_DISABLE=1
Are there any solutions?
Thank you guys very much in advance!
Best regards,
Shuyue June 9th, 2024
We have added documentation for this situation in #5430. Please take a look.
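For anyone still stuck even with NCCL_P2P_DISABLE=1, two first diagnostic steps (independent of whatever the linked documentation recommends) are to enable NCCL's own logging and to inspect how the GPUs are interconnected:

export NCCL_DEBUG=INFO    # NCCL prints transport/initialization details at startup
nvidia-smi topo -m        # show the GPU interconnect topology (PCIe/NVLink paths)

If the server hangs during NCCL initialization, the INFO log often shows which transport was last being set up, which helps narrow the problem down to P2P, shared memory, or PCIe ACS/IOMMU settings.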