
vLLM 72B fails to start

Open icevivian opened this issue 1 year ago • 7 comments

Why does the 72B-Chat model start and run inference fine with Hugging Face, but not with vLLM? The error message is: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 384.00 MiB. GPU 0 has a total capacty of 79.33 GiB of which 351.81 MiB is free. Including non-PyTorch memory, this process has 78.97 GiB memory in use. Of the allocated memory 78.33 GiB is allocated by PyTorch, and 12.36 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF.

With 8x A800 GPUs, the Hugging Face version starts fine, so why does vLLM report OOM?

vLLM startup script:
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -m vllm.entrypoints.openai.api_server \
    --model qwen/Qwen1.5-72B-Chat

The Hugging Face inference script is the official one from https://github.com/QwenLM/Qwen1.5
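
As an aside, the allocator hint at the end of that error can be tried by exporting PYTORCH_CUDA_ALLOC_CONF before launch; a minimal sketch (the split size is illustrative, and this only mitigates fragmentation rather than fixing the underlying problem):

# Illustrative value; this does not fix the real issue of loading all 72B weights on GPU 0.
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512

As the replies below show, the actual fix is sharding the model with --tensor-parallel-size.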

icevivian avatar Feb 21 '24 09:02 icevivian

same issue

zay95 avatar Feb 22 '24 07:02 zay95

Have you solved this?

1920853199 avatar Feb 26 '24 07:02 1920853199

Not solved yet.

icevivian avatar Feb 27 '24 08:02 icevivian

For multiple GPUs you probably need to add the --tensor-parallel-size argument. After adding it I no longer got the OOM error, though I hit other CUDA errors.

HuipengXu avatar Mar 06 '24 06:03 HuipengXu

How much GPU memory did starting 72B-Chat with Hugging Face use, and what max token len did you set? Thanks.

an-old-guy-in-Ecust avatar Mar 21 '24 16:03 an-old-guy-in-Ecust

> For multiple GPUs you probably need to add the --tensor-parallel-size argument. After adding it I no longer got the OOM error, though I hit other CUDA errors.

Yeah, you need this flag for tensor parallelism to deploy the large model across multiple devices.
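
For reference, a minimal sketch of the corrected launch command for the 8x A800 setup above, assuming all eight visible GPUs are used for tensor parallelism:

# Shard the 72B weights across all 8 visible GPUs instead of loading them all on GPU 0.
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -m vllm.entrypoints.openai.api_server \
    --model qwen/Qwen1.5-72B-Chat \
    --tensor-parallel-size 8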

JustinLin610 avatar Mar 23 '24 01:03 JustinLin610

The default max seq len is 32768, which is too large here; on A100 40G devices I used: --tensor-parallel-size 4 --gpu-memory-utilization 0.92 --max-model-len 512
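
Putting those flags into a full command, a sketch for a 4x A100 40G setup (the GPU indices are illustrative):

# Cap vLLM's memory pool at 92% per GPU and shrink the KV cache by limiting context to 512 tokens.
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m vllm.entrypoints.openai.api_server \
    --model qwen/Qwen1.5-72B-Chat \
    --tensor-parallel-size 4 \
    --gpu-memory-utilization 0.92 \
    --max-model-len 512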

Modas-Li avatar Apr 02 '24 06:04 Modas-Li