vLLM fails to launch the 72B model
Why can I launch the 72B-Chat model for direct inference with Hugging Face, but not with vLLM? The error message is: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 384.00 MiB. GPU 0 has a total capacty of 79.33 GiB of which 351.81 MiB is free. Including non-PyTorch memory, this process has 78.97 GiB memory in use. Of the allocated memory 78.33 GiB is allocated by PyTorch, and 12.36 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
With 8× A800 GPUs, launching the Hugging Face version works fine, so why does vLLM report OOM?
vLLM launch script:
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -m vllm.entrypoints.openai.api_server \
    --model qwen/Qwen1.5-72B-Chat
The Hugging Face inference script is the official one from https://github.com/QwenLM/Qwen1.5
same issue
Have you solved it yet?
Not yet.
For multiple GPUs you may need to add the --tensor-parallel-size argument. After adding it I no longer get the OOM error, but I hit other CUDA errors.
How much GPU memory did launching 72B-Chat with Hugging Face take, and what max token len did you use? Thanks.
> For multiple GPUs you may need to add the --tensor-parallel-size argument. After adding it I no longer get the OOM error, but I hit other CUDA errors.
Yeah, you need this for tensor parallelism to deploy the large model on multiple devices.
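
For the 8× A800 setup in the original post, the fixed launch command would presumably look like this (a sketch based on the replies above; the model path is copied from the original script, everything else should match your environment):

# Shard the 72B weights across all 8 GPUs instead of loading everything onto GPU 0
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -m vllm.entrypoints.openai.api_server \
    --model qwen/Qwen1.5-72B-Chat \
    --tensor-parallel-size 8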
The default max seq len (32768) was too large for my setup; on A100 40G I used --tensor-parallel-size 4 --gpu-memory-utilization 0.92 --max-model-len 512
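
Putting those flags into a full command for 4× A100 40G gives something like this (a sketch assuming the same model and entrypoint as above; CUDA_VISIBLE_DEVICES is an assumption, adjust to your setup):

# Cap the context length so the KV cache fits within 40 GiB per GPU
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m vllm.entrypoints.openai.api_server \
    --model qwen/Qwen1.5-72B-Chat \
    --tensor-parallel-size 4 \
    --gpu-memory-utilization 0.92 \
    --max-model-len 512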