
vLLM 72B fails to start

Open icevivian opened this issue 1 year ago • 7 comments

Why does the 72B-Chat model start and run inference fine with Hugging Face, but not with vLLM? The error message is: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 384.00 MiB. GPU 0 has a total capacty of 79.33 GiB of which 351.81 MiB is free. Including non-PyTorch memory, this process has 78.97 GiB memory in use. Of the allocated memory 78.33 GiB is allocated by PyTorch, and 12.36 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF.

With 8x A800 GPUs, the Hugging Face version starts fine, so why does vLLM report OOM?

vLLM startup script:
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -m vllm.entrypoints.openai.api_server \
    --model qwen/Qwen1.5-72B-Chat

The Hugging Face inference script is the official one from https://github.com/QwenLM/Qwen1.5
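
As an aside, the allocator hint at the end of that error can be tried by exporting PYTORCH_CUDA_ALLOC_CONF before launch; a minimal sketch (the split size is illustrative, and this only mitigates fragmentation rather than fixing the underlying problem):

# Illustrative value; this does not fix the real issue of loading all 72B weights on GPU 0.
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512

As the replies below show, the actual fix is sharding the model with --tensor-parallel-size.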

icevivian avatar Feb 21 '24 09:02 icevivian

same issue

zay95 avatar Feb 22 '24 07:02 zay95

Have you solved this?

1920853199 avatar Feb 26 '24 07:02 1920853199

Not solved yet.

icevivian avatar Feb 27 '24 08:02 icevivian

For multiple GPUs you probably need to add the --tensor-parallel-size argument. After adding it I no longer got the OOM error, though I hit other CUDA errors.

HuipengXu avatar Mar 06 '24 06:03 HuipengXu

How much GPU memory did starting 72B-Chat with Hugging Face use, and what max token len did you set? Thanks.

an-old-guy-in-Ecust avatar Mar 21 '24 16:03 an-old-guy-in-Ecust

> For multiple GPUs you probably need to add the --tensor-parallel-size argument. After adding it I no longer got the OOM error, though I hit other CUDA errors.

Yeah, you need this flag for tensor parallelism to deploy the large model across multiple devices.
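
For reference, a minimal sketch of the corrected launch command for the 8x A800 setup above, assuming all eight visible GPUs are used for tensor parallelism:

# Shard the 72B weights across all 8 visible GPUs instead of loading them all on GPU 0.
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -m vllm.entrypoints.openai.api_server \
    --model qwen/Qwen1.5-72B-Chat \
    --tensor-parallel-size 8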

JustinLin610 avatar Mar 23 '24 01:03 JustinLin610

The default max seq len is 32768, which is too large here; on A100 40G devices I used: --tensor-parallel-size 4 --gpu-memory-utilization 0.92 --max-model-len 512
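
Putting those flags into a full command, a sketch for a 4x A100 40G setup (the GPU indices are illustrative):

# Cap vLLM's memory pool at 92% per GPU and shrink the KV cache by limiting context to 512 tokens.
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m vllm.entrypoints.openai.api_server \
    --model qwen/Qwen1.5-72B-Chat \
    --tensor-parallel-size 4 \
    --gpu-memory-utilization 0.92 \
    --max-model-len 512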

Modas-Li avatar Apr 02 '24 06:04 Modas-Li