[Bug] vLLM inference with the internvl3-1b model is very slow
Checklist
- [x] 1. I have searched related issues but cannot get the expected help.
- [x] 2. The bug has not been fixed in the latest version.
- [x] 3. Please note that if the bug-related issue you submit lacks the corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, which reduces the likelihood of receiving feedback.
Describe the bug
When serving the internvl3-1b model with vLLM, inference is much slower than serving the qwen2-vl-3b model with the same data, the same `vllm serve` launch command, and the same environment: the former takes roughly 1.5-2x as long as the latter. What could be causing this? Thanks.
Reproduction
Launch command:

```shell
CUDA_VISIBLE_DEVICES=1 nohup vllm serve checkpoint-24280 --trust-remote-code --port 8013 --dtype bfloat16 --gpu-memory-utilization 0.8 --max-num-batched-tokens 32768 --max-num-seqs 550 --max-model-len 4096 > log_v10 2>&1 &
```
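For context, here is a minimal client-side timing sketch against the server started above. The port and model name `checkpoint-24280` follow from the command; the prompt and image URL are placeholder assumptions:

```python
# Minimal latency probe against the vLLM OpenAI-compatible server launched above.
# Assumes the server is listening on localhost:8013; the prompt and image URL
# below are placeholders, not values from the original report.
import time
import requests

URL = "http://localhost:8013/v1/chat/completions"

payload = {
    "model": "checkpoint-24280",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/test.jpg"}},
                {"type": "text", "text": "Describe this image."},
            ],
        }
    ],
    "max_tokens": 512,
}

start = time.perf_counter()
resp = requests.post(URL, json=payload, timeout=300)
elapsed = time.perf_counter() - start
resp.raise_for_status()

usage = resp.json()["usage"]
print(f"latency: {elapsed:.2f}s, "
      f"prompt tokens: {usage['prompt_tokens']}, "
      f"completion tokens: {usage['completion_tokens']}")
```

Running the same probe against both servers on identical inputs separates per-request latency differences from differences in generated length.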
Environment
vllm 0.8.5.post1
llama_factory 0.9.3.dev0
torch 2.6.0
cuda 12.1
Error traceback
You can check whether our model's outputs are longer; that can also significantly affect inference efficiency.
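One way to check this is to compare the average completion length the two models produce on identical prompts, using the `usage` field of the OpenAI-compatible API. A minimal sketch, assuming the qwen2-vl-3b server listens on port 8014 and using placeholder prompts:

```python
# Compare average completion length of two served models on the same prompts.
# Ports, the second model name, and the prompts below are assumptions.
import requests

def avg_completion_tokens(url: str, model: str, prompts: list[str]) -> float:
    totals = []
    for text in prompts:
        resp = requests.post(url, json={
            "model": model,
            "messages": [{"role": "user", "content": text}],
            "max_tokens": 1024,
        }, timeout=300)
        resp.raise_for_status()
        totals.append(resp.json()["usage"]["completion_tokens"])
    return sum(totals) / len(totals)

prompts = ["Describe the weather today.", "Summarize this sentence."]
print("internvl3-1b:", avg_completion_tokens(
    "http://localhost:8013/v1/chat/completions", "checkpoint-24280", prompts))
print("qwen2-vl-3b:", avg_completion_tokens(
    "http://localhost:8014/v1/chat/completions", "qwen2-vl-3b", prompts))
```

If the internvl3-1b completions are consistently longer, the 1.5-2x wall-clock gap may largely reflect the extra decoded tokens rather than slower per-token throughput.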