[Bug] vLLM inference with the internvl3-1b model is very slow
Checklist
- [x] 1. I have searched related issues but cannot get the expected help.
- [x] 2. The bug has not been fixed in the latest version.
- [x] 3. Please note that if the bug-related issue you submit lacks the corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, which reduces the likelihood of receiving feedback.
Describe the bug
When serving the internvl3-1b model with vLLM, inference is much slower than serving the qwen2-vl-3b model with the same data, the same `vllm serve` launch command, and the same environment: the former takes roughly 1.5-2x as long as the latter. What could be causing this? Thanks.
Reproduction
Launch command:

```shell
CUDA_VISIBLE_DEVICES=1 nohup vllm serve checkpoint-24280 --trust-remote-code --port 8013 --dtype bfloat16 --gpu-memory-utilization 0.8 --max-num-batched-tokens 32768 --max-num-seqs 550 --max-model-len 4096 > log_v10 2>&1 &
```
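For context, here is a minimal client-side timing sketch against the server started above. The port and model name `checkpoint-24280` follow from the command; the prompt and image URL are placeholder assumptions:

```python
# Minimal latency probe against the vLLM OpenAI-compatible server launched above.
# Assumes the server is listening on localhost:8013; the prompt and image URL
# below are placeholders, not values from the original report.
import time
import requests

URL = "http://localhost:8013/v1/chat/completions"

payload = {
    "model": "checkpoint-24280",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/test.jpg"}},
                {"type": "text", "text": "Describe this image."},
            ],
        }
    ],
    "max_tokens": 512,
}

start = time.perf_counter()
resp = requests.post(URL, json=payload, timeout=300)
elapsed = time.perf_counter() - start
resp.raise_for_status()

usage = resp.json()["usage"]
print(f"latency: {elapsed:.2f}s, "
      f"prompt tokens: {usage['prompt_tokens']}, "
      f"completion tokens: {usage['completion_tokens']}")
```

Running the same probe against both servers on identical inputs separates per-request latency differences from differences in generated length.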
Environment
vllm 0.8.5.post1
llama_factory 0.9.3.dev0
torch 2.6.0
cuda 12.1
Error traceback
You can check whether our model's outputs are longer; that can also significantly affect inference efficiency.
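One way to check this is to compare the average completion length the two models produce on identical prompts, using the `usage` field of the OpenAI-compatible API. A minimal sketch, assuming the qwen2-vl-3b server listens on port 8014 and using placeholder prompts:

```python
# Compare average completion length of two served models on the same prompts.
# Ports, the second model name, and the prompts below are assumptions.
import requests

def avg_completion_tokens(url: str, model: str, prompts: list[str]) -> float:
    totals = []
    for text in prompts:
        resp = requests.post(url, json={
            "model": model,
            "messages": [{"role": "user", "content": text}],
            "max_tokens": 1024,
        }, timeout=300)
        resp.raise_for_status()
        totals.append(resp.json()["usage"]["completion_tokens"])
    return sum(totals) / len(totals)

prompts = ["Describe the weather today.", "Summarize this sentence."]
print("internvl3-1b:", avg_completion_tokens(
    "http://localhost:8013/v1/chat/completions", "checkpoint-24280", prompts))
print("qwen2-vl-3b:", avg_completion_tokens(
    "http://localhost:8014/v1/chat/completions", "qwen2-vl-3b", prompts))
```

If the internvl3-1b completions are consistently longer, the 1.5-2x wall-clock gap may largely reflect the extra decoded tokens rather than slower per-token throughput.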