[Bug] Multimodal models have a much higher TTFT on the pytorch backend than on the turbomind backend
Checklist
- [x] 1. I have searched related issues but cannot get the expected help.
- [x] 2. The bug has not been fixed in the latest version.
- [x] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
Describe the bug
For multimodal models, the TTFT (time to first token) on the pytorch backend is much higher than on the turbomind backend.
On a single NVIDIA RTX A6000, I launched Qwen2.5-VL-3B-Instruct and Qwen3-VL-4B-Instruct with a command like:
```bash
lmdeploy serve api_server /data/models/Qwen3-VL-4B-Instruct --tp 1 --model-name qwen3vl --max_batch_size 8 --session_len 25600 --cache-max-entry-count 0.6 --backend pytorch
```
I then ran image+text tests (single concurrent request, stream=True):

| Model | Backend | Avg. TTFT (s) | Avg. TPS (tokens/s) |
| --- | --- | --- | --- |
| Qwen3-VL-4B | lmdeploy pytorch | 4.147556 | 665.927719 |
| Qwen2.5-VL-3B | lmdeploy pytorch | 1.761191 | 624.012348 |
| Qwen2.5-VL-3B | lmdeploy turbomind (`--backend turbomind`) | 0.165004 | 90.837745 |
| Qwen2.5-VL-3B | vLLM | 0.063613 | 85.994672 |
| Qwen3-VL-4B | vLLM | 0.083626 | 59.783212 |

The vLLM results were measured on the same GPU with deployment parameters kept as close as possible.
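For reference, a minimal sketch of the kind of client used for these measurements (the exact TTFT/TPS definitions behind the numbers above may differ). It assumes the server launched above at lmdeploy's default port 23333, the served model name `qwen3vl` from the launch command, and a hypothetical local image `test.jpg`:

```python
import base64
import time

from openai import OpenAI  # pip install openai

# lmdeploy's api_server listens on port 23333 by default; adjust if needed.
client = OpenAI(base_url="http://localhost:23333/v1", api_key="none")

# Hypothetical local test image, sent inline as a base64 data URL.
with open("test.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

start = time.perf_counter()
first_token_at = None
n_chunks = 0

# Single streaming request; TTFT = delay until the first content chunk.
stream = client.chat.completions.create(
    model="qwen3vl",  # matches --model-name in the launch command
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_at is None:
            first_token_at = time.perf_counter()
        n_chunks += 1  # chunk count only approximates the token count

end = time.perf_counter()
if first_token_at is not None:
    print(f"TTFT: {first_token_at - start:.6f} s")
    decode_time = end - first_token_at
    if decode_time > 0:
        print(f"decode TPS (approx): {n_chunks / decode_time:.6f} tokens/s")
```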
With the pytorch backend, TTFT is consistently long. For scenarios with strict first-token latency requirements (e.g., interactive chat), and especially for models that can only be served with the pytorch backend, this noticeably hurts the user experience.
I hope the pytorch backend can reach TTFT/TPS on par with the turbomind backend (or that a parameter can be exposed to tune this). Thanks.
Reproduction
Install from the latest lmdeploy source:
```bash
git clone https://github.com/InternLM/lmdeploy.git
cd lmdeploy
pip install -r requirements/build.txt
pip install -e . -v
```
Environment
```
sys.platform: linux
Python: 3.12.12 (main, Oct 10 2025, 08:52:57) [GCC 11.4.0]
CUDA available: True
MUSA available: False
GPU 0,1: NVIDIA RTX A6000
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.8, V12.8.93
GCC: x86_64-linux-gnu-gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
PyTorch: 2.8.0+cu128
PyTorch compiling details: PyTorch built with:
- GCC 13.3
- C++ Version: 201703
- OpenMP 201511 (a.k.a. OpenMP 4.5)
- LAPACK is enabled (usually provided by MKL)
- NNPACK is enabled
- CPU capability usage: AVX512
- CUDA Runtime 12.8
- CuDNN 91.0.2 (built against CUDA 12.9)
- Built with CuDNN 90.8
- Magma 2.6.1
TorchVision: 0.23.0+cu128
LMDeploy: 0.10.2+
transformers: 4.57.1
fastapi: 0.121.1
pydantic: 2.12.4
triton: 3.4.0
```
Error traceback