[Bug] Multimodal models have a much higher TTFT on the pytorch backend than on the turbomind backend
Checklist
- [x] 1. I have searched related issues but cannot get the expected help.
- [x] 2. The bug has not been fixed in the latest version.
- [x] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
Describe the bug
For multimodal models, the TTFT (time to first token) on the pytorch backend is much higher than on the turbomind backend.
On a single NVIDIA RTX A6000, I launched Qwen2.5-VL-3B-Instruct and Qwen3-VL-4B-Instruct with a command like:
```bash
lmdeploy serve api_server /data/models/Qwen3-VL-4B-Instruct --tp 1 --model-name qwen3vl --max_batch_size 8 --session_len 25600 --cache-max-entry-count 0.6 --backend pytorch
```
I then ran image+text tests (single concurrent request, stream=True):

| Model | Backend | Avg. TTFT (s) | Avg. TPS (tokens/s) |
| --- | --- | --- | --- |
| Qwen3-VL-4B | lmdeploy pytorch | 4.147556 | 665.927719 |
| Qwen2.5-VL-3B | lmdeploy pytorch | 1.761191 | 624.012348 |
| Qwen2.5-VL-3B | lmdeploy turbomind (`--backend turbomind`) | 0.165004 | 90.837745 |
| Qwen2.5-VL-3B | vLLM | 0.063613 | 85.994672 |
| Qwen3-VL-4B | vLLM | 0.083626 | 59.783212 |

The vLLM results were measured on the same GPU with deployment parameters kept as close as possible.
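For reference, a minimal sketch of the kind of client used for these measurements (the exact TTFT/TPS definitions behind the numbers above may differ). It assumes the server launched above at lmdeploy's default port 23333, the served model name `qwen3vl` from the launch command, and a hypothetical local image `test.jpg`:

```python
import base64
import time

from openai import OpenAI  # pip install openai

# lmdeploy's api_server listens on port 23333 by default; adjust if needed.
client = OpenAI(base_url="http://localhost:23333/v1", api_key="none")

# Hypothetical local test image, sent inline as a base64 data URL.
with open("test.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

start = time.perf_counter()
first_token_at = None
n_chunks = 0

# Single streaming request; TTFT = delay until the first content chunk.
stream = client.chat.completions.create(
    model="qwen3vl",  # matches --model-name in the launch command
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_at is None:
            first_token_at = time.perf_counter()
        n_chunks += 1  # chunk count only approximates the token count

end = time.perf_counter()
if first_token_at is not None:
    print(f"TTFT: {first_token_at - start:.6f} s")
    decode_time = end - first_token_at
    if decode_time > 0:
        print(f"decode TPS (approx): {n_chunks / decode_time:.6f} tokens/s")
```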
With the pytorch backend, TTFT is consistently long. For scenarios with strict first-token latency requirements (e.g., interactive chat), and especially for models that can only be served with the pytorch backend, this noticeably hurts the user experience.
I hope the pytorch backend can reach TTFT/TPS on par with the turbomind backend (or that a parameter can be exposed to tune this). Thanks.
Reproduction
Install from the latest lmdeploy source:
```bash
git clone https://github.com/InternLM/lmdeploy.git
cd lmdeploy
pip install -r requirements/build.txt
pip install -e . -v
```
Environment
```
sys.platform: linux
Python: 3.12.12 (main, Oct 10 2025, 08:52:57) [GCC 11.4.0]
CUDA available: True
MUSA available: False
GPU 0,1: NVIDIA RTX A6000
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.8, V12.8.93
GCC: x86_64-linux-gnu-gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
PyTorch: 2.8.0+cu128
PyTorch compiling details: PyTorch built with:
- GCC 13.3
- C++ Version: 201703
- OpenMP 201511 (a.k.a. OpenMP 4.5)
- LAPACK is enabled (usually provided by MKL)
- NNPACK is enabled
- CPU capability usage: AVX512
- CUDA Runtime 12.8
- CuDNN 91.0.2 (built against CUDA 12.9)
- Built with CuDNN 90.8
- Magma 2.6.1
TorchVision: 0.23.0+cu128
LMDeploy: 0.10.2+
transformers: 4.57.1
fastapi: 0.121.1
pydantic: 2.12.4
triton: 3.4.0
```
Error traceback