
[Multimodal] Optimize Qwen2/2.5-VL startup time

Open WoosukKwon opened this issue 6 months ago • 4 comments

Essential Elements of an Effective PR Description Checklist

  • [ ] The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • [ ] The test plan, such as providing test command.
  • [ ] The test results, such as pasting the results comparison before and after, or e2e results
  • [ ] (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

Purpose

Currently, processing large dummy inputs accounts for about 40 seconds of startup time for Qwen2/2.5-VL (it happens twice, roughly 20 seconds each). This can be skipped by pre-computing the maximum token count per modality.
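A minimal sketch of the idea (not the actual vLLM implementation; all names and values below are illustrative): instead of building a maximum-size dummy input and running the full processor on it just to count placeholder tokens, the maximum token count per modality is looked up from a table computed once from the model config.

from typing import Mapping

def max_tokens_via_dummy_run(process_fn, make_dummy_fn, modality: str) -> int:
    # Slow path: construct the largest possible dummy input and run the
    # processor on it purely to measure how many tokens it produces.
    dummy = make_dummy_fn(modality)
    return len(process_fn(dummy))

# Fast path: a per-modality table computed ahead of time (e.g. from the
# model's max resolution / frame count), so startup never touches the processor.
PRECOMPUTED_MAX_TOKENS: Mapping[str, int] = {
    "image": 16384,  # illustrative value, not taken from the real model
    "video": 32768,  # illustrative value, not taken from the real model
}

def max_tokens_per_modality(modality: str) -> int:
    return PRECOMPUTED_MAX_TOKENS[modality]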

Test Plan

Test Result

(Optional) Documentation Update

WoosukKwon avatar Jun 17 '25 16:06 WoosukKwon

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run the fastcheck CI, which runs a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

github-actions[bot] avatar Jun 17 '25 16:06 github-actions[bot]

@DarkLight1337 Thanks for sharing it! In my experiment, this PR reduces the startup time of Qwen2.5-VL-3B from 120 secs to 55 secs. It definitely helps.

That said, I'm not sure whether the pre-computed values should depend on the limit_mm_per_prompt parameter.

WoosukKwon avatar Jun 17 '25 17:06 WoosukKwon

@DarkLight1337 @WoosukKwon Here's a short repro script - let me know if this is reasonable.

import time
from vllm import LLM

st = time.perf_counter()
# Measure end-to-end engine startup time (eager mode, so CUDA graph capture is excluded).
llm = LLM(model="Qwen/Qwen2.5-VL-3B-Instruct", enforce_eager=True)
print("Time taken", time.perf_counter() - st)

Results below are averaged over 10 rounds; profiling is done with dummy video input.

  • Main branch: 29.343433478847146
  • Woosuk's initial commit 3fe1893: 15.75505773164332
  • Updated version of this branch without hardcoded values: 15.77296781912446

Adding the constraint limit_mm_per_prompt={"video": 0} so that profiling is done with dummy image input instead (see the variant script after the results):

  • Main branch: 16.037972562015057
  • Woosuk's initial commit 3fe1893: 15.723176507279277
  • Updated version of this branch without hardcoded values: 15.553956482559443
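For reference, the image-only runs presumably use a variant of the repro script above where only the limit_mm_per_prompt argument changes:

import time
from vllm import LLM

st = time.perf_counter()
# Disallow video inputs so that profiling falls back to dummy image inputs only.
llm = LLM(
    model="Qwen/Qwen2.5-VL-3B-Instruct",
    enforce_eager=True,
    limit_mm_per_prompt={"video": 0},
)
print("Time taken", time.perf_counter() - st)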

I think this means there is something wrong with caching the processed video inputs? It probably also has something to do with serialization. Will do more digging to verify.
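One rough way to sanity-check the serialization hypothesis, independent of vLLM's internal cache implementation (the shape below is purely illustrative, not what the processor actually emits), is to time how long serializing a video-sized feature tensor takes:

import pickle
import time

import numpy as np

# Roughly video-shaped dummy features: frames x patches x hidden size.
features = np.random.rand(32, 1280, 1280).astype(np.float32)

st = time.perf_counter()
blob = pickle.dumps(features, protocol=pickle.HIGHEST_PROTOCOL)
print(f"serialize: {time.perf_counter() - st:.3f}s, {len(blob) / 1e6:.1f} MB")

st = time.perf_counter()
pickle.loads(blob)
print(f"deserialize: {time.perf_counter() - st:.3f}s")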

ywang96 avatar Jun 19 '25 08:06 ywang96

@ywang96 Thanks for the investigation. Didn't know it was caused by the video input. 🤔

WoosukKwon avatar Jun 19 '25 18:06 WoosukKwon