
[Roadmap] vLLM Roadmap Q4 2025

Open simon-mo opened this issue 1 month ago • 10 comments

This page is accessible via roadmap.vllm.ai

This is a living document! For each item here, we intend to link the RFC as well as the discussion channel in the vLLM Slack.


In Q3 2025, we fully removed the V0 code path and made vLLM excel at large-scale serving with mature wide EP (expert parallelism) and prefill disaggregation. This quarter, our goal is to continue driving down CPU overhead, enhance vLLM on frontier clusters, and strengthen our RL integrations.

We mark help-wanted items with 🙋 in areas where the committer group is seeking more dedicated contributions.

Engine Core

Large Scale Serving

Reinforcement Learning (#sig-post-training)

  • [ ] Full determinism and batch invariance
  • [ ] Add more testing cases for popular integrations
  • [ ] Custom checkpoint loader, custom model format
  • [ ] Simple data parallel router for scale out
  • [ ] 🙋Enhance weight loading speed for syncing and resharding
  • [ ] 🙋Study a way to enable multi-turn long horizon scheduling to avoid preemption
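To illustrate why "full determinism and batch invariance" above is a real engineering item rather than a flag flip: floating-point reductions are not associative, so the same row computed alone versus inside a larger batch may take a different reduction order and produce bitwise-different results. The following is a minimal NumPy sketch of that effect, not vLLM code; the reduction strategies are illustrative stand-ins for what GPU kernels do.

```python
import numpy as np

# Toy illustration (not vLLM code): the same vector summed with two
# mathematically equivalent reduction orders can differ bit-for-bit
# in float32, which is why batch-invariant kernels need deliberate work.
rng = np.random.default_rng(0)
row = rng.standard_normal(4096).astype(np.float32)

# "Batch of 1" style: sequential left-to-right accumulation.
seq_sum = np.float32(0.0)
for x in row:
    seq_sum += x

# "Large batch" kernels often use a pairwise/tree reduction instead.
def pairwise_sum(a):
    if len(a) == 1:
        return a[0]
    mid = len(a) // 2
    return pairwise_sum(a[:mid]) + pairwise_sum(a[mid:])

tree_sum = pairwise_sum(row)

# Mathematically equal, numerically close, but often not identical:
print(float(seq_sum), float(tree_sum))
```

The same sequence of tokens can therefore sample differently depending on which other requests it was batched with, which is exactly what batch invariance rules out.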

Performance and UX Enhancement

  • [ ] Continue to drive down startup time (#feat-startup-ux)
  • [ ] Refactor tool use parsing to leverage grammar structural tag (#feat-tool-calling)
  • [ ] Refactor CI (#sig-ci, #ci-sprint)
  • [ ] Turn on torch.compile fusion by default, with no extra flags needed in the default case (#sig-torch-compile)
  • [ ] Prefix caching for Hybrid models (https://github.com/vllm-project/vllm/issues/26201)
  • [ ] 🙋Model Bash: profile and optimize newer model architectures on different hardware (NVIDIA Hopper, Blackwell, AMD MI3xx) (#sig-model-bash)
    • DeepSeek V3.2
    • Qwen3MoE, Qwen3 VL, Qwen3 Next
    • gpt-oss
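
The prefix-caching item above rests on a simple idea: KV cache is managed in fixed-size blocks, and a block can be reused across requests when every token up to and including that block is identical. A common way to detect this is a chained per-block hash. The sketch below is a hypothetical toy (block size, names, and the `admit`/`cached_prefix_len` helpers are all illustrative, not vLLM's implementation):

```python
import hashlib

BLOCK = 4  # toy block size

def block_hashes(tokens):
    """One chained hash per full block: because each hash folds in all
    earlier blocks, requests sharing a prefix share leading hashes."""
    hashes, h = [], hashlib.sha256()
    for i in range(0, len(tokens) - len(tokens) % BLOCK, BLOCK):
        h.update(repr(tokens[i:i + BLOCK]).encode())
        hashes.append(h.copy().hexdigest())
    return hashes

cache = {}  # hash -> (stand-in for a cached KV block)

def admit(tokens):
    """Pretend we ran prefill and stored each block's KV under its hash."""
    for hh in block_hashes(tokens):
        cache.setdefault(hh, object())

def cached_prefix_len(tokens):
    """Number of leading tokens whose KV blocks are already cached."""
    n = 0
    for hh in block_hashes(tokens):
        if hh not in cache:
            break
        n += BLOCK
    return n

admit([1, 2, 3, 4, 5, 6, 7, 8])
print(cached_prefix_len([1, 2, 3, 4, 9, 9, 9, 9]))  # 4: only the first block is shared
```

Extending this to hybrid models (the linked issue) is harder because attention and state-space layers keep different per-block state, so "reusable block" has to be defined consistently across both layer types.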

If an item you want is not on the roadmap, your suggestions and contributions are strongly welcomed! Please feel free to comment in this thread, open a feature request, or create an RFC.

Historical Roadmap: #20336, #15735, #11862, #9006, #5805, #3861, #2681, #244

simon-mo avatar Oct 07 '25 19:10 simon-mo

An important feature from V0 that hasn't yet been implemented in V1 is concurrent partial prefills. Please consider prioritizing it 🙏 Tracking issue: https://github.com/vllm-project/vllm/issues/21674 Thanks!
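
For readers unfamiliar with the feature being requested: concurrent partial prefills means the scheduler splits each long prompt into chunks and interleaves chunks from multiple requests, instead of running one prompt's prefill to completion while others wait. A toy sketch of that scheduling order, with a made-up per-step token budget and no relation to the actual V0/V1 scheduler code:

```python
from collections import deque

CHUNK = 3  # hypothetical per-request token budget per scheduler step

def schedule(prompt_lens):
    """Return the order of (request_id, tokens_prefilled) work items
    when prefills are chunked and round-robined across requests."""
    remaining = deque(enumerate(prompt_lens))
    order = []
    while remaining:
        req, left = remaining.popleft()
        step = min(CHUNK, left)
        order.append((req, step))
        if left - step > 0:
            remaining.append((req, left - step))  # re-queue the remainder
    return order

# Two 5-token prompts: their chunks interleave instead of running back-to-back.
print(schedule([5, 5]))  # [(0, 3), (1, 3), (0, 2), (1, 2)]
```

With serialized prefills the order would be [(0, 5), (1, 5)], so request 1's time-to-first-token absorbs all of request 0's prefill; interleaving bounds that head-of-line blocking.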

hibukipanim avatar Oct 08 '25 08:10 hibukipanim

Should we consider supporting E/P/D disaggregation for large-scale multimodal model serving? It's a beneficial feature for large-batch or encode-compute-heavy MLLM deployment scenarios. https://github.com/vllm-project/vllm/pull/25233

SamitHuang avatar Oct 10 '25 03:10 SamitHuang

I hope we could land speculative decoding with draft models this quarter. Posting here to raise awareness :)

https://github.com/vllm-project/vllm/pull/24322

tomasruizt avatar Oct 11 '25 10:10 tomasruizt

What's the plan for the proxy/router component used in disaggregated deployments?

jianzs avatar Oct 12 '25 10:10 jianzs

Re: "Investigate communication and compute pipelining". Looking forward to this update; perhaps using Flux to achieve it?

double-vin avatar Oct 23 '25 08:10 double-vin

Awesome.

Oliver66661 avatar Oct 31 '25 18:10 Oliver66661

What does "elastic expert" mean?

lw921014 avatar Nov 03 '25 08:11 lw921014

"elastic expert" is a specialist with deep knowledge of the Elastic Stack (Elasticsearch, Kibana, Logstash, Beats) who can design, deploy, manage, and troubleshoot solutions using these technologies.

Oliver66661 avatar Nov 03 '25 08:11 Oliver66661

Can we please get FlashAttention-3 support for RTX 6000 Blackwell GPUs? Really want to try native support for FP4 computation.

jman0815 avatar Nov 03 '25 08:11 jman0815

Official support for DGX Spark.

swtb3-ryder avatar Nov 10 '25 09:11 swtb3-ryder