[Roadmap] vLLM Roadmap Q2 2025
This page is accessible via roadmap.vllm.ai
This is a living document! For each item here, we intend to link the RFC as well as the discussion channel in the vLLM Slack.
Core Themes
Path to vLLM v1.0.0
We want to fully remove the V0 engine and clean up the codebase by dropping unpopular and unsupported features. The v1.0.0 release of vLLM will be performant and easy to maintain, as well as modular and extensible, with backward compatibility.
- [ ] V1 core feature set
  - [ ] Hybrid memory allocators
  - [ ] Jump decoding
  - [ ] Redesigned native support for pipeline parallelism
  - [ ] Redesigned spec decode
  - [ ] Redesigned sampler with modularity support
- [ ] Close the feature gaps and fully remove V0
  - [ ] Attention backends
  - [ ] Pooling models
  - [ ] Mamba/Hybrid models
  - [ ] (TBD) Encoder and encoder-decoder models
  - [ ] Hardware support
- [ ] Performance
  - [ ] Further lower scheduler overhead
  - [ ] Further enhance LoRA performance
  - [ ] API Server Scale-out
Cluster Scale Serving
As models grow in size, serving them with multi-node scale-out and disaggregated prefill and decode becomes the way to go (a toy sketch of this split follows the list below). We are fully committed to making vLLM the best engine for cluster-scale serving.
- [ ] Data Parallelism
  - [ ] Single node DP
  - [ ] API Server and Engine decoupling (any-to-any communication)
- [ ] Expert Parallelism
  - [ ] DeepEP or other library integrations
  - [ ] Transition from fused_moe to CUTLASS-based grouped GEMM
  - [ ] Online Reconfiguration (e.g. EPLB)
    - [ ] Online reconfiguration
    - [ ] Zero-overhead expert movement
- [ ] Prefill Decode Disaggregation
  - [ ] 1P1D in V1: both symmetric TP/PP and asymmetric TP/PP
  - [ ] XPYD
  - [ ] Data Parallel Compatibility
  - [ ] NIXL integration
  - [ ] Overhead Reduction & Performance Enhancements
- [ ] KV Cache Storage
  - [ ] Offload KV cache to CPU
  - [ ] Offload KV cache to disk
  - [ ] Integration with Mooncake and LMCache
- [ ] DeepSeek Specific Enhancements
  - [ ] MLA enhancements: TP, FlashAttention, FlashInfer, Blackwell kernels
  - [ ] MTP enhancements: V1 support, further lowering overhead
- [ ] Others
  - [ ] Investigate communication and compute pipelining
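For readers new to prefill/decode disaggregation, here is the toy sketch referenced above. All names (`PrefillWorker`, `DecodeWorker`, `KVCache`) are illustrative stand-ins rather than vLLM APIs: the prefill (P) instance runs the prompt once and hands the resulting KV cache to a decode (D) instance, which then generates tokens one at a time. In a real deployment the hand-off would go through a KV transfer layer such as the NIXL integration listed above.

```python
from dataclasses import dataclass, field

@dataclass
class KVCache:
    # One list of "blocks" per layer; a real engine stores paged GPU tensors instead.
    layers: dict = field(default_factory=dict)

class PrefillWorker:
    """Prefill (P) instance: processes the full prompt once and produces the KV cache."""
    def prefill(self, prompt_tokens):
        kv = KVCache()
        for layer in range(2):  # toy model with 2 layers
            # Stand-in for the attention computation: the "cache" is just the token ids here.
            kv.layers[layer] = list(prompt_tokens)
        return kv  # in practice this is transferred to the D instance via a KV connector

class DecodeWorker:
    """Decode (D) instance: consumes the transferred KV cache and generates tokens."""
    def decode(self, kv, max_new_tokens):
        generated = []
        for _ in range(max_new_tokens):
            # Stand-in for sampling: emit the current cache length as the next token.
            next_token = len(kv.layers[0])
            generated.append(next_token)
            for layer in kv.layers:
                kv.layers[layer].append(next_token)  # the cache grows by one token per step
        return generated

if __name__ == "__main__":
    prompt = [101, 7592, 2088]
    kv = PrefillWorker().prefill(prompt)   # runs on the P instance
    print(DecodeWorker().decode(kv, 4))    # runs on the D instance -> [3, 4, 5, 6]
```

The XPYD item above generalizes this 1P1D flow to pools of prefill and decode instances behind a router.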
vLLM for Production
vLLM is designed for production. We will continue to enhance stability and tune the systems around vLLM for optimal performance.
- [ ] Testing:
  - [ ] Comprehensive performance suite
  - [ ] Enhance accuracy testing coverage
  - [ ] Large-scale deployment + testing
  - [ ] Stress and longevity testing
- [ ] Offer tuned recipes and analysis for different model and hardware combinations.
- [ ] Multi-platform wheels and containers for production use cases.
Features
Models
- [ ] Scaling Omni Modality
- [ ] Long Context
- [ ] Stable OOT model registration interface
- [ ] Attention Sparsity: support sparse attention mechanisms for new models.
Use Case
- [ ] Enhance testing and performance related to RLHF workflow
- [ ] Add data parallel routing for large-scale batch inference
- [ ] Investigate batch size invariance and train/inference equivalence.
Hardware
- [ ] Stable Plugin Architecture for hardware platforms
- [ ] Blackwell Enhancements
- [ ] Full production readiness for AMD, TPU, and Neuron.
Optimizations
- [ ] EAGLE3
- [ ] FP4 enhancements
- [ ] FlexAttention
- [ ] Investigate: fbgemm, torchao, cuTile
- [ ] …
Community
- [ ] Blogs
- [ ] Case Studies
- [ ] Website
- [ ] Onboarding tasks and new contributors training program
vLLM Ecosystem
- Hardware Plugins
  - vllm-ascend: https://github.com/vllm-project/vllm-ascend/issues/448
- Production Stack: https://github.com/vllm-project/production-stack/issues/300
- LLM Compressor
- GuideLLM
- Dynamo
- Prioritized Support for RLHF Systems: veRL, OpenRLHF, TRL, OpenInstruct, Fairseq2, ...
If any item you want is not on the roadmap, your suggestions and contributions are very welcome! Please feel free to comment in this thread, open a feature request, or create an RFC.
Historical Roadmap: #11862, #9006, #5805, #3861, #2681, #244
Great, thanks for the work!
Following up, here is the Q2 roadmap of vllm-ascend: https://github.com/vllm-project/vllm-ascend/issues/448. Could you please add the link to the Hardware or Ecosystem section? Thanks!
For V1 you should also consider the security side. I'd guess a lot of people use vLLM via the Docker images, some of which are based on Ubuntu 20.04 and some on 22.04. The 0.8.2 image (and I haven't seen changes for 0.8.3) has nearly 50 CVEs marked HIGH as well as more than 2,500 marked MEDIUM.
Hi! With the switch to the new engine, I am very interested in how AMD ROCm support will fare, in particular Navi 3 (RDNA 3). I have been waiting for a bug fix for the codestral-mamba model for almost two months; the model itself was released back in 2024, but it seems no one is fixing the bug that was introduced.
https://github.com/vllm-project/vllm/issues/13678#issuecomment-2679181749
It would be great to see FP8 support for sm120 (Blackwell devices) now that CUTLASS has added support for sm120 and sm120a as of v3.9. This would mean that Blackwell users can best take advantage of native INT4 and INT8 support for extra speed. Currently there is only support for sm100 and prior.
Does "Redesigned spec decode" mean redesigning the implementation of v0? What are the shortcomings of v0's implementation?
Regarding "Further lower scheduler overhead": we tested V1 and found the results quite good. Where else can the scheduler be optimized?
Regarding "API Server Scale-out": I don't understand this item; could you explain it further?
It is quite cool.
What does "Attention Sparsity: support the sparse mechanism for new models." refer to? Will this be block sparse attention for V1? Any other details as to what is planned here?
Regarding "Investigate communication and compute pipelining": looking forward to this update. Perhaps it could be achieved using Flux?
Regarding pipeline parallelism, are there any plans to make multi-node offline inference possible? It'd be useful for testing a large model that requires multiple nodes and GPUs by running inference on a local file instead of having to use the AsyncLLMEngine.
I am very interested in the development of the MoE kernels, in particular the transition from fused_moe to CUTLASS-based grouped GEMM. Are there any benchmarks showing that the CUTLASS-based GEMM is better than the Triton fused MoE?
I'm trying to understand the relationship between "Prefill Decode Disaggregation" and "KV Cache Storage", considering that the KVConnector in V1 has already implemented unified load/store APIs that support both scenarios.
It would be great if partial chunked prefill support (https://github.com/vllm-project/vllm/pull/10235) in V1 were considered for the roadmap 🙏
I'm curious what the benefits of offloading KV cache to CPU RAM are.
Does it improve throughput? As far as I know, the GPU KV cache is what limits batch size. Even if you offload the KV cache to CPU RAM, you still need to load it back to GPU HBM for computation, so the time is spent in PCIe traffic. Also, that traffic usually cannot overlap with computation because it is too slow. Is that the case?
> you still need to load it back to GPU HBM for computation, so the time is spent in PCIe traffic
@sleepwalker2017 Offloading KV cache to CPU RAM is specifically for the prefix cache - i.e. KV cache reuse. In VRAM-limited scenarios with many matching prefixes (e.g. chat use-case), it can improve throughput significantly due to fewer prefills.
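To make the tradeoff concrete, here is a toy sketch of the idea; the names (`TieredPrefixCache`, `evict_to_cpu`, etc.) are hypothetical and not vLLM's actual prefix-cache code. A prefix that has been evicted from HBM but kept in host RAM costs one PCIe copy to reuse, which is typically far cheaper than recomputing the prefill for a long shared prefix.

```python
class TieredPrefixCache:
    """Illustrative two-tier prefix cache: GPU HBM first, then host RAM."""

    def __init__(self):
        self.gpu = {}   # prefix hash -> KV blocks resident in HBM
        self.cpu = {}   # prefix hash -> KV blocks offloaded to host RAM

    def lookup(self, prefix_hash):
        if prefix_hash in self.gpu:
            return self.gpu[prefix_hash], "gpu_hit"   # free reuse
        if prefix_hash in self.cpu:
            blocks = self.cpu.pop(prefix_hash)
            self.gpu[prefix_hash] = blocks            # one PCIe copy back to HBM
            return blocks, "cpu_hit"                  # cheaper than re-running the prefill
        return None, "miss"                           # must recompute the prefill

    def evict_to_cpu(self, prefix_hash):
        # On GPU memory pressure, keep the blocks in host RAM instead of dropping them.
        self.cpu[prefix_hash] = self.gpu.pop(prefix_hash)


cache = TieredPrefixCache()
cache.gpu["shared-system-prompt"] = ["kv-block-0", "kv-block-1"]
cache.evict_to_cpu("shared-system-prompt")
print(cache.lookup("shared-system-prompt"))  # (['kv-block-0', 'kv-block-1'], 'cpu_hit')
```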
It seems to have already been implemented:
- https://github.com/vllm-project/vllm/pull/17653
@robertgshaw2-redhat @njhill Apologies if I'm pinging the wrong people here; I'm curious whether there are any major blockers or considerations for merging @chunxiaozheng's implementation.
Also, it looks like there's another implementation here by @mengzhu28:
- https://github.com/vllm-project/vllm/pull/13377
Hello, this is probably a bit random, but how viable would it be to support a distributed inference mode like llama.cpp's? At a high level, higher layers like LocalAI use this to support a P2P cluster mode, but that effectively amounts to discovering gRPC servers and telling a master where to find them.
I am most interested in the model weight splitting, where it seems vLLM might have a better way of doing things.
Thanks.
msccl-allreduce leads to less communication overhead than nccl-allreduce. Are there any plans to incorporate this implementation? https://github.com/sgl-project/sglang/commit/8e3797be1ca9e3f0c68ff53c86e363bbfeffa268
The Q3 roadmap has been published: #20336
Hi! I would love to know which PR adds support for asymmetric TP/PP in disaggregated prefill!