
[Roadmap] vLLM Roadmap Q2 2025


This page is accessible via roadmap.vllm.ai

This is a living document! For each item here, we intend to link the RFC as well as the discussion channel in the vLLM Slack.


Core Themes

Path to vLLM v1.0.0
We want to fully remove the V0 engine and clean the codebase of unpopular and unsupported features. The v1.0.0 release of vLLM will be performant, easy to maintain, modular, and extensible, while preserving backward compatibility.

  • [ ] V1 core feature set
    • [ ] Hybrid memory allocators
    • [ ] Jump decoding
    • [ ] Redesigned native support for pipeline parallelism
    • [ ] Redesigned spec decode
    • [ ] Redesigned sampler with modularity support
  • [ ] Close the feature gaps and fully remove V0
    • [ ] Attention backends
    • [ ] Pooling models
    • [ ] Mamba/Hybrid models
    • [ ] (TBD) Encoder and encoder-decoder models
    • [ ] Hardware support
  • [ ] Performance
    • [ ] Further lower scheduler overhead
    • [ ] Further enhance LoRA performance
    • [ ] API Server Scale-out

Cluster Scale Serving
As models grow in size, serving them with multi-node scale-out and disaggregated prefill and decode becomes the way to go. We are fully committed to making vLLM the best engine for cluster-scale serving. (A conceptual sketch of the prefill/decode handoff follows the list below.)

  • [ ] Data Parallelism
    • [ ] Single node DP
    • [ ] API Server and Engine decoupling (any to any communication)
  • [ ] Expert Parallelism
    • [ ] DeepEP or other library integrations
    • [ ] Transition from fused_moe to CUTLASS-based grouped GEMM.
  • [ ] Online Reconfiguration (e.g. EPLB)
    • [ ] Online reconfiguration
    • [ ] Zero-overhead expert movement
  • [ ] Prefill Decode Disaggregation
    • [ ] 1P1D in V1: both symmetric TP/PP and asymmetric TP/PP
    • [ ] XPYD
    • [ ] Data Parallel Compatibility
    • [ ] NIXL integration
    • [ ] Overhead Reduction & Performance Enhancements
  • [ ] KV Cache Storage
    • [ ] Offload KV cache to CPU
    • [ ] Offload KV cache to disk
    • [ ] Integration with Mooncake and LMCache
  • [ ] DeepSeek Specific Enhancements
    • [ ] MLA enhancements: TP, FlashAttention, FlashInfer, Blackwell Kernels.
    • [ ] MTP enhancements: V1 support, further lower overhead.
  • [ ] Others
    • [ ] Investigate communication and compute pipelining
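
To make the prefill/decode disaggregation items above concrete, here is a purely conceptual sketch. This is not vLLM's API; all names below are illustrative. The idea is that a prefill worker computes the prompt's KV cache once, hands it off (e.g. via NIXL or a KV connector), and a decode worker continues generation from it.

```python
from dataclasses import dataclass, field

@dataclass
class KVHandoff:
    """Illustrative container for the state passed from prefill to decode."""
    request_id: str
    kv_blocks: list = field(default_factory=list)  # stand-in for real KV tensors
    first_token: str = ""

def prefill_worker(request_id: str, prompt: str) -> KVHandoff:
    # Run the prompt through the model once; keep the KV cache and the first token.
    kv_blocks = [f"kv({tok})" for tok in prompt.split()]
    return KVHandoff(request_id, kv_blocks, first_token="<t0>")

def decode_worker(handoff: KVHandoff, max_new_tokens: int = 4) -> list[str]:
    # Reuse the transferred KV cache; only newly generated tokens are appended.
    tokens = [handoff.first_token]
    for i in range(1, max_new_tokens):
        handoff.kv_blocks.append(f"kv(<t{i}>)")
        tokens.append(f"<t{i}>")
    return tokens

handoff = prefill_worker("req-1", "Explain disaggregated prefill and decode")
print(decode_worker(handoff))  # ['<t0>', '<t1>', '<t2>', '<t3>']
```

In an XpYd deployment, X such prefill workers and Y decode workers are scaled independently, with the handoff going over the interconnect instead of a function call.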

vLLM for Production
vLLM is designed for production. We will continue to enhance stability and tune the systems around vLLM for optimal performance.

  • [ ] Testing:
    • [ ] Comprehensive performance suite
    • [ ] Enhance accuracy testing coverage
    • [ ] Large-scale deployment + testing
    • [ ] Stress and longevity testing
  • [ ] Offer tuned recipes and analysis for different models and hardware combinations.
  • [ ] Multi-platform wheels and containers for production use cases.

Features

Models

  • [ ] Scaling Omni Modality
  • [ ] Long Context
  • [ ] Stable OOT model registration interface
  • [ ] Attention Sparsity: support the sparse mechanism for new models.

Use Case

  • [ ] Enhance testing and performance related to RLHF workflow
  • [ ] Add data parallel routing for large-scale batch inference
  • [ ] Investigate batch size invariance and train/inference equivalence.

Hardware

  • [ ] Stable Plugin Architecture for hardware platforms
  • [ ] Blackwell Enhancements
  • [ ] Full Production readiness for AMD, TPU, Neuron.

Optimizations

  • [ ] EAGLE3
  • [ ] FP4 enhancements
  • [ ] FlexAttention
  • [ ] Investigate: fbgemm, torchao, cuTile
  • [ ] …

Community

  • [ ] Blogs
  • [ ] Case Studies
  • [ ] Website
  • [ ] Onboarding tasks and new contributors training program

vLLM Ecosystem


If an item you want is not on the roadmap, your suggestions and contributions are strongly welcomed! Please feel free to comment in this thread, open a feature request, or create an RFC.

Historical Roadmap: #11862, #9006, #5805, #3861, #2681, #244

simon-mo avatar Mar 29 '25 00:03 simon-mo

Great! Thanks for the work.

Following up, here is the Q2 roadmap of vllm-ascend: https://github.com/vllm-project/vllm-ascend/issues/448. Could you please add the link to the Hardware or Ecosystem section? Thanks!

wangxiyuan avatar Mar 31 '25 12:03 wangxiyuan

For V1 you should also consider the security side. I guess a lot of people are using vLLM via the Docker images, some of which are based on Ubuntu 20.04 and some on 22.04. The 0.8.2 image (and I haven't seen changes for 0.8.3) has nearly 50 CVEs marked as HIGH as well as more than 2500 marked as MEDIUM.

manzke avatar Apr 10 '25 14:04 manzke

Hi! With the switch to a new engine, I am very interested in how AMD ROCm support will fare, in particular Navi 3 (RDNA 3). I have been waiting almost two months for a bug fix for the codestral-mamba model. The model itself was released back in 2024, but it seems that no one is fixing the bug that was introduced.

https://github.com/vllm-project/vllm/issues/13678#issuecomment-2679181749

hackey avatar Apr 12 '25 22:04 hackey

It would be great to see FP8 support for sm120 (Blackwell devices) now that CUTLASS has added support for sm120 and sm120a as of v3.9. This would mean that Blackwell users can take full advantage of native int4 and int8 support for extra speed. Currently there is only support for sm100 and prior.

MrVolts avatar Apr 13 '25 12:04 MrVolts

Does "Redesigned spec decode" mean redesigning the implementation of v0? What are the shortcomings of v0's implementation?

skylee-01 avatar Apr 14 '25 02:04 skylee-01

Regarding "Further lower scheduler overhead": we tested V1 and found the effect quite good. Where else can the scheduler be optimized?

skylee-01 avatar Apr 14 '25 02:04 skylee-01

Regarding "API Server Scale-out": I don't understand this; could you explain it further?

skylee-01 avatar Apr 14 '25 02:04 skylee-01

It is quite cool.

ANormalMan12 avatar Apr 18 '25 06:04 ANormalMan12

What does "Attention Sparsity: support the sparse mechanism for new models." refer to? Will this be block sparse attention for V1? Any other details as to what is planned here?

mklasby avatar Apr 24 '25 04:04 mklasby

Regarding "Investigate communication and compute pipelining": looking forward to this update. Perhaps Flux could be used to achieve it?

double-vin avatar May 07 '25 09:05 double-vin

Regarding pipeline parallelism, are there any plans to make multi-node offline inference possible? It would be useful for testing a large model that requires multiple nodes and GPUs by running inference on a local file instead of having to use the AsyncLLMEngine.

smartinezai avatar May 07 '25 14:05 smartinezai

I am very interested in the development of the MoE operations, so I am wondering about the transition from fused_moe to a CUTLASS-based grouped GEMM. Are there any benchmarks showing that the CUTLASS-based GEMM is better than the Triton fused MoE?
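
For context, here is a tiny conceptual sketch (plain PyTorch, not vLLM's kernels) of the computation pattern in question: the router assigns each token to an expert, and a naive path runs one small GEMM per expert, which is exactly the set of GEMMs a CUTLASS grouped GEMM batches into a single launch.

```python
import torch

num_experts, d_model, d_ff, num_tokens = 4, 64, 256, 32
x = torch.randn(num_tokens, d_model)                       # token activations
w = torch.randn(num_experts, d_model, d_ff)                # per-expert weights
expert_ids = torch.randint(0, num_experts, (num_tokens,))  # router assignment

# Naive per-expert loop: many small GEMMs with poor GPU utilization.
# A grouped GEMM performs all of these expert GEMMs in one kernel launch,
# which is the motivation for the roadmap item above.
out = torch.empty(num_tokens, d_ff)
for e in range(num_experts):
    mask = expert_ids == e
    out[mask] = x[mask] @ w[e]
```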

yongchaoding avatar May 14 '25 03:05 yongchaoding

I'm trying to understand the relationship between 'Prefill Decode Disaggregation' and 'KV Cache Storage', considering that the KVConnector in V1 has already implemented unified load/store APIs that support both scenarios.

simpx avatar May 22 '25 06:05 simpx

It would be great if partial chunked prefill support (https://github.com/vllm-project/vllm/pull/10235) in V1 were considered for the roadmap 🙏

hibukipanim avatar May 23 '25 10:05 hibukipanim

I'm curious what the benefits of offloading the KV cache to CPU RAM are.

Does it improve throughput? As far as I know, the GPU KV cache is what limits batch size. Even if you offload the KV cache to CPU RAM, you still need to load it back into GPU HBM to do the computation, so the time is spent on PCIe traffic. Also, that traffic usually cannot overlap with computation because it is too slow. Is that the case?

sleepwalker2017 avatar May 30 '25 16:05 sleepwalker2017

"you still need to load it back to gpu hbm to do computation. so the time is wasted in pcie traffic"

@sleepwalker2017 Offloading the KV cache to CPU RAM is specifically for the prefix cache, i.e. KV cache reuse. In VRAM-limited scenarios with many matching prefixes (e.g. the chat use case), it can improve throughput significantly due to fewer prefills.
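
As a minimal sketch of this kind of reuse (the model name is only an example; `enable_prefix_caching` is the standard engine argument, while offloading that cache to CPU RAM is what the roadmap item would add): two requests sharing a long prefix pay for that prefix's prefill only once.

```python
from vllm import LLM, SamplingParams

# Two prompts share a long prefix; with prefix caching the shared part is
# prefilled once and reused. CPU offload (the roadmap item) would let the
# cached prefix survive eviction when GPU KV-cache space runs out.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # illustrative model
    enable_prefix_caching=True,
)

shared_prefix = "You are a helpful assistant. " * 200  # long common prefix
prompts = [
    shared_prefix + "Summarize the history of GPUs.",
    shared_prefix + "Explain what a KV cache is.",
]
for out in llm.generate(prompts, SamplingParams(max_tokens=64)):
    print(out.outputs[0].text)
```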

It seems to have already been implemented:

  • https://github.com/vllm-project/vllm/pull/17653

@robertgshaw2-redhat @njhill Apologies if I'm pinging the wrong people here, but I'm curious whether there are any major blockers or considerations for merging @chunxiaozheng's implementation?

Also, looks like there's another implementation here by @mengzhu28:

  • https://github.com/vllm-project/vllm/pull/13377

josephrocca avatar Jun 15 '25 03:06 josephrocca

Hello, this is probably a bit random, but how viable would it be to support a distributed inference mode like llama.cpp's? At a high level, higher layers like LocalAI use it to support a P2P cluster mode, but that is effectively just discovering gRPC servers and telling a master where to find them.

I am most interested in the model weight splitting, where it seems vLLM might have a better way of doing things.
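
For reference, vLLM's weight splitting today is tensor parallelism within a node plus pipeline parallelism across nodes; a minimal sketch, where the model name and parallel sizes are only examples:

```python
from vllm import LLM, SamplingParams

# Shard the model weights across 2 GPUs on this node (tensor parallelism).
# pipeline_parallel_size would additionally split layers across nodes,
# typically on top of a Ray cluster.
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # illustrative model
    tensor_parallel_size=2,
    # pipeline_parallel_size=2,                 # example multi-node split
)
print(llm.generate(["Hello"], SamplingParams(max_tokens=16))[0].outputs[0].text)
```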

Thanks.

pcfreak30 avatar Jun 21 '25 15:06 pcfreak30

msccl-allreduce leads to less communication overhead than nccl-allreduce. Are there any plans to adopt this implementation? https://github.com/sgl-project/sglang/commit/8e3797be1ca9e3f0c68ff53c86e363bbfeffa268

charles9304 avatar Jun 23 '25 06:06 charles9304

The Q3 roadmap has been published: #20336

simon-mo avatar Jul 01 '25 21:07 simon-mo

Hi! I would love to know which PR supports asymmetric TP/PP in disaggregated prefill!

novahow avatar Jul 21 '25 15:07 novahow