[RFC]: Add Support for Prefill/Decode (P/D) Disaggregation in vLLM
Summary
To further optimize large-scale LLM inference workloads, we plan to introduce support for Prefill/Decode (P/D) disaggregation in vLLM. This separation allows the prefill and decode stages to run on different GPU nodes, unlocking better resource utilization and throughput. Some community users have already raised such requirements: https://github.com/vllm-project/aibrix/issues/958
Systems like Mooncake and Dynamo have already implemented P/D disaggregation. For our implementation, we are evaluating two possible backend approaches:
- KVCache Offloading: transfer KV states from prefill to decode via an external shared KV cache, which also enables cache reuse.
- Peer-to-Peer (P2P) KV Transfer: directly transfer KV states between GPU nodes for lower latency.
We need to carefully assess the trade-offs between these approaches (e.g., performance, deployment complexity, bandwidth usage) and choose the most suitable design for our use case.
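For illustration only, the two approaches map to different connector settings passed through vLLM's --kv-transfer-config flag. The sketch below is a rough example; the connector names are placeholders (matching the ones discussed in the comments), not a final design.

```yaml
# Illustrative sketch only; connector names are examples, not a final design.
# Approach 1: KV cache offloading via an external shared cache (also enables prefix reuse).
- --kv-transfer-config
- '{"kv_connector":"AIBrixOffloadingConnectorV1Type1", "kv_role":"kv_both"}'
# Approach 2: P2P KV transfer directly between prefill and decode nodes.
- --kv-transfer-config
- '{"kv_connector":"NixlConnector", "kv_role":"kv_both"}'
```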
Motivation
Optimize decode long-tail latencies and provide more reliable results for reasoning models.
Proposed Change
Phase I:
- As the vLLM control plane, AIBrix can directly orchestrate the existing P/D solutions, including the Mooncake and Dynamo ones.
Phase II:
- Prototype control-plane logic using the existing Mooncake and Dynamo connectors for KV coordination
- Add support for cloud-native orchestration of P/D disaggregated replicas:
  - Replica mode: xPyD mapping between prefill and decode nodes (see the illustrative sketch after this list)
  - Pooling mode: decouple and scale prefill/decode as independent pools
- Define a P/D-aware routing policy to ensure balanced traffic distribution across disaggregated nodes
Phase III:
- Data-plane innovation: evaluate whether we need to build our own specialized connector for the P/D case.
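As a purely illustrative sketch of the replica (xPyD) mode, the orchestrator could describe a prefill/decode group like the one below. All field names here are hypothetical placeholders, not the actual AIBrix orchestration API.

```yaml
# Hypothetical xPyD group layout; field names are made up for illustration only.
# Example: 1 prefill instance ("x") serving 3 decode replicas ("y").
pdGroup:
  prefill:
    replicas: 1            # scaled on prompt/prefill load
  decode:
    replicas: 3            # scaled on token-generation load
  kvTransfer:
    connector: NixlConnector   # P2P KV transfer from prefill to decode
  routing:
    policy: pd-aware           # the gateway picks a prefill/decode pair per request
```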
Alternatives Considered
No response
Crucial feature, eagerly awaited.
Will all 3 phases be completed, or only some of them?
@kdtmac Phase I and Phase II will be done in v0.4.0. Currently, we'd like to reuse NixlConnector directly and will probably work on MultiConnector to support AIBrixKVOffloading + NixlConnector together.
For communication efficiency, we have an internal incubating project that may deliver better performance; we might move to a new connector once that project is done.
Assigning myself to track the P/D Disaggregation Inference GW part.
This needs to consider compatibility with vLLM + LMCache + Mooncake and with SGLang + Mooncake. With multiple prefill (P) and multiple decode (D) instances, how can the P-to-D communication be improved? Also, for the same request, how can D be reused?
@ying2025 The router will select the right P/D pair for communication; we have algorithms that can do a better job than the default vLLM/SGLang setup. The framework support focuses more on engine orchestration: we hope to encapsulate the details inside the engine and make them transparent to the orchestrator.
> Also, for the same request, how can D be reused?

Could you elaborate more on the above? Reuse D for what purpose? Do you mean reusing the KV cache generated by D?
For the same request session (multi-turn conversation or continuous requests), maintain P/D stickiness by routing back to the original processing node/worker.
@ying2025 Sorry for the late reply; that makes sense. The prefix cache is considered in the routing decision to reduce TTFT.
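As a rough illustration of that routing behavior, the decision conceptually combines prefix-cache awareness with P/D session affinity. The field names below are hypothetical and not the actual gateway configuration.

```yaml
# Hypothetical routing sketch; field names are made up for illustration only.
routingPolicy:
  strategy: prefix-cache-aware   # prefer the prefill worker already holding the prompt prefix
  sessionAffinity: pd-pair       # keep follow-up turns of a session on the same prefill/decode pair
```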
There is some remaining work: autoscaling/PodGroup and some other community offerings will be supported in a future release.
The current P/D orchestration and routing should be good for the v0.4.0 release. We can close this issue now.
> @kdtmac Phase I and Phase II will be done in v0.4.0. Currently, we'd like to reuse NixlConnector directly and will probably work on MultiConnector to support AIBrixKVOffloading + NixlConnector together. For communication efficiency, we have an internal incubating project that may deliver better performance; we might move to a new connector once that project is done.
Hi @Jeffwan, could you please give me a sample of how to use MultiConnector to combine AIBrixKVOffloading and NixlConnector? Maybe like below?
```yaml
command:
  - python3
  - -m
  - vllm.entrypoints.openai.api_server
  ......
  - --kv-transfer-config
  - '{"kv_connector":"MultiConnector", "kv_role":"kv_both", "kv_connector_extra_config": {"connectors": [{ "kv_connector": "NixlConnector", "kv_role": "kv_both" },{ "kv_connector": "AIBrixOffloadingConnectorV1Type1", "kv_role": "kv_both" }]}}'
env:
  - name: VLLM_USE_V1
    value: "1"
  - name: AIBRIX_KV_CACHE_OL_L1_CACHE_ENABLED
    value: "1"
  # specify the eviction policy, default is S3FIFO
  - name: AIBRIX_KV_CACHE_OL_L1_CACHE_EVICTION_POLICY
    value: "S3FIFO"
  # specify the capacity of L1 cache, default is 10GB
  - name: AIBRIX_KV_CACHE_OL_L1_CACHE_CAPACITY_GB
    value: "80"
  - name: VLLM_RPC_TIMEOUT
    value: "1000000"
  - name: VLLM_SERVER_DEV_MODE
    value: "1"
  - name: VLLM_NIXL_SIDE_CHANNEL_PORT
    value: "5558"
  - name: VLLM_WORKER_MULTIPROC_METHOD
    value: spawn
  - name: VLLM_ENABLE_V1_MULTIPROCESSING
    value: "0"
```
And then it would transfer KV via NIXL first, and save to the offloading cache second?
Also, in the P/D example, kv_role does not fix producer and consumer; is that handled by StormService?
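For reference, my rough understanding is that vLLM's kv_role field also accepts explicit producer/consumer values, so in principle the split could be pinned per node as below; whether the connector or the orchestrator actually assigns the role is exactly my question. This is only my guess, not an AIBrix-recommended config.

```yaml
# Only my guess for comparison, not an AIBrix-recommended config.
# Prefill node: produces KV and pushes it to the decode side.
- --kv-transfer-config
- '{"kv_connector":"NixlConnector", "kv_role":"kv_producer"}'
# Decode node: consumes KV and continues token generation.
- --kv-transfer-config
- '{"kv_connector":"NixlConnector", "kv_role":"kv_consumer"}'
```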
thanks!
@ltm920716 MultiConnector is on the roadmap, but I personally haven't had a chance to work on it yet. We will come back to you soon on this support. BTW, is this an urgent issue for you? We will try to make it available asap. /cc @DwyaneShi
Thank you @Jeffwan, it's not an urgent issue. vLLM supports NIXL and LMCache (with both NIXL and offloading), and it looks like LMCache may be a better solution for KV cache? So I asked whether AIBrix has an equivalent solution to make a comparison.