
[RFC]: Add Support for Prefill/Decode (P/D) Disaggregation in vLLM


Summary

To further optimize large-scale LLM inference workloads, we plan to introduce support for Prefill/Decode (P/D) disaggregation in vLLM. This separation allows the prefill and decode stages to run on different GPU nodes, unlocking better resource utilization and throughput. Some community users have already raised this requirement: https://github.com/vllm-project/aibrix/issues/958

Systems like Mooncake and Dynamo have already implemented P/D disaggregation. For our implementation, we are evaluating two possible backend approaches:

  • KVCache Offloading: Transfer KV states from prefill to decode via an external shared KV cache, which also enables cache reuse.
  • Peer-to-Peer (P2P) KV Transfer: Directly transfer KV states between GPU nodes for lower latency.

We need to carefully assess the trade-offs between these approaches (e.g., performance, deployment complexity, bandwidth usage) and choose the most suitable design for our use case.
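
For illustration, the two approaches roughly correspond to different connectors behind vLLM's --kv-transfer-config flag. Below is a minimal sketch using the connector names that appear later in this thread; the exact connector names and flags depend on the vLLM/AIBrix version, so treat them as assumptions rather than a final design.

            # Option 1: peer-to-peer (P2P) KV transfer, e.g. NIXL-based, directly between prefill and decode nodes
            - --kv-transfer-config
            - '{"kv_connector":"NixlConnector","kv_role":"kv_both"}'

            # Option 2: KV cache offloading to an external shared cache, which also enables cross-request reuse
            - --kv-transfer-config
            - '{"kv_connector":"AIBrixOffloadingConnectorV1Type1","kv_role":"kv_both"}'

In short, the P2P path minimizes transfer latency between a specific prefill/decode pair, while the offloading path trades some latency for a shared cache that multiple replicas can reuse.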

Motivation

Optimize long-tail decode latencies and provide more reliable results for reasoning models.

Proposed Change

Phase I:

  • Acting as the control plane for vLLM, directly orchestrate the existing P/D solutions, including the Mooncake and Dynamo solutions.

Phase II:

  • Prototype control-plane logic using existing Mooncake and Dynamo connectors for KV coordination
  • Add support for cloud-native orchestration of P/D disaggregated replicas:
    • Replica mode: xPyD mapping between prefill and decode nodes
    • Pooling mode: Decouple and scale prefill/decode as independent pools (see the sketch after this list)
  • Define a P/D-aware routing policy to ensure balanced traffic distribution across disaggregated nodes
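
To make the pooling mode concrete, here is a hedged sketch of what the orchestration could look like with plain Kubernetes Deployments. All resource names, labels, images, and replica counts are placeholders, and the connector settings simply reuse the NixlConnector example from this thread; the final AIBrix API (e.g., StormService) may differ.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-prefill-pool           # placeholder name
spec:
  replicas: 2                      # the "x" in an xPyD mapping
  selector:
    matchLabels: {app: llm, role: prefill}
  template:
    metadata:
      labels: {app: llm, role: prefill}
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          command:
            - python3
            - -m
            - vllm.entrypoints.openai.api_server
            - --kv-transfer-config
            - '{"kv_connector":"NixlConnector","kv_role":"kv_both"}'
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-decode-pool            # placeholder name
spec:
  replicas: 4                      # the "y" in an xPyD mapping
  selector:
    matchLabels: {app: llm, role: decode}
  template:
    metadata:
      labels: {app: llm, role: decode}
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          command:
            - python3
            - -m
            - vllm.entrypoints.openai.api_server
            - --kv-transfer-config
            - '{"kv_connector":"NixlConnector","kv_role":"kv_both"}'

The P/D-aware routing policy from the last item would then pick one pod from each pool per request and keep session traffic sticky to the same pair.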

Phase III:

  • Data-plane innovation: evaluate whether we need to build our own dedicated connector for the P/D case.

Alternatives Considered

No response

Jeffwan avatar May 23 '25 23:05 Jeffwan

Crucial feature, eagerly awaited.

libin817927 avatar May 27 '25 03:05 libin817927

Will all 3 phases be completed, or only some of them?

kdtmac avatar Jun 11 '25 02:06 kdtmac

@kdtmac Phase I and Phase II will be done in v0.4.0. Currently, we'd like to reuse NixlConnector directly and probably work on MultiConnector to support both AIBrixKVOffloading + NixlConnector.

For communication efficiency, we have an internal incubating project that may deliver better performance; we might move to a new connector once that project is done.

Jeffwan avatar Jun 19 '25 09:06 Jeffwan

Assigning myself to track the P/D disaggregation Inference GW part.

Xunzhuo avatar Jun 23 '25 03:06 Xunzhuo

This needs to consider compatibility with vLLM + LMCache + Mooncake and SGLang + Mooncake. With multiple P and multiple D instances, how do we improve P-to-D communication? And for repeated requests from the same session, how can the same D instance be reused?

ying2025 avatar Jun 24 '25 10:06 ying2025

@ying2025 The router will select the right P/D pair for communication; we have some algorithms that can do a better job than the default vLLM/SGLang setting. The framework support focuses more on engine orchestration; we hope to encapsulate the details inside the engine and make them transparent to the orchestrator.

> And for repeated requests from the same session, how can the same D instance be reused?

Could you elaborate more on the above? Reuse D for what purpose? Do you mean reuse the KV cache generated from D?

Jeffwan avatar Jun 24 '25 11:06 Jeffwan

> Could you elaborate more on the above? Reuse D for what purpose? Do you mean reuse the KV cache generated from D?

For the same request session (multi-turn conversation or continuous requests), maintain P/D stickiness by routing back to the original processing node/worker.

ying2025 avatar Jun 27 '25 06:06 ying2025

@ying2025 Sorry for the late reply; that makes sense. The prefix cache is considered in the routing decision to reduce TTFT.

Jeffwan avatar Aug 01 '25 17:08 Jeffwan

There is some remaining work; autoscaling/PodGroup and some other community offerings will be supported in a future release.

The current P/D orchestration and routing should be good for the v0.4.0 release. We can close this issue now.

Jeffwan avatar Aug 01 '25 17:08 Jeffwan

> @kdtmac Phase I and Phase II will be done in v0.4.0. Currently, we'd like to reuse NixlConnector directly and probably work on MultiConnector to support both AIBrixKVOffloading + NixlConnector.

Hi @Jeffwan, could you please give me an example of how to use MultiConnector to combine AIBrixKVOffloading and NixlConnector? Maybe like below?

          command:
            - python3
            - -m
            - vllm.entrypoints.openai.api_server
            ......
            - --kv-transfer-config
            - '{"kv_connector":"MultiConnector", "kv_role":"kv_both", "kv_connector_extra_config": {"connectors": [{ "kv_connector": "NixlConnector", "kv_role": "kv_both" },{ "kv_connector": "AIBrixOffloadingConnectorV1Type1", "kv_role": "kv_both" }]}}'
          env:
            - name: VLLM_USE_V1
              value: "1"
            - name: AIBRIX_KV_CACHE_OL_L1_CACHE_ENABLED
              value: "1"
            # specify the eviction policy, default is S3FIFO
            - name: AIBRIX_KV_CACHE_OL_L1_CACHE_EVICTION_POLICY
              value: "S3FIFO"
            # specify the capacity of L1 cache, default is 10GB
            - name: AIBRIX_KV_CACHE_OL_L1_CACHE_CAPACITY_GB
              value: "80"
            - name: VLLM_RPC_TIMEOUT
              value: "1000000"
            - name: VLLM_SERVER_DEV_MODE
              value: "1"
            - name: VLLM_NIXL_SIDE_CHANNEL_PORT
              value: "5558"
            - name: VLLM_WORKER_MULTIPROC_METHOD
              value: spawn
            - name: VLLM_ENABLE_V1_MULTIPROCESSING
              value: "0"

And then it can transfer KV via NIXL first, and save to the offloading cache second?

And one more thing: in the P/D example, kv_role does not fix the producer and consumer roles; is that handled by StormService?

thanks!

ltm920716 avatar Aug 07 '25 10:08 ltm920716

@ltm920716 MultiConnector is on the roadmap, but I personally haven't had a chance to work on it yet. We will come back to you soon on this support. BTW, is this an urgent issue for you? We will try to make it available asap. /cc @DwyaneShi

Jeffwan avatar Aug 07 '25 20:08 Jeffwan

Thank you @Jeffwan, it's not an urgent issue. vLLM supports NIXL and LMCache (with NIXL and offloading), and it looks like LMCache is a better solution for KV cache? So I asked whether AIBrix has an equivalent solution to make a comparison.

ltm920716 avatar Aug 08 '25 01:08 ltm920716