[Bug]: Uneven Network Utilization in Multi-Node Prefill: Only Ray Head Node Transmits KV Cache

Bug Report

Environment

Model: DeepSeek-V3

Architecture: Disaggregated Prefill-Decode (1P1D)

Hardware (Per Node): 8x H200, 8x ConnectX-7 (Backend/Inter-GPU), 1x BlueField-3 (Frontend/KV Transfer).

Cluster Topology:

Prefill Instance (P): 2 Physical Nodes (configured as 1 Ray Cluster: 1 Head + 1 Worker).

Decode Instance (D): 1 Physical Node.

Prefill Parallelism Config: DP16, EP16 (spanning the 2 Prefill nodes).

Integration Reference: Followed the vLLM integration guide.

Configuration

We explicitly configured the BF3 interface in the Mooncake config to handle KV cache transmission between the Prefill and Decode instances, separating it from the backend CX7 traffic.
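
For reference, the per-node transfer config looked roughly like the sketch below. The field names follow the mooncake.json layout from the integration guide we used; the hostnames and the RDMA device name are placeholders, and the exact field names may differ between Mooncake versions.

```python
# Sketch of the per-node Mooncake transfer config (illustrative values only).
# "device_name" points the transfer engine at the BF3 RDMA device so KV-cache
# traffic stays off the backend CX7 links.
import json

config = {
    "local_hostname": "10.0.0.1",          # placeholder: this node's KV-transfer IP (BF3)
    "metadata_server": "10.0.0.100:2379",  # placeholder: metadata/etcd endpoint
    "protocol": "rdma",
    "device_name": "mlx5_bf0",             # placeholder: the BF3 RDMA device on this node
}

with open("mooncake.json", "w") as f:
    json.dump(config, f, indent=2)
```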

Observed Behavior

During the prefill phase, we monitored network traffic on the specified BF3 interfaces. We observed that only the BF3 interface on the Ray Head node is transmitting KV cache traffic to the Decode instance.

The BF3 interface on the non-head (worker) node carries no Mooncake KV-transfer traffic, even though the KV cache is distributed across the GPUs of both nodes under the DP parallelism strategy.
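
For context, this is roughly how we checked which NIC was carrying the traffic on each prefill node. The device names are placeholders for the local BF3 and CX7 RDMA devices (the real mapping can be found with ibdev2netdev).

```python
# Poll the RDMA TX byte counters of a few local devices and print the delta,
# to see which NIC is actually moving the KV cache during prefill.
# Device names are placeholders for the node's BF3 / CX7 devices.
import time
from pathlib import Path

DEVICES = ["mlx5_bf0", "mlx5_0"]  # placeholder: BF3 (KV transfer) vs. one CX7

def xmit_bytes(dev: str) -> int:
    # port_xmit_data is reported in units of 4 octets, hence the * 4.
    counter = Path(f"/sys/class/infiniband/{dev}/ports/1/counters/port_xmit_data")
    return int(counter.read_text()) * 4

before = {d: xmit_bytes(d) for d in DEVICES}
time.sleep(10)
for d in DEVICES:
    delta = xmit_bytes(d) - before[d]
    print(f"{d}: {delta / 1e9:.2f} GB transmitted in 10s")
```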

Questions

Is this the expected behavior? Does the current implementation require the Ray Head node to aggregate the KV cache from worker nodes before transmission (which would introduce a significant bottleneck), or is it designed to support distributed P2P transfer from all participating nodes?

Configuration Check: Are there specific flags or configurations required to enable non-head Ray workers to establish direct transport channels for KV cache transmission?

Before submitting...

  • [ ] Ensure you searched for relevant issues and read the [documentation]

JayFzh · Nov 26 '25

Hello @JayFzh, could you share how you launched the multi-node, multi-GPU service?

txh1873749380 · Nov 26 '25

Hi @JayFzh, I'm not very familiar with Ray, but the Mooncake transfer engine doesn't care whether vLLM uses MP or Ray, and it supports P2P transfer between any nodes. I recommend confirming with the vLLM community whether the current implementation requires the Ray head node to aggregate KV caches from worker nodes.
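
Conceptually, the expected pattern is that every prefill rank pushes its own KV shard directly to the decode node. The sketch below is illustrative only; the class and method names are placeholders, not the actual Mooncake Python API.

```python
# Illustration of per-rank P2P KV transfer (names are placeholders, not the
# real Mooncake API). The point: each prefill rank owns its own engine handle
# and writes its local KV shard straight to the decode node, with no hop
# through the Ray head node, so worker-node NICs should also show traffic.
from dataclasses import dataclass

@dataclass
class KvShard:
    local_addr: int   # registered local buffer holding this rank's KV cache
    remote_addr: int  # matching registered buffer on the decode node
    length: int       # bytes to transfer

def push_kv_shard(engine, decode_endpoint: str, shard: KvShard) -> None:
    # Called independently on every prefill rank; the engine uses the RDMA
    # device configured on that rank's own node (the BF3 in this setup).
    engine.transfer_write(decode_endpoint, shard.local_addr,
                          shard.remote_addr, shard.length)
```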

staryxchen · Nov 26 '25

Due to the save_only_first_rank optimization in LMCache, the cache is only saved on the first TP rank.

SpecterCipher · Dec 03 '25