[PD] Support CPU fallback for mooncake
Motivation
Mooncake is now the default transfer engine supporting PD (prefill-decode) disaggregation in SGLang. A current limitation is that the user's GPU must support GDR (GPUDirect RDMA). While GDR provides better performance, this requirement restricts users with older hardware who want to try the latest SGLang features. To address this, this PR introduces a CPU fallback path that enables Mooncake-based KVCache transfer when GDR is unavailable.
Typically, one might manually allocate CPU buffers via CUDA for GPU memory copying, but Mooncake already anticipates this need with a dedicated API (`allocate_managed_buffer`). This PR leverages that API to implement the fallback. Importantly, the enhancement is fully transparent to users.
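For reference, here is a rough sketch of the fallback idea. Only `allocate_managed_buffer` is taken from this PR's description; the other binding names (`TransferEngine`, `write_bytes_to_buffer`) and the staging logic are illustrative assumptions, not the actual implementation.

```python
# Sketch only: stage a GPU KV-cache block through a Mooncake-managed CPU
# buffer when GPUDirect RDMA is unavailable. Except for
# allocate_managed_buffer (named in this PR), the binding calls below are
# assumptions and may not match the real Mooncake Python API exactly.
import torch
from mooncake.engine import TransferEngine  # assumed import path

engine = TransferEngine()
# engine.initialize(...)  # normal PD-disaggregation setup elided

def stage_kv_block_on_cpu(kv_block: torch.Tensor) -> int:
    """Copy a GPU KV-cache block into a CPU buffer managed by Mooncake."""
    nbytes = kv_block.numel() * kv_block.element_size()
    # Let Mooncake allocate and register the staging buffer instead of
    # hand-rolling cudaHostAlloc / pinned-memory management ourselves.
    buf_addr = engine.allocate_managed_buffer(nbytes)  # API named in this PR
    # Device -> host copy; a real implementation would copy directly into
    # the managed buffer instead of materializing an intermediate tensor.
    host_bytes = kv_block.detach().to("cpu").contiguous().numpy().tobytes()
    engine.write_bytes_to_buffer(buf_addr, host_bytes, nbytes)  # assumed helper
    # buf_addr is then handed to the regular Mooncake transfer path, which
    # moves the bytes over RDMA/TCP without requiring GDR.
    return buf_addr
```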
Cc @stmatengss @ShangmingCai
Checklist
- [ ] Format your code according to the Code Formatting with Pre-Commit.
- [ ] Add unit tests as outlined in the Running Unit Tests.
- [ ] Update documentation / docstrings / example tutorials as needed, according to Writing Documentation.
- [ ] Provide throughput / latency benchmark results and accuracy evaluation results as needed, according to Benchmark and Profiling and Accuracy Results.
- [ ] For reviewers: If you haven't made any contributions to this PR and are only assisting with merging the main branch, please remove yourself as a co-author when merging the PR.
- [ ] Please feel free to join our Slack channel at https://slack.sglang.ai to discuss your PR.
@hnyls2002 Hi, the Mooncake team has implemented a new data transfer path for PD disaggregation without GDR support. Please help us review it. Thanks!
@XucSh Please resolve the conflicts
> @XucSh Please resolve the conflicts
Working on it.
@hnyls2002 PTAL
Do we still need RDMA devices on the same node to try out this PR? Do you have any suggestions if we don't have an RDMA device, but have one server with several cards using a DMA device for PCIe-to-PCIe copy?
> Do we still need RDMA devices on the same node to try out this PR? Do you have any suggestions if we don't have an RDMA device, but have one server with several cards using a DMA device for PCIe-to-PCIe copy?

Mooncake now supports a 'cudaCopy' operation for local transfer. You can use the latest version.
> > Do we still need RDMA devices on the same node to try out this PR? Do you have any suggestions if we don't have an RDMA device, but have one server with several cards using a DMA device for PCIe-to-PCIe copy?
>
> Mooncake now supports a 'cudaCopy' operation for local transfer. You can use the latest version.
Thanks a lot. Do you know which PR supports this, or is there any quick tryout documentation?
@XucSh Hi, I just tested nixl on a machine without GDR devices, and it works fine. The output looks like this:
[2025-07-26 08:00:37] 31.22.104.21 [26/Jul/2025:07:00:37 -0800] "PUT /route HTTP/1.1" 200 155 "-" "python-requests/2.32.4"
ucp_context.c:2339 UCX INFO Version 1.19.0 (loaded from /usr/local/lib/python3.10/dist-packages/.nixl.mesonpy.libs/plugins/../../nixl.libs/libucp-7a15df9c.so.0.0.0)
ucp_context.c:2339 UCX INFO Version 1.19.0 (loaded from /usr/local/lib/python3.10/dist-packages/.nixl.mesonpy.libs/plugins/../../nixl.libs/libucp-7a15df9c.so.0.0.0)
rdmacm_cm.c:950 UCX DIAG rdma_create_event_channel failed: No such device
ucp_worker.c:1587 UCX DIAG failed to open CM on component rdmacm with status Input/output error
parser.c:2368 UCX INFO UCX_* env variables: UCX_TLS=all UCX_LOG_LEVEL=info
ucp_worker.c:1903 UCX INFO ucp_context_0 self cfg#1 rma_am(tcp/lo) amo_am(tcp/lo) am(tcp/lo) ka(tcp/lo)
Backend UCX was instantiated
Initialized NIXL agent: 64798f51-03f7-476b-8d29-9a9c0eb9b122
rdmacm_cm.c:950 UCX DIAG rdma_create_event_channel failed: No such device
ucp_worker.c:1587 UCX DIAG failed to open CM on component rdmacm with status Input/output error
parser.c:2368 UCX INFO UCX_* env variables: UCX_TLS=all UCX_LOG_LEVEL=info
ucp_worker.c:1903 UCX INFO ucp_context_0 self cfg#1 rma_am(tcp/lo) amo_am(tcp/lo) am(tcp/lo) ka(tcp/lo)
Backend UCX was instantiated
Initialized NIXL agent: c43fadd2-dace-4482-a7c5-eb888e14b753
This means the CPU/TCP backend was dispatched by NIXL (UCX). I think we can simply use NIXL as the CPU fallback when there are no RDMA devices.
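For context, a launch sketch of what that fallback would look like, assuming the existing `--disaggregation-mode` and `--disaggregation-transfer-backend` server arguments; the model name, ports, and the `pip install nixl` step are illustrative assumptions, not verified instructions.

```bash
# Illustrative only: trying the NIXL backend on a machine without GDR devices.
pip install nixl  # NIXL ships pip wheels (installed under dist-packages in the log above)

# Prefill worker (model path and ports are placeholders)
python -m sglang.launch_server --model-path Qwen/Qwen2.5-7B-Instruct \
    --disaggregation-mode prefill --disaggregation-transfer-backend nixl --port 30000

# Decode worker
python -m sglang.launch_server --model-path Qwen/Qwen2.5-7B-Instruct \
    --disaggregation-mode decode --disaggregation-transfer-backend nixl --port 30001
```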
@hnyls2002 Hi, is nixl used by specifying `--disaggregation-transfer-backend nixl`? Also, do we need to pip install any dependency packages beforehand?
