[PD] Support CPU fallback for mooncake
Motivation
Mooncake is now the default transfer engine supporting PD (prefill-decode) disaggregation in SGLang. A current limitation is that the user's GPU must support GDR (GPUDirect RDMA). While GDR provides better performance, this requirement restricts users with older hardware who want to try the latest SGLang features. To address this, this PR introduces a CPU fallback path that enables Mooncake-based KVCache transfer when GDR is unavailable.
Typically, one might manually allocate CPU buffers via CUDA for GPU memory copying, but Mooncake already anticipates this need with a dedicated API (`allocate_managed_buffer`). This PR leverages that API to implement the fallback. Importantly, the enhancement is fully transparent to users.
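For reference, here is a rough sketch of the fallback idea. Only `allocate_managed_buffer` is taken from this PR's description; the other binding names (`TransferEngine`, `write_bytes_to_buffer`) and the staging logic are illustrative assumptions, not the actual implementation.

```python
# Sketch only: stage a GPU KV-cache block through a Mooncake-managed CPU
# buffer when GPUDirect RDMA is unavailable. Except for
# allocate_managed_buffer (named in this PR), the binding calls below are
# assumptions and may not match the real Mooncake Python API exactly.
import torch
from mooncake.engine import TransferEngine  # assumed import path

engine = TransferEngine()
# engine.initialize(...)  # normal PD-disaggregation setup elided

def stage_kv_block_on_cpu(kv_block: torch.Tensor) -> int:
    """Copy a GPU KV-cache block into a CPU buffer managed by Mooncake."""
    nbytes = kv_block.numel() * kv_block.element_size()
    # Let Mooncake allocate and register the staging buffer instead of
    # hand-rolling cudaHostAlloc / pinned-memory management ourselves.
    buf_addr = engine.allocate_managed_buffer(nbytes)  # API named in this PR
    # Device -> host copy; a real implementation would copy directly into
    # the managed buffer instead of materializing an intermediate tensor.
    host_bytes = kv_block.detach().to("cpu").contiguous().numpy().tobytes()
    engine.write_bytes_to_buffer(buf_addr, host_bytes, nbytes)  # assumed helper
    # buf_addr is then handed to the regular Mooncake transfer path, which
    # moves the bytes over RDMA/TCP without requiring GDR.
    return buf_addr
```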
Cc @stmatengss @ShangmingCai
Checklist
- [ ] Format your code according to the Code Formatting with Pre-Commit.
- [ ] Add unit tests as outlined in the Running Unit Tests.
- [ ] Update documentation / docstrings / example tutorials as needed, according to Writing Documentation.
- [ ] Provide throughput / latency benchmark results and accuracy evaluation results as needed, according to Benchmark and Profiling and Accuracy Results.
- [ ] For reviewers: If you haven't made any contributions to this PR and are only assisting with merging the main branch, please remove yourself as a co-author when merging the PR.
- [ ] Please feel free to join our Slack channel at https://slack.sglang.ai to discuss your PR.
@hnyls2002 Hi, the Mooncake team has implemented a new data transfer path for PD disaggregation without GDR support. Please help us review it. Thanks!
@XucSh Please resolve the conflicts
> @XucSh Please resolve the conflicts
Working on it.
@hnyls2002 PTAL
Do we still need RDMA devices on the same node to try out this PR? Do you have any suggestions if we don't have an RDMA device, but have one server with several cards using a DMA device for PCIe-to-PCIe copy?
> Do we still need RDMA devices on the same node to try out this PR? Do you have any suggestions if we don't have an RDMA device, but have one server with several cards using a DMA device for PCIe-to-PCIe copy?

Mooncake now supports a 'cudaCopy' operation for local transfer. You can use the latest version.
> > Do we still need RDMA devices on the same node to try out this PR? Do you have any suggestions if we don't have an RDMA device, but have one server with several cards using a DMA device for PCIe-to-PCIe copy?
>
> Mooncake now supports a 'cudaCopy' operation for local transfer. You can use the latest version.
Thanks a lot. Do you know which PR supports this, or is there any quick tryout documentation?
@XucSh Hi, I just tested nixl on a machine without GDR devices, and it works fine. The output looks like this:
[2025-07-26 08:00:37] 31.22.104.21 [26/Jul/2025:07:00:37 -0800] "PUT /route HTTP/1.1" 200 155 "-" "python-requests/2.32.4"
ucp_context.c:2339 UCX INFO Version 1.19.0 (loaded from /usr/local/lib/python3.10/dist-packages/.nixl.mesonpy.libs/plugins/../../nixl.libs/libucp-7a15df9c.so.0.0.0)
ucp_context.c:2339 UCX INFO Version 1.19.0 (loaded from /usr/local/lib/python3.10/dist-packages/.nixl.mesonpy.libs/plugins/../../nixl.libs/libucp-7a15df9c.so.0.0.0)
rdmacm_cm.c:950 UCX DIAG rdma_create_event_channel failed: No such device
ucp_worker.c:1587 UCX DIAG failed to open CM on component rdmacm with status Input/output error
parser.c:2368 UCX INFO UCX_* env variables: UCX_TLS=all UCX_LOG_LEVEL=info
ucp_worker.c:1903 UCX INFO ucp_context_0 self cfg#1 rma_am(tcp/lo) amo_am(tcp/lo) am(tcp/lo) ka(tcp/lo)
Backend UCX was instantiated
Initialized NIXL agent: 64798f51-03f7-476b-8d29-9a9c0eb9b122
rdmacm_cm.c:950 UCX DIAG rdma_create_event_channel failed: No such device
ucp_worker.c:1587 UCX DIAG failed to open CM on component rdmacm with status Input/output error
parser.c:2368 UCX INFO UCX_* env variables: UCX_TLS=all UCX_LOG_LEVEL=info
ucp_worker.c:1903 UCX INFO ucp_context_0 self cfg#1 rma_am(tcp/lo) amo_am(tcp/lo) am(tcp/lo) ka(tcp/lo)
Backend UCX was instantiated
Initialized NIXL agent: c43fadd2-dace-4482-a7c5-eb888e14b753
This means the CPU/TCP backend was dispatched by NIXL (UCX). I think we can simply use NIXL as the CPU fallback when there are no RDMA devices.
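For context, a launch sketch of what that fallback would look like, assuming the existing `--disaggregation-mode` and `--disaggregation-transfer-backend` server arguments; the model name, ports, and the `pip install nixl` step are illustrative assumptions, not verified instructions.

```bash
# Illustrative only: trying the NIXL backend on a machine without GDR devices.
pip install nixl  # NIXL ships pip wheels (installed under dist-packages in the log above)

# Prefill worker (model path and ports are placeholders)
python -m sglang.launch_server --model-path Qwen/Qwen2.5-7B-Instruct \
    --disaggregation-mode prefill --disaggregation-transfer-backend nixl --port 30000

# Decode worker
python -m sglang.launch_server --model-path Qwen/Qwen2.5-7B-Instruct \
    --disaggregation-mode decode --disaggregation-transfer-backend nixl --port 30001
```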
@hnyls2002 Hi, is nixl used by specifying `--disaggregation-transfer-backend nixl`? Also, do we need to pip install any dependency packages beforehand?
