[Feature Request]: Ship CUDA-enabled builds (-DUSE_CUDA=ON) for correct GPU-NIC topology detection
Describe your feature request
We need to publish packages built with -DUSE_CUDA=ON to enable correct GPU-NIC topology detection, which can significantly impact both correctness and tail latency performance.
There are two possible approaches:
- Publish to PyPI: Since not all users need GPU-aware transport, we’d likely need to ship two variants—default (CPU-only) and CUDA-enabled (e.g., via a separate package or extra like mooncake[cuda]).
- Publish .whl files in GitHub Releases first: Upload CUDA-enabled wheels (e.g., mooncake_cu12.whl) to GitHub Releases and update the docs so users can optionally install via:
pip install https://github.com/xxx/Mooncake/releases/download/relxx/xxx_cu12.whl
I suggest going with option 2 first, as it avoids breaking existing installation workflows and gives us time to evaluate demand before committing to a PyPI strategy.
Before submitting a new issue...
- [ ] Make sure you already searched for relevant issues and read the documentation
I suggest option 2. I'm also refactoring TENT (Transfer Engine NT) to support dynamic loading.
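The rough idea behind dynamic loading is that a single wheel can try to load a CUDA-enabled transfer backend at runtime and fall back to a CPU-only one when it is not available. A minimal Python sketch of that pattern (the module names are hypothetical placeholders, not the actual TENT/Mooncake layout):

```python
import importlib

def load_transfer_backend():
    """Try the CUDA-enabled backend first, then fall back to CPU-only.

    The module names below are hypothetical placeholders, not the real
    TENT/Mooncake module layout.
    """
    for name in ("mooncake_transfer_cuda", "mooncake_transfer_cpu"):
        try:
            return importlib.import_module(name)
        except ImportError:
            continue
    raise RuntimeError("no transfer backend available")

# backend = load_transfer_backend()
```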
Hey guys, new to the community. After a non-comprehensive test on our cluster, I found that CUDA-enabled builds significantly improve SGLang's throughput and latency.
For now I build it myself and manually install it into SGLang, but it would be great to ship it in the releases.
Some details, in case anyone is looking for reference numbers:
- Without CUDA enabled, specifying 2x Mellanox (200 Gbps) NICs, we achieve about 40 Gbps throughput per node. SGLang-side metrics were not collected because we are on 0.5.2. The transfer latency is so high that we can't put this deployment into production.
  - There also seems to be an issue when specifying 8 NICs together with HiCache in SGLang, which causes XID 95 on our machines.
- With CUDA enabled, specifying 8x Mellanox NICs, we achieved about 100 Gbps throughput per node. Transfer latency is cut to roughly 0.5 s, which is acceptable for production.
Thanks for sharing this with the community!
Good news: starting from v0.3.7, the mooncake wheel package is released with CUDA enabled.
If you're using HiCache with Mooncake and running into long-tail latency issues, you could try this PR: https://github.com/sgl-project/sglang/pull/11028.
The real problem behind this performance issue is whether the transfer engine can pick exactly the right NIC for each transfer.
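To make that concrete: "the right NIC" here usually means the one closest to the GPU holding the buffer in the PCIe topology, which is what the CUDA-enabled build can resolve. A rough, self-contained illustration of the idea (this is not Mooncake's actual code; it just reads the PCIe hierarchy from sysfs on Linux and picks the NIC with the deepest shared ancestor):

```python
import os

def pcie_path(pci_addr: str) -> list[str]:
    """Return the chain of PCI addresses from the root port down to the device.

    pci_addr is a "domain:bus:device.function" string, e.g. "0000:3b:00.0".
    """
    # /sys/bus/pci/devices/<addr> is a symlink; its resolved path encodes the
    # PCIe hierarchy, e.g. /sys/devices/pci0000:3a/0000:3a:00.0/0000:3b:00.0
    real = os.path.realpath(f"/sys/bus/pci/devices/{pci_addr}")
    return [seg for seg in real.split("/") if ":" in seg and "." in seg]

def closest_nic(gpu_pci: str, nic_pcis: list[str]) -> str:
    """Pick the NIC that shares the longest common PCIe prefix with the GPU."""
    gpu_path = pcie_path(gpu_pci)

    def shared_depth(nic_pci: str) -> int:
        depth = 0
        for a, b in zip(gpu_path, pcie_path(nic_pci)):
            if a != b:
                break
            depth += 1
        return depth

    return max(nic_pcis, key=shared_depth)

# Hypothetical usage, with made-up PCI addresses:
# closest_nic("0000:3b:00.0", ["0000:3c:00.0", "0000:86:00.0"])
```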
@xiaguan Thank you for the PR! We will give it a try.
Another question: is the wheel release based on DMA-BUF GDR or on nvidia-peermem? I'm asking because we can run an nvidia-peermem-based wheel without modifying the cluster config, which avoids downtime in a production cluster.
If I understand it correctly, if we build with -DWITH_NVIDIA_PEERMEM=ON, Mooncake won't choose DMA-BUF? It would be great if Mooncake could detect and auto-choose the implementation.
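To clarify what I mean by auto-choosing, something roughly along these lines (just a sketch of the idea; the detection heuristic and backend labels are my own illustration, not anything Mooncake does today, and real support also depends on kernel, driver, and RDMA stack capabilities):

```python
def nvidia_peermem_loaded() -> bool:
    """Best-effort check for the nvidia_peermem kernel module via /proc/modules."""
    try:
        with open("/proc/modules") as f:
            return any(line.startswith("nvidia_peermem ") for line in f)
    except OSError:
        return False

def pick_gdr_path() -> str:
    """Illustrative policy only: prefer peermem when the module is loaded,
    otherwise fall back to DMA-BUF."""
    return "nvidia-peermem" if nvidia_peermem_loaded() else "dma-buf"

print(pick_gdr_path())
```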
@alogfans Could you help answer this? Thanks.