Can NVLink be used when Prefill and Decode are deployed separately on the same machine?
I'm exploring a deployment setup where the Prefill and Decode stages are split into separate services via SGLang, but still running on the same physical machine. In this PD-disaggregated setup, is it possible to leverage NVLink for high-speed GPU communication between the two stages?
I noticed that you're developing NVLink Transport support. However, when I tested it in this setup, I hit a register_memory error at runtime. Could you help clarify what might cause this? Is it because NVLink transport doesn't currently support the Prefill/Decode disaggregated scenario?
@taohui This is still an ongoing experimental feature. Need to patch some code in SGLang to make it runnable. Will try to make it ready before the end of this month.
Enable USE_CUDA and USE_MNNVL in https://github.com/kvcache-ai/Mooncake/blob/897728ddbfb8c1269c3cb64b4097e281d203faff/mooncake-common/common.cmake
Then compile and install from source.
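For reference, a rough sketch of the build, assuming the usual out-of-source CMake flow for Mooncake (passing -D overrides has the same effect as flipping the two options to ON in mooncake-common/common.cmake; dependency installation is omitted):

git clone https://github.com/kvcache-ai/Mooncake.git
cd Mooncake
mkdir build && cd build
# Equivalent to editing USE_CUDA / USE_MNNVL in mooncake-common/common.cmake
cmake .. -DUSE_CUDA=ON -DUSE_MNNVL=ON
make -j
sudo make install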
Set this up with sglang:
export SGLANG_MOONCAKE_CUSTOM_MEM_POOL=True
export MC_FORCE_MNNVL=True
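A hedged example of how the two services might then be started on the same machine (flag names follow recent sglang releases; the model path, ports, and GPU ids are illustrative, so adjust them to your setup):

# Prefill instance on GPU 0 (illustrative values)
SGLANG_MOONCAKE_CUSTOM_MEM_POOL=True MC_FORCE_MNNVL=True \
python -m sglang.launch_server --model-path <model> \
  --disaggregation-mode prefill --disaggregation-transfer-backend mooncake \
  --base-gpu-id 0 --port 30000

# Decode instance on GPU 1 (illustrative values)
SGLANG_MOONCAKE_CUSTOM_MEM_POOL=True MC_FORCE_MNNVL=True \
python -m sglang.launch_server --model-path <model> \
  --disaggregation-mode decode --disaggregation-transfer-backend mooncake \
  --base-gpu-id 1 --port 30001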
I0831 12:16:22.265769 263663 transfer_engine.cpp:199] Topology discovery complete. Found 4 HCAs.
E0831 12:16:22.844702 263663 nvlink_transport.cpp:352] Unsupported memory type, 0x564fa5c0a240 0
E0831 12:16:22.844738 263663 nvlink_transport.cpp:352] Unsupported memory type, 0x564fa51ccc00 0
E0831 12:16:22.844743 263663 nvlink_transport.cpp:352] Unsupported memory type, 0x564fa4a0be80 0
E0831 12:16:22.844748 263663 nvlink_transport.cpp:352] Unsupported memory type, 0x564fa4b08ec0 0
E0831 12:16:22.844751 263663 nvlink_transport.cpp:352] Unsupported memory type, 0x564f9e9e8000 0
E0831 12:16:22.844758 263663 nvlink_transport.cpp:352] Unsupported memory type, 0x7fb347ffb040 0
E0831 12:16:22.844761 263663 nvlink_transport.cpp:352] Unsupported memory type, 0x564fa0da58c0 0

I followed the instructions, but I'm encountering this error. Could you please tell me what might be causing it? @ShangmingCai
@thqq479 It requires CUDA 12.8+. What is your CUDA version?
nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2025 NVIDIA Corporation
Built on Wed_Jan_15_19:20:09_PST_2025
Cuda compilation tools, release 12.8, V12.8.61
Build cuda_12.8.r12.8/compiler.35404655_0

I upgraded to this version, but I'm still getting the same error. I'm using an H100. @ShangmingCai
@alogfans Can you help check what causes the "Unsupported memory type" error? If SGLANG_MOONCAKE_CUSTOM_MEM_POOL is set to True and it still fails, I don't know what causes the problem; maybe it requires hardware compute capability > x.x.
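In case it helps narrow this down, a quick way to check the driver version, compute capability, and the CUDA runtime your Python environment actually uses (nvcc only reports the toolkit version; the compute_cap query field needs a reasonably recent driver):

nvidia-smi --query-gpu=name,driver_version,compute_cap --format=csv
python -c "import torch; print(torch.version.cuda, torch.cuda.get_device_capability())"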
@alogfans @ShangmingCai Hi, may I ask how I should proceed to further investigate this issue, or what environment you tested in?
GB200
GB200 machines are too hard to obtain. Could we reproduce the test on an H200/H100 instead?
@thqq479
I think @alogfans has verified this on an H800 before. Maybe he can share some insights.
Environment: 2x H200 (1P1D), CUDA 12.8, sglang 0.5.1, mooncake 0.3.6
File "/python/sglang/srt/managers/scheduler.py", line 3144, in run_scheduler_process
scheduler = Scheduler(server_args, port_args, gpu_id, tp_rank, moe_ep_rank, pp_rank, dp_rank, balance_meta, cpu_barrier)
File "/python/sglang/srt/managers/scheduler.py", line 373, in init
self.tp_worker = TpWorkerClass(
File "/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 71, in init
self.worker = TpModelWorker(
File "/python/sglang/srt/managers/tp_worker.py", line 127, in init
self.model_runner = ModelRunner(
File "/python/sglang/srt/model_executor/model_runner.py", line 289, in init
self.initialize(min_per_gpu_memory)
File "/python/sglang/srt/model_executor/model_runner.py", line 394, in initialize
self.init_memory_pool(
File "/python/sglang/srt/model_executor/model_runner.py", line 1534, in init_memory_pool
self.token_to_kv_pool = MLATokenToKVPool(
File "/python/sglang/srt/mem_cache/memory_pool.py", line 755, in init
self.kv_buffer = [
File "/python/sglang/srt/mem_cache/memory_pool.py", line 756, in
export SGLANG_MOONCAKE_CUSTOM_MEM_POOL=True
export MC_FORCE_MNNVL=True

After setting these two environment variables, the service OOMs during startup, and the fragmentation seems quite severe. @ShangmingCai @alogfans
@thqq479 Try decreasing --mem-fraction-static to 0.75 or maybe lower?
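For reference, a hedged example of where that flag goes on the launch command (the value is illustrative):

# --mem-fraction-static controls the fraction of GPU memory reserved for model weights and the KV cache pool
python -m sglang.launch_server <your existing flags> --mem-fraction-static 0.75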
According to the log, 109.44 GiB is free, so there is actually plenty of available memory, but the fragmentation seems severe. I don't know whether this is related to the CUDA mempool.
@thqq479 No idea why it tried to allocate 2.51 GiB but failed when there are 109.44 GiB available.
@ShangmingCai
sglang 0.5.4 and mooncake-transfer-engine 0.3.6.post1
export SGLANG_MOONCAKE_CUSTOM_MEM_POOL=True
export MC_FORCE_MNNVL=True
Same error on H20-3e!
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 226.00 MiB. GPU 2 has a total capacity of 139.80 GiB of which 80.78 GiB is free. Including non-PyTorch memory, this process has 59.01 GiB memory in use. Of the allocated memory 57.27 GiB is allocated by PyTorch, and 278.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
This is with --mem-fraction-static 0.7.
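Not verified under the Mooncake custom mem pool, but the allocator hint in that error message is cheap to try:

export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True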