
Can NVLink be used when Prefill and Decode are deployed separately on the same machine?

Open taohui opened this issue 6 months ago • 16 comments

I'm exploring a deployment setup where the Prefill and Decode stages are separated into different services through SGLang, but still running on the same physical machine. In this PD-disaggregate setup, is it possible to leverage NVLink for high-speed GPU communication between these two stages?

I noticed that you're developing NVLink Transport support. However, when I tested it in this setup, I encountered a register_memory error during runtime. Could you help clarify what might cause this? Is it because NVLink transport doesn't currently support the Prefill/Decode disaggregate scenario?

taohui • Jun 12 '25

@taohui This is still an ongoing experimental feature; some code in SGLang needs to be patched to make it runnable. We will try to make it ready before the end of this month.

ShangmingCai • Jun 14 '25

Enable USE_CUDA and USE_MNNVL in https://github.com/kvcache-ai/Mooncake/blob/897728ddbfb8c1269c3cb64b4097e281d203faff/mooncake-common/common.cmake, then compile and install from source.

Then set these environment variables for SGLang:

export SGLANG_MOONCAKE_CUSTOM_MEM_POOL=True
export MC_FORCE_MNNVL=True
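
For reference, the compile-and-install step might look roughly like this (a sketch only; editing mooncake-common/common.cmake directly is what is described above, while passing the options on the cmake command line is an assumption that they are declared as regular CMake options):

git clone https://github.com/kvcache-ai/Mooncake.git
cd Mooncake
# Either flip USE_CUDA and USE_MNNVL to ON inside mooncake-common/common.cmake,
# or (assumed to be equivalent) override them when configuring:
mkdir build && cd build
cmake .. -DUSE_CUDA=ON -DUSE_MNNVL=ON
make -j
sudo make install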

ShangmingCai • Aug 31 '25


I0831 12:16:22.265769 263663 transfer_engine.cpp:199] Topology discovery complete. Found 4 HCAs.
E0831 12:16:22.844702 263663 nvlink_transport.cpp:352] Unsupported memory type, 0x564fa5c0a240 0
E0831 12:16:22.844738 263663 nvlink_transport.cpp:352] Unsupported memory type, 0x564fa51ccc00 0
E0831 12:16:22.844743 263663 nvlink_transport.cpp:352] Unsupported memory type, 0x564fa4a0be80 0
E0831 12:16:22.844748 263663 nvlink_transport.cpp:352] Unsupported memory type, 0x564fa4b08ec0 0
E0831 12:16:22.844751 263663 nvlink_transport.cpp:352] Unsupported memory type, 0x564f9e9e8000 0
E0831 12:16:22.844758 263663 nvlink_transport.cpp:352] Unsupported memory type, 0x7fb347ffb040 0
E0831 12:16:22.844761 263663 nvlink_transport.cpp:352] Unsupported memory type, 0x564fa0da58c0 0

I followed the instructions, but I'm encountering an error. Could you please tell me what might be causing it? @ShangmingCai

thqq479 • Aug 31 '25

@thqq479 It requires CUDA 12.8+. What is your CUDA version?
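
(For reference, both the toolkit and the driver-reported CUDA version are worth checking, since the NVLink/fabric memory path also depends on driver support; these are standard commands:)

nvcc --version   # CUDA toolkit used to build Mooncake
nvidia-smi       # driver version and the CUDA version it supports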

ShangmingCai • Aug 31 '25

nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2025 NVIDIA Corporation
Built on Wed_Jan_15_19:20:09_PST_2025
Cuda compilation tools, release 12.8, V12.8.61
Build cuda_12.8.r12.8/compiler.35404655_0

I upgraded the version, but I'm still getting the same error. I'm using the H100. @ShangmingCai

thqq479 • Sep 04 '25

@alogfans Can you help check what causes the Unsupported memory type error? If SGLANG_MOONCAKE_CUSTOM_MEM_POOL is set to True, then I don't know what causes the problem; maybe it requires a hardware compute capability > x.x.
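
(A quick way to check the compute capability locally, assuming a driver recent enough to support the compute_cap query field; whether there is a hard requirement has not been confirmed here:)

nvidia-smi --query-gpu=name,compute_cap --format=csv
# e.g. H100/H200 report 9.0, while GB200 (Blackwell) reports 10.x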

ShangmingCai • Sep 04 '25

@alogfans @ShangmingCai Hi, may I ask how I should proceed to further investigate this issue, or what environment you tested in?

thqq479 • Sep 08 '25

GB200

ShangmingCai • Sep 08 '25

GB200 machines are too hard to obtain. Could we replicate the test on an H200/H100 instead?

thqq479 • Sep 15 '25

@thqq479

I think @alogfans has verified this on an H800 before. Maybe he can share some insights.

ShangmingCai • Sep 15 '25

Environment: 2× H200 (1P1D), CUDA 12.8, sglang 0.5.1, mooncake 0.3.6

File "/python/sglang/srt/managers/scheduler.py", line 3144, in run_scheduler_process scheduler = Scheduler(server_args, port_args, gpu_id, tp_rank, moe_ep_rank, pp_rank, dp_rank, balance_meta, cpu_barrier) File "/python/sglang/srt/managers/scheduler.py", line 373, in init self.tp_worker = TpWorkerClass( File "/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 71, in init self.worker = TpModelWorker( File "/python/sglang/srt/managers/tp_worker.py", line 127, in init self.model_runner = ModelRunner( File "/python/sglang/srt/model_executor/model_runner.py", line 289, in init self.initialize(min_per_gpu_memory) File "/python/sglang/srt/model_executor/model_runner.py", line 394, in initialize self.init_memory_pool( File "/python/sglang/srt/model_executor/model_runner.py", line 1534, in init_memory_pool self.token_to_kv_pool = MLATokenToKVPool( File "/python/sglang/srt/mem_cache/memory_pool.py", line 755, in init self.kv_buffer = [ File "/python/sglang/srt/mem_cache/memory_pool.py", line 756, in torch.zeros( torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.51 GiB. GPU 1 has a total capacity of 139.81 GiB of which 109.44 GiB is free. Including non-PyTorch memory, this process has 30.36 GiB memory in use. Of the allocated memory 29.66 GiB is allocated by PyTorch, and 41.36 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

thqq479 • Sep 25 '25

export SGLANG_MOONCAKE_CUSTOM_MEM_POOL=True
export MC_FORCE_MNNVL=True

After setting these two environment variables, the service OOMs during startup, and the fragmentation seems quite severe. @ShangmingCai @alogfans

thqq479 • Sep 25 '25

@thqq479 Try decreasing --mem-fraction-static to 0.75 or maybe lower?
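
(For example, appended to whatever prefill/decode launch command is already in use; --mem-fraction-static is the standard SGLang flag, everything else below is a placeholder:)

SGLANG_MOONCAKE_CUSTOM_MEM_POOL=True MC_FORCE_MNNVL=True \
python -m sglang.launch_server <existing flags> --mem-fraction-static 0.75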

ShangmingCai • Sep 25 '25

According to the log, 109.44 GiB is free, so there is actually plenty of available memory, yet fragmentation seems severe. Could this be related to the CUDA mempool?

thqq479 • Sep 26 '25

@thqq479 No idea why it tried to allocate 2.51 GiB but failed when there are 109.44 GiB available.

ShangmingCai • Sep 26 '25

@ShangmingCai With sglang 0.5.4 and mooncake-transfer-engine 0.3.6.post1, and with

export SGLANG_MOONCAKE_CUSTOM_MEM_POOL=True
export MC_FORCE_MNNVL=True

I get the same error on an H20-3e:

torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 226.00 MiB. GPU 2 has a total capacity of 139.80 GiB of which 80.78 GiB is free. Including non-PyTorch memory, this process has 59.01 GiB memory in use. Of the allocated memory 57.27 GiB is allocated by PyTorch, and 278.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

This was with --mem-fraction-static 0.7.
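
(One thing not yet ruled out in this thread is the allocator setting the error message itself suggests; whether it interacts correctly with the Mooncake custom memory pool is untested here:)

export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True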

ChuanhongLi • Oct 31 '25