[Bug]: P2P NVLink usage
Bug Report
Hi, when I was running the NVLink P2P transfer demo through the Python interface on an H20 machine, I ran into a problem: "E1024 06:51:59.688545 162143 nvlink_transport.cpp:603] NvlinkTransport: cuMemCreate failed: 800". Could you help me with this? Thank you.
My mooncake-transfer-engine package was compiled and installed from source with the following commands:
cmake .. -DUSE_MNNVL=ON -DUSE_CUDA=ON
make -j
make install
Before submitting...
- [ ] Ensure you searched for relevant issues and read the [documentation]
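For reference, the numeric code in the log above can be translated into its symbolic name with the driver API's error helpers; a minimal ctypes sketch (800 corresponds to CUDA_ERROR_NOT_PERMITTED in cuda.h):

```python
import ctypes

cuda = ctypes.CDLL("libcuda.so.1")
cuda.cuInit(0)  # not strictly needed for the error helpers, but harmless

name = ctypes.c_char_p()
desc = ctypes.c_char_p()
cuda.cuGetErrorName(800, ctypes.byref(name))    # fills name with b'CUDA_ERROR_NOT_PERMITTED'
cuda.cuGetErrorString(800, ctypes.byref(desc))  # human-readable description
print(name.value, desc.value)
```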
@15050188022 What is your CUDA version? This feature requires CUDA >= 12.8.
The sample I used is https://kvcache-ai.github.io/Mooncake/python-api-reference/transfer-engine.html. H20 with CUDA 12.8.
CC: @alogfans can you help? Under what circumstances will this cuMemCreate fail?
When I try to set MC_USE_NVLINK_IPC=1 to fall back to IPC, it does not work either.
In addition, my issue seems to be similar to the one in PR 683. Has it been resolved before?
We used the SGLang 0.5.4 image (docker pull lmsysorg/sglang:v0.5.4) to deploy PD disaggregation on a single node (H20-3e). We want to use NVLink:
export SGLANG_MOONCAKE_CUSTOM_MEM_POOL=True
export MC_FORCE_MNNVL=True
MC_TE_METRIC=1 CUDA_VISIBLE_DEVICES=4,5,6,7 python3 -m sglang.launch_server --model-path /mnt/nvme/models/Qwen3-235B-A22B-Instruct-2507-FP8/ --port 7000 --host 0.0.0.0 --tensor-parallel-size 4 --disaggregation-mode prefill --disaggregation-transfer-backend mooncake --disaggregation-ib-device auto --trust-remote-code --disable-radix-cache
We got the following error:
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 2747, in run_scheduler_process
scheduler = Scheduler(
^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 311, in __init__
self.tp_worker = TpModelWorker(
^^^^^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 235, in __init__
self._model_runner = ModelRunner(
^^^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 318, in __init__
self.initialize(min_per_gpu_memory)
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 463, in initialize
self.init_memory_pool(
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 1888, in init_memory_pool
self.token_to_kv_pool = MHATokenToKVPool(
^^^^^^^^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/mem_cache/memory_pool.py", line 514, in __init__
super().__init__(
File "/sgl-workspace/sglang/python/sglang/srt/mem_cache/memory_pool.py", line 438, in __init__
allocator = NVLinkAllocator.get_allocator(self.device)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/mooncake/allocator.py", line 43, in get_allocator
cls._instances[device] = CUDAPluggableAllocator(
^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/cuda/memory.py", line 1122, in __init__
allocator = ctypes.CDLL(path_to_so_file)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/ctypes/__init__.py", line 379, in __init__
self._handle = _dlopen(self._name, mode)
^^^^^^^^^^^^^^^^^^^^^^^^^
OSError: /usr/local/lib/python3.12/dist-packages/mooncake/nvlink_allocator.so: undefined symbol: cuMemCreate
After debugging:
root@server001:/sgl-workspace/sglang# nm -D /usr/local/lib/python3.12/dist-packages/mooncake/nvlink_allocator.so | grep cuMemCreate
U cuMemCreate
root@server001:/sgl-workspace/sglang# readelf -d /usr/local/lib/python3.12/dist-packages/mooncake/nvlink_allocator.so | grep NEEDED
0x0000000000000001 (NEEDED) Shared library: [libstdc++.so.6]
0x0000000000000001 (NEEDED) Shared library: [libc.so.6]
0x0000000000000001 (NEEDED) Shared library: [ld-linux-x86-64.so.2]
root@server001:/sgl-workspace/sglang#
root@server001:/sgl-workspace/sglang#
root@server001:/sgl-workspace/sglang# patchelf --add-needed libcuda.so.1 \
/usr/local/lib/python3.12/dist-packages/mooncake/nvlink_allocator.so
root@server001:/sgl-workspace/sglang#
The error is solved.
But a new problem comes:
[2025-10-31 06:32:10 TP0] Using KV cache dtype: torch.bfloat16
cuMemCreate failed: 800cuMemCreate failed: 800cuMemCreate failed: 800cuMemCreate failed: 800[2025-10-31 06:32:10 TP2] Scheduler hit an exception: Traceback (most recent call last):
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 2747, in run_scheduler_process
scheduler = Scheduler(
^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 311, in __init__
self.tp_worker = TpModelWorker(
^^^^^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 235, in __init__
self._model_runner = ModelRunner(
^^^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 318, in __init__
self.initialize(min_per_gpu_memory)
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 463, in initialize
self.init_memory_pool(
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 1888, in init_memory_pool
self.token_to_kv_pool = MHATokenToKVPool(
^^^^^^^^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/mem_cache/memory_pool.py", line 527, in __init__
self._create_buffers()
File "/sgl-workspace/sglang/python/sglang/srt/mem_cache/memory_pool.py", line 592, in _create_buffers
torch.zeros(
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 376.00 MiB. GPU 2 has a total capacity of 139.80 GiB of which 79.85 GiB is free. Including non-PyTorch memory, this process has 59.94 GiB memory in use. Of the allocated memory 58.20 GiB is allocated by PyTorch, and 278.98 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Decreasing --mem-fraction-static to 0.6 still does not work!
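One hedged way to check whether this OOM is a real capacity problem or just the custom allocator's cuMemCreate failure surfacing through PyTorch: in a fresh process, route a tiny allocation through nvlink_allocator.so. The alloc/free entry-point names below are hypothetical placeholders; the real ones are whatever mooncake/allocator.py passes to CUDAPluggableAllocator.

```python
# Hedged isolation sketch, not the sglang code path. If the allocator's underlying
# cuMemCreate keeps returning 800, even a tiny allocation fails with OutOfMemoryError
# while mem_get_info() still reports ~80 GiB free, i.e. lowering --mem-fraction-static
# cannot help because the failure is not about capacity.
import torch

SO = "/usr/local/lib/python3.12/dist-packages/mooncake/nvlink_allocator.so"

# "mc_malloc"/"mc_free" are hypothetical symbol names used only for this sketch.
alloc = torch.cuda.memory.CUDAPluggableAllocator(SO, "mc_malloc", "mc_free")
torch.cuda.memory.change_current_allocator(alloc)  # must happen before any CUDA allocation

print(torch.cuda.mem_get_info(0))         # (free_bytes, total_bytes) straight from the driver
x = torch.zeros(1024, device="cuda")      # tiny allocation through the custom allocator
print("allocation through nvlink_allocator.so succeeded")
```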
root@server001:/sgl-workspace/sglang# pip3 list |grep mooncake
mooncake-transfer-engine 0.3.6.post1
root@server001:/sgl-workspace/sglang#
root@server001:/sgl-workspace/sglang#
root@server001:/sgl-workspace/sglang# pip3 list | grep sglang
sglang 0.5.4 /sgl-workspace/sglang/python
sglang-router 0.2.1
root@server001:/sgl-workspace/sglang#
root@server001:/sgl-workspace/sglang# nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2025 NVIDIA Corporation
Built on Tue_May_27_02:21:03_PDT_2025
Cuda compilation tools, release 12.9, V12.9.86
Build cuda_12.9.r12.9/compiler.36037853_0
root@server001:/sgl-workspace/sglang#
Do you have any idea about this? @ShangmingCai
@ChuanhongLi Maybe search how to address OSError: /usr/local/lib/python3.12/dist-packages/mooncake/nvlink_allocator.so: undefined symbol: cuMemCreate.
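One way to address that dlopen failure without patching the .so is to make the driver's symbols globally visible before torch loads the allocator; a hedged sketch (exporting LD_PRELOAD=libcuda.so.1 before launching the server has the same effect):

```python
# Load the CUDA driver into the process with RTLD_GLOBAL *before*
# torch.cuda.memory.CUDAPluggableAllocator dlopen()s nvlink_allocator.so,
# so the plugin's unresolved cuMemCreate reference can bind against the
# already-loaded libcuda.so.1 (the plugin has no DT_NEEDED entry for it).
import ctypes

ctypes.CDLL("libcuda.so.1", mode=ctypes.RTLD_GLOBAL)

# ...then import/initialize the mooncake NVLinkAllocator in the same process;
# the "undefined symbol: cuMemCreate" OSError should no longer be raised.
```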
After patchelf --add-needed libcuda.so.1 /usr/local/lib/python3.12/dist-packages/mooncake/nvlink_allocator.so, the OSError has been resolved, but the new problem is the cuMemCreate failed: 800 / CUDA out-of-memory error shown in the traceback above.
The same issue, have you solved it yet? @ChuanhongLi
Not yet...
Hello, I met the same issue too... Could you please let me know what progress you have made so far?
On an H100 machine, I tried the latest version, but I still got the same OOM error. @alogfans @ShangmingCai Hi, could you please assist us?
When SGLANG_MOONCAKE_CUSTOM_MEM_POOL is set to true, GPU memory usage becomes abnormal.
@thqq479 In fact, this MNNVL feature targets GB200 MNNVL, so we are not sure it fits Hopper intra-node NVLink usage. There are people in the community who want to implement a general NVLink transport backend. Stay tuned.
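To illustrate the point above, here is a hedged ctypes probe (my own sketch, not Mooncake code; enum values and the CUmemAllocationProp layout are copied from cuda.h, and the assumption that -DUSE_MNNVL=ON makes the transport request fabric handles is exactly that, an assumption). On a GPU/driver stack without MNNVL (IMEX/fabric) support, the FABRIC request can fail with 800 (CUDA_ERROR_NOT_PERMITTED), matching the error in this issue, while a POSIX file-descriptor request typically succeeds:

```python
import ctypes

cuda = ctypes.CDLL("libcuda.so.1")

# Constants copied from cuda.h (assumed stable across CUDA 12.x).
CU_MEM_ALLOCATION_TYPE_PINNED = 1
CU_MEM_LOCATION_TYPE_DEVICE = 1
CU_MEM_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR = 1
CU_MEM_HANDLE_TYPE_FABRIC = 8
CU_MEM_ALLOC_GRANULARITY_MINIMUM = 0

class CUmemLocation(ctypes.Structure):
    _fields_ = [("type", ctypes.c_int), ("id", ctypes.c_int)]

class CUmemAllocationProp(ctypes.Structure):
    _fields_ = [
        ("type", ctypes.c_int),
        ("requestedHandleTypes", ctypes.c_int),
        ("location", CUmemLocation),
        ("win32HandleMetaData", ctypes.c_void_p),
        ("allocFlags", ctypes.c_ubyte * 8),  # compressionType/gpuDirectRDMACapable/usage/reserved
    ]

def probe(handle_type, label):
    prop = CUmemAllocationProp()
    prop.type = CU_MEM_ALLOCATION_TYPE_PINNED
    prop.requestedHandleTypes = handle_type
    prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE
    prop.location.id = 0

    gran = ctypes.c_size_t(0)
    cuda.cuMemGetAllocationGranularity(ctypes.byref(gran), ctypes.byref(prop),
                                       CU_MEM_ALLOC_GRANULARITY_MINIMUM)
    size = ctypes.c_size_t(gran.value or 2 * 1024 * 1024)  # fall back to 2 MiB if the query fails

    handle = ctypes.c_ulonglong(0)
    rc = cuda.cuMemCreate(ctypes.byref(handle), size, ctypes.byref(prop), ctypes.c_ulonglong(0))
    print(f"cuMemCreate({label}) -> {rc}")
    if rc == 0:
        cuda.cuMemRelease(handle)

cuda.cuInit(0)
dev = ctypes.c_int(0)
cuda.cuDeviceGet(ctypes.byref(dev), 0)
ctx = ctypes.c_void_p()
cuda.cuDevicePrimaryCtxRetain(ctypes.byref(ctx), dev)  # retain a context so the VMM calls have one current
cuda.cuCtxSetCurrent(ctx)

probe(CU_MEM_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR, "POSIX_FD")  # typically 0 (success) on H20/H100
probe(CU_MEM_HANDLE_TYPE_FABRIC, "FABRIC")                   # 800 expected where MNNVL/IMEX is unavailable
```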
For the abnormal GPU memory with SGLANG_MOONCAKE_CUSTOM_MEM_POOL, try reducing --mem-fraction-static to a lower value.
Great! Does this already have a PR?
@thqq479 Not yet. You can contact Yaozhong Liu [Aliyun] in the Slack channel.
Hello, I met the same issue too... Could you please let me know what progress you have made so far?
I gave up... haha
