[Bug]: P2P NVLink usage
Bug Report
Hi, when I was running the NVLink P2P transfer demo through the Python interface on an H20 machine, I ran into a problem: "E1024 06:51:59.688545 162143 nvlink_transport.cpp:603] NvlinkTransport: cuMemCreate failed: 800". Could you help me with this? Thank you.
My mooncake-transfer-engine package was compiled and installed from source with the following commands:
cmake .. -DUSE_MNNVL=ON -DUSE_CUDA=ON
make -j
make install
Before submitting...
- [ ] Ensure you searched for relevant issues and read the [documentation]
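For reference, the numeric code in the log above can be translated into its symbolic name with the driver API's error helpers; a minimal ctypes sketch (800 corresponds to CUDA_ERROR_NOT_PERMITTED in cuda.h):

```python
import ctypes

cuda = ctypes.CDLL("libcuda.so.1")
cuda.cuInit(0)  # not strictly needed for the error helpers, but harmless

name = ctypes.c_char_p()
desc = ctypes.c_char_p()
cuda.cuGetErrorName(800, ctypes.byref(name))    # fills name with b'CUDA_ERROR_NOT_PERMITTED'
cuda.cuGetErrorString(800, ctypes.byref(desc))  # human-readable description
print(name.value, desc.value)
```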
@15050188022 What is your CUDA version? This feature requires CUDA >= 12.8.
The sample I used is https://kvcache-ai.github.io/Mooncake/python-api-reference/transfer-engine.html. H20 with CUDA 12.8.
CC: @alogfans can you help? Under what circumstances will this cuMemCreate fail?
When I try to set MC_USE_NVLINK_IPC=1 to fall back to IPC, it does not work either.
In addition, my issue seems to be similar to the one in PR 683. Has it been resolved before?
We used the SGLang 0.5.4 image (docker pull lmsysorg/sglang:v0.5.4) to deploy PD disaggregation on a single node (H20-3e). We want to use NVLink:
export SGLANG_MOONCAKE_CUSTOM_MEM_POOL=True
export MC_FORCE_MNNVL=True
MC_TE_METRIC=1 CUDA_VISIBLE_DEVICES=4,5,6,7 python3 -m sglang.launch_server --model-path /mnt/nvme/models/Qwen3-235B-A22B-Instruct-2507-FP8/ --port 7000 --host 0.0.0.0 --tensor-parallel-size 4 --disaggregation-mode prefill --disaggregation-transfer-backend mooncake --disaggregation-ib-device auto --trust-remote-code --disable-radix-cache
We got the following error:
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 2747, in run_scheduler_process
scheduler = Scheduler(
^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 311, in __init__
self.tp_worker = TpModelWorker(
^^^^^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 235, in __init__
self._model_runner = ModelRunner(
^^^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 318, in __init__
self.initialize(min_per_gpu_memory)
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 463, in initialize
self.init_memory_pool(
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 1888, in init_memory_pool
self.token_to_kv_pool = MHATokenToKVPool(
^^^^^^^^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/mem_cache/memory_pool.py", line 514, in __init__
super().__init__(
File "/sgl-workspace/sglang/python/sglang/srt/mem_cache/memory_pool.py", line 438, in __init__
allocator = NVLinkAllocator.get_allocator(self.device)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/mooncake/allocator.py", line 43, in get_allocator
cls._instances[device] = CUDAPluggableAllocator(
^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/cuda/memory.py", line 1122, in __init__
allocator = ctypes.CDLL(path_to_so_file)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/ctypes/__init__.py", line 379, in __init__
self._handle = _dlopen(self._name, mode)
^^^^^^^^^^^^^^^^^^^^^^^^^
OSError: /usr/local/lib/python3.12/dist-packages/mooncake/nvlink_allocator.so: undefined symbol: cuMemCreate
After debugging:
root@server001:/sgl-workspace/sglang# nm -D /usr/local/lib/python3.12/dist-packages/mooncake/nvlink_allocator.so | grep cuMemCreate
U cuMemCreate
root@server001:/sgl-workspace/sglang# readelf -d /usr/local/lib/python3.12/dist-packages/mooncake/nvlink_allocator.so | grep NEEDED
0x0000000000000001 (NEEDED) Shared library: [libstdc++.so.6]
0x0000000000000001 (NEEDED) Shared library: [libc.so.6]
0x0000000000000001 (NEEDED) Shared library: [ld-linux-x86-64.so.2]
root@server001:/sgl-workspace/sglang#
root@server001:/sgl-workspace/sglang#
root@server001:/sgl-workspace/sglang# patchelf --add-needed libcuda.so.1 \
/usr/local/lib/python3.12/dist-packages/mooncake/nvlink_allocator.so
root@server001:/sgl-workspace/sglang#
The error is solved.
But a new problem comes:
[2025-10-31 06:32:10 TP0] Using KV cache dtype: torch.bfloat16
cuMemCreate failed: 800cuMemCreate failed: 800cuMemCreate failed: 800cuMemCreate failed: 800[2025-10-31 06:32:10 TP2] Scheduler hit an exception: Traceback (most recent call last):
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 2747, in run_scheduler_process
scheduler = Scheduler(
^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 311, in __init__
self.tp_worker = TpModelWorker(
^^^^^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 235, in __init__
self._model_runner = ModelRunner(
^^^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 318, in __init__
self.initialize(min_per_gpu_memory)
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 463, in initialize
self.init_memory_pool(
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 1888, in init_memory_pool
self.token_to_kv_pool = MHATokenToKVPool(
^^^^^^^^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/mem_cache/memory_pool.py", line 527, in __init__
self._create_buffers()
File "/sgl-workspace/sglang/python/sglang/srt/mem_cache/memory_pool.py", line 592, in _create_buffers
torch.zeros(
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 376.00 MiB. GPU 2 has a total capacity of 139.80 GiB of which 79.85 GiB is free. Including non-PyTorch memory, this process has 59.94 GiB memory in use. Of the allocated memory 58.20 GiB is allocated by PyTorch, and 278.98 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Decreasing --mem-fraction-static to 0.6 still does not work!
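One hedged way to check whether this OOM is a real capacity problem or just the custom allocator's cuMemCreate failure surfacing through PyTorch: in a fresh process, route a tiny allocation through nvlink_allocator.so. The alloc/free entry-point names below are hypothetical placeholders; the real ones are whatever mooncake/allocator.py passes to CUDAPluggableAllocator.

```python
# Hedged isolation sketch, not the sglang code path. If the allocator's underlying
# cuMemCreate keeps returning 800, even a tiny allocation fails with OutOfMemoryError
# while mem_get_info() still reports ~80 GiB free, i.e. lowering --mem-fraction-static
# cannot help because the failure is not about capacity.
import torch

SO = "/usr/local/lib/python3.12/dist-packages/mooncake/nvlink_allocator.so"

# "mc_malloc"/"mc_free" are hypothetical symbol names used only for this sketch.
alloc = torch.cuda.memory.CUDAPluggableAllocator(SO, "mc_malloc", "mc_free")
torch.cuda.memory.change_current_allocator(alloc)  # must happen before any CUDA allocation

print(torch.cuda.mem_get_info(0))         # (free_bytes, total_bytes) straight from the driver
x = torch.zeros(1024, device="cuda")      # tiny allocation through the custom allocator
print("allocation through nvlink_allocator.so succeeded")
```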
root@server001:/sgl-workspace/sglang# pip3 list |grep mooncake
mooncake-transfer-engine 0.3.6.post1
root@server001:/sgl-workspace/sglang#
root@server001:/sgl-workspace/sglang#
root@server001:/sgl-workspace/sglang# pip3 list | grep sglang
sglang 0.5.4 /sgl-workspace/sglang/python
sglang-router 0.2.1
root@server001:/sgl-workspace/sglang#
root@server001:/sgl-workspace/sglang# nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2025 NVIDIA Corporation
Built on Tue_May_27_02:21:03_PDT_2025
Cuda compilation tools, release 12.9, V12.9.86
Build cuda_12.9.r12.9/compiler.36037853_0
root@server001:/sgl-workspace/sglang#
Do you have any idea about this? @ShangmingCai
@ChuanhongLi Maybe search how to address OSError: /usr/local/lib/python3.12/dist-packages/mooncake/nvlink_allocator.so: undefined symbol: cuMemCreate.
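One way to address that dlopen failure without patching the .so is to make the driver's symbols globally visible before torch loads the allocator; a hedged sketch (exporting LD_PRELOAD=libcuda.so.1 before launching the server has the same effect):

```python
# Load the CUDA driver into the process with RTLD_GLOBAL *before*
# torch.cuda.memory.CUDAPluggableAllocator dlopen()s nvlink_allocator.so,
# so the plugin's unresolved cuMemCreate reference can bind against the
# already-loaded libcuda.so.1 (the plugin has no DT_NEEDED entry for it).
import ctypes

ctypes.CDLL("libcuda.so.1", mode=ctypes.RTLD_GLOBAL)

# ...then import/initialize the mooncake NVLinkAllocator in the same process;
# the "undefined symbol: cuMemCreate" OSError should no longer be raised.
```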
After patchelf --add-needed libcuda.so.1 /usr/local/lib/python3.12/dist-packages/mooncake/nvlink_allocator.so, the OSError has been resolved, but the new problem is the cuMemCreate failed: 800 / CUDA out-of-memory error shown in the traceback above.
The same issue, have you solved it yet? @ChuanhongLi
Not yet...
Hello, I met the same issue too... Could you please let me know what progress you have made so far?
On an H100 machine, I tried the latest version, but I still got the same OOM error. @alogfans @ShangmingCai Hi, could you please assist us?
When SGLANG_MOONCAKE_CUSTOM_MEM_POOL is set to true, GPU memory usage becomes abnormal.
@thqq479 In fact, this MNNVL feature targets GB200 MNNVL, so we are not sure it fits Hopper intra-node NVLink usage. There are people in the community who want to implement a general NVLink transport backend. Stay tuned.
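To illustrate the point above, here is a hedged ctypes probe (my own sketch, not Mooncake code; enum values and the CUmemAllocationProp layout are copied from cuda.h, and the assumption that -DUSE_MNNVL=ON makes the transport request fabric handles is exactly that, an assumption). On a GPU/driver stack without MNNVL (IMEX/fabric) support, the FABRIC request can fail with 800 (CUDA_ERROR_NOT_PERMITTED), matching the error in this issue, while a POSIX file-descriptor request typically succeeds:

```python
import ctypes

cuda = ctypes.CDLL("libcuda.so.1")

# Constants copied from cuda.h (assumed stable across CUDA 12.x).
CU_MEM_ALLOCATION_TYPE_PINNED = 1
CU_MEM_LOCATION_TYPE_DEVICE = 1
CU_MEM_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR = 1
CU_MEM_HANDLE_TYPE_FABRIC = 8
CU_MEM_ALLOC_GRANULARITY_MINIMUM = 0

class CUmemLocation(ctypes.Structure):
    _fields_ = [("type", ctypes.c_int), ("id", ctypes.c_int)]

class CUmemAllocationProp(ctypes.Structure):
    _fields_ = [
        ("type", ctypes.c_int),
        ("requestedHandleTypes", ctypes.c_int),
        ("location", CUmemLocation),
        ("win32HandleMetaData", ctypes.c_void_p),
        ("allocFlags", ctypes.c_ubyte * 8),  # compressionType/gpuDirectRDMACapable/usage/reserved
    ]

def probe(handle_type, label):
    prop = CUmemAllocationProp()
    prop.type = CU_MEM_ALLOCATION_TYPE_PINNED
    prop.requestedHandleTypes = handle_type
    prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE
    prop.location.id = 0

    gran = ctypes.c_size_t(0)
    cuda.cuMemGetAllocationGranularity(ctypes.byref(gran), ctypes.byref(prop),
                                       CU_MEM_ALLOC_GRANULARITY_MINIMUM)
    size = ctypes.c_size_t(gran.value or 2 * 1024 * 1024)  # fall back to 2 MiB if the query fails

    handle = ctypes.c_ulonglong(0)
    rc = cuda.cuMemCreate(ctypes.byref(handle), size, ctypes.byref(prop), ctypes.c_ulonglong(0))
    print(f"cuMemCreate({label}) -> {rc}")
    if rc == 0:
        cuda.cuMemRelease(handle)

cuda.cuInit(0)
dev = ctypes.c_int(0)
cuda.cuDeviceGet(ctypes.byref(dev), 0)
ctx = ctypes.c_void_p()
cuda.cuDevicePrimaryCtxRetain(ctypes.byref(ctx), dev)  # retain a context so the VMM calls have one current
cuda.cuCtxSetCurrent(ctx)

probe(CU_MEM_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR, "POSIX_FD")  # typically 0 (success) on H20/H100
probe(CU_MEM_HANDLE_TYPE_FABRIC, "FABRIC")                   # 800 expected where MNNVL/IMEX is unavailable
```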
For the abnormal GPU memory with SGLANG_MOONCAKE_CUSTOM_MEM_POOL, try reducing --mem-fraction-static to a lower value.
Great! Does this already have a PR?
@thqq479 Not yet. You can contact Yaozhong Liu [Aliyun] in the Slack channel.
Hello, I met the same issue too... Could you please let me know what progress you have made so far?
I gave up... haha
