[Bug] Mooncake memory registration failed
Checklist
- [x] 1. I have searched related issues but cannot get the expected help.
- [x] 2. The bug has not been fixed in the latest version.
- [x] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
- [x] 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
- [x] 5. Please use English, otherwise it will be closed.
Describe the bug
srt/disaggregation/mooncake/transfer_engine.py line 36, in register
raise RuntimeError("Mooncake memory registration failed.")
E0511 07:42:36.355509 10240 rdma_context.cpp:198] Failed to register memory 0x2e1cbac200: Bad address [14]
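For readers triaging this: the trailing `[14]` in the Mooncake log line appears to be the OS errno value, which matches the "Bad address" text (errno 14 is EFAULT on Linux). That reading is an assumption based on the message, not on the Mooncake source. A minimal sketch for decoding it, where `decode_errno_suffix` is a hypothetical helper name:

```python
import errno
import re


def decode_errno_suffix(log_line):
    """Return (errno number, symbolic name) from a trailing '[N]' in a log line, or None."""
    match = re.search(r"\[(\d+)\]\s*$", log_line)
    if match is None:
        return None
    code = int(match.group(1))
    # errno.errorcode maps numeric codes to names such as 'EFAULT'.
    return code, errno.errorcode.get(code, "UNKNOWN")


line = (
    "E0511 07:42:36.355509 10240 rdma_context.cpp:198] "
    "Failed to register memory 0x2e1cbac200: Bad address [14]"
)
print(decode_errno_suffix(line))
```

EFAULT from a memory registration call generally means the kernel could not access the address range it was given, so the pointer and length passed to `register` are the first things to check.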
Reproduction
--disaggregation-mode prefill
Environment
Python: 3.10.12 (main, Feb 4 2025, 14:57:36) [GCC 11.4.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA H800
GPU 0,1,2,3,4,5,6,7 Compute Capability: 9.0
CUDA_HOME: /usr/local/cuda-12.4
NVCC: Cuda compilation tools, release 12.4, V12.4.131
CUDA Driver Version: 535.230.02
PyTorch: 2.6.0+cu124
sglang: 0.4.6.post2
sgl_kernel: 0.1.1
flashinfer_python: 0.2.5
triton: 3.2.0
transformers: 4.51.1
torchao: 0.10.0
numpy: 1.26.4
aiohttp: 3.11.18
fastapi: 0.115.12
hf_transfer: 0.1.9
huggingface_hub: 0.31.1
interegular: 0.3.3
modelscope: 1.25.0
orjson: 3.10.18
outlines: 0.1.11
packaging: 25.0
psutil: 7.0.0
pydantic: 2.11.4
python-multipart: 0.0.20
pyzmq: 26.4.0
uvicorn: 0.34.2
uvloop: 0.21.0
vllm: 0.8.2
xgrammar: 0.1.16
openai: 1.75.0
tiktoken: 0.9.0
anthropic: 0.51.0
litellm: 1.68.1
decord: 0.6.0
NVIDIA Topology:
        GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7  NIC0  NIC1  NIC2  NIC3  NIC4  NIC5  NIC6  NIC7  NIC8  CPU Affinity  NUMA Affinity  GPU NUMA ID
GPU0    X     NV8   NV8   NV8   NV8   NV8   NV8   NV8   SYS   PIX   PHB   PHB   PHB   SYS   SYS   SYS   SYS   0-89          0              N/A
GPU1    NV8   X     NV8   NV8   NV8   NV8   NV8   NV8   SYS   PHB   PIX   PHB   PHB   SYS   SYS   SYS   SYS   0-89          0              N/A
GPU2    NV8   NV8   X     NV8   NV8   NV8   NV8   NV8   SYS   PHB   PHB   PIX   PHB   SYS   SYS   SYS   SYS   0-89          0              N/A
GPU3    NV8   NV8   NV8   X     NV8   NV8   NV8   NV8   SYS   PHB   PHB   PHB   PIX   SYS   SYS   SYS   SYS   0-89          0              N/A
GPU4    NV8   NV8   NV8   NV8   X     NV8   NV8   NV8   SYS   SYS   SYS   SYS   SYS   PIX   PHB   PHB   PHB   90-179        1              N/A
GPU5    NV8   NV8   NV8   NV8   NV8   X     NV8   NV8   SYS   SYS   SYS   SYS   SYS   PHB   PIX   PHB   PHB   90-179        1              N/A
GPU6    NV8   NV8   NV8   NV8   NV8   NV8   X     NV8   SYS   SYS   SYS   SYS   SYS   PHB   PHB   PIX   PHB   90-179        1              N/A
GPU7    NV8   NV8   NV8   NV8   NV8   NV8   NV8   X     SYS   SYS   SYS   SYS   SYS   PHB   PHB   PHB   PIX   90-179        1              N/A
NIC0    SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS   X     SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS
NIC1    PIX   PHB   PHB   PHB   SYS   SYS   SYS   SYS   SYS   X     PHB   PHB   PHB   SYS   SYS   SYS   SYS
NIC2    PHB   PIX   PHB   PHB   SYS   SYS   SYS   SYS   SYS   PHB   X     PHB   PHB   SYS   SYS   SYS   SYS
NIC3    PHB   PHB   PIX   PHB   SYS   SYS   SYS   SYS   SYS   PHB   PHB   X     PHB   SYS   SYS   SYS   SYS
NIC4    PHB   PHB   PHB   PIX   SYS   SYS   SYS   SYS   SYS   PHB   PHB   PHB   X     SYS   SYS   SYS   SYS
NIC5    SYS   SYS   SYS   SYS   PIX   PHB   PHB   PHB   SYS   SYS   SYS   SYS   SYS   X     PHB   PHB   PHB
NIC6    SYS   SYS   SYS   SYS   PHB   PIX   PHB   PHB   SYS   SYS   SYS   SYS   SYS   PHB   X     PHB   PHB
NIC7    SYS   SYS   SYS   SYS   PHB   PHB   PIX   PHB   SYS   SYS   SYS   SYS   SYS   PHB   PHB   X     PHB
NIC8    SYS   SYS   SYS   SYS   PHB   PHB   PHB   PIX   SYS   SYS   SYS   SYS   SYS   PHB   PHB   PHB   X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NIC Legend:
NIC0: mlx5_0
NIC1: mlx5_1
NIC2: mlx5_2
NIC3: mlx5_3
NIC4: mlx5_4
NIC5: mlx5_5
NIC6: mlx5_6
NIC7: mlx5_7
NIC8: mlx5_8
Hypervisor vendor: KVM
ulimit soft: 1048576
gdrcopy_copybw :
GPU id:0; name: NVIDIA H800; Bus id: 0000:63:00
GPU id:1; name: NVIDIA H800; Bus id: 0000:67:00
GPU id:2; name: NVIDIA H800; Bus id: 0000:6b:00
GPU id:3; name: NVIDIA H800; Bus id: 0000:6f:00
GPU id:4; name: NVIDIA H800; Bus id: 0000:a3:00
GPU id:5; name: NVIDIA H800; Bus id: 0000:a7:00
GPU id:6; name: NVIDIA H800; Bus id: 0000:ab:00
GPU id:7; name: NVIDIA H800; Bus id: 0000:af:00
selecting device 0
testing size: 131072
rounded size: 131072
gpu alloc fn: cuMemAlloc
device ptr: 7fbfb7e00000
map_d_ptr: 0x7fc1e81e7000
info.va: 7fbfb7e00000
info.mapped_size: 131072
info.page_size: 65536
info.mapped: 1
info.wc_mapping: 1
page offset: 0
user-space pointer:0x7fc1e81e7000
writing test, size=131072 offset=0 num_iters=10000
write BW: 17884.9MB/s
reading test, size=131072 offset=0 num_iters=100
read BW: 669.866MB/s
unmapping buffer
unpinning buffer
closing gdrdrv
I got the same error. Which version of Mooncake are you using?
Same error here:
I0513 11:04:17.446540 5509 transfer_engine.cpp:350] Metrics reporting is disabled (set MC_TE_METRIC=1 to enable)
I0513 11:04:17.446815 5509 transfer_engine.cpp:44] Transfer Engine starting. Server: 10.148.0.107, Metadata: P2PHANDSHAKE, ip_or_host_name: , rpc_port: 0
I0513 11:04:17.446906 5509 transfer_engine.cpp:100] Transfer Engine RPC using P2P handshake, listening on 10.148.0.107:16845
I0513 11:04:17.447018 5509 transfer_engine.cpp:112] Auto-discovering topology...
I0513 11:04:17.447331 5509 transfer_engine.cpp:127] Topology discovery complete. Found 3 HCAs.
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0513 11:04:17.451316 5511 transfer_engine.cpp:350] Metrics reporting is disabled (set MC_TE_METRIC=1 to enable)
I0513 11:04:17.451431 5511 transfer_engine.cpp:44] Transfer Engine starting. Server: 10.148.0.107, Metadata: P2PHANDSHAKE, ip_or_host_name: , rpc_port: 0
I0513 11:04:17.451501 5511 transfer_engine.cpp:100] Transfer Engine RPC using P2P handshake, listening on 10.148.0.107:16093
I0513 11:04:17.451609 5511 transfer_engine.cpp:112] Auto-discovering topology...
I0513 11:04:17.451884 5511 transfer_engine.cpp:127] Topology discovery complete. Found 3 HCAs.
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0513 11:04:17.452726 5512 transfer_engine.cpp:350] Metrics reporting is disabled (set MC_TE_METRIC=1 to enable)
I0513 11:04:17.452818 5512 transfer_engine.cpp:44] Transfer Engine starting. Server: 10.148.0.107, Metadata: P2PHANDSHAKE, ip_or_host_name: , rpc_port: 0
I0513 11:04:17.452875 5512 transfer_engine.cpp:100] Transfer Engine RPC using P2P handshake, listening on 10.148.0.107:15853
I0513 11:04:17.452991 5512 transfer_engine.cpp:112] Auto-discovering topology...
I0513 11:04:17.453236 5512 transfer_engine.cpp:127] Topology discovery complete. Found 3 HCAs.
I0513 11:04:17.457496 5509 rdma_context.cpp:411] Find best gid index: 0 on mlx5_4"temp"/
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0513 11:04:17.457509 5505 transfer_engine.cpp:350] Metrics reporting is disabled (set MC_TE_METRIC=1 to enable)
I0513 11:04:17.457654 5505 transfer_engine.cpp:44] Transfer Engine starting. Server: 10.148.0.107, Metadata: P2PHANDSHAKE, ip_or_host_name: , rpc_port: 0
I0513 11:04:17.457715 5505 transfer_engine.cpp:100] Transfer Engine RPC using P2P handshake, listening on 10.148.0.107:16766
I0513 11:04:17.457805 5505 transfer_engine.cpp:112] Auto-discovering topology...
I0513 11:04:17.458135 5505 transfer_engine.cpp:127] Topology discovery complete. Found 3 HCAs.
I0513 11:04:17.458601 5509 rdma_context.cpp:125] RDMA device: mlx5_4"temp", LID: 66, GID: (GID_Index 0) fe:80:00:00:00:00:00:00:a0:88:c2:03:00:24:c9:26
I0513 11:04:17.461931 5511 rdma_context.cpp:411] Find best gid index: 0 on mlx5_4"temp"/
I0513 11:04:17.462610 5511 rdma_context.cpp:125] RDMA device: mlx5_4"temp", LID: 66, GID: (GID_Index 0) fe:80:00:00:00:00:00:00:a0:88:c2:03:00:24:c9:26
I0513 11:04:17.462841 5512 rdma_context.cpp:411] Find best gid index: 0 on mlx5_4"temp"/
I0513 11:04:17.463502 5512 rdma_context.cpp:125] RDMA device: mlx5_4"temp", LID: 66, GID: (GID_Index 0) fe:80:00:00:00:00:00:00:a0:88:c2:03:00:24:c9:26
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0513 11:04:17.464078 5508 transfer_engine.cpp:350] Metrics reporting is disabled (set MC_TE_METRIC=1 to enable)
I0513 11:04:17.464174 5508 transfer_engine.cpp:44] Transfer Engine starting. Server: 10.148.0.107, Metadata: P2PHANDSHAKE, ip_or_host_name: , rpc_port: 0
I0513 11:04:17.464231 5508 transfer_engine.cpp:100] Transfer Engine RPC using P2P handshake, listening on 10.148.0.107:15913
I0513 11:04:17.464336 5508 transfer_engine.cpp:112] Auto-discovering topology...
I0513 11:04:17.464612 5508 transfer_engine.cpp:127] Topology discovery complete. Found 3 HCAs.
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0513 11:04:17.467101 5507 transfer_engine.cpp:350] Metrics reporting is disabled (set MC_TE_METRIC=1 to enable)
I0513 11:04:17.467196 5507 transfer_engine.cpp:44] Transfer Engine starting. Server: 10.148.0.107, Metadata: P2PHANDSHAKE, ip_or_host_name: , rpc_port: 0
I0513 11:04:17.467249 5507 transfer_engine.cpp:100] Transfer Engine RPC using P2P handshake, listening on 10.148.0.107:15905
I0513 11:04:17.467348 5507 transfer_engine.cpp:112] Auto-discovering topology...
I0513 11:04:17.467649 5507 transfer_engine.cpp:127] Topology discovery complete. Found 3 HCAs.
I0513 11:04:17.468641 5505 rdma_context.cpp:411] Find best gid index: 0 on mlx5_4"temp"/
I0513 11:04:17.469282 5505 rdma_context.cpp:125] RDMA device: mlx5_4"temp", LID: 66, GID: (GID_Index 0) fe:80:00:00:00:00:00:00:a0:88:c2:03:00:24:c9:26
I0513 11:04:17.469362 5509 rdma_context.cpp:411] Find best gid index: 0 on mlx5_5"temp"/
I0513 11:04:17.470508 5509 rdma_context.cpp:125] RDMA device: mlx5_5"temp", LID: 51, GID: (GID_Index 0) fe:80:00:00:00:00:00:00:a0:88:c2:03:00:2a:14:aa
I0513 11:04:17.471184 5511 rdma_context.cpp:411] Find best gid index: 0 on mlx5_5"temp"/
I0513 11:04:17.471905 5511 rdma_context.cpp:125] RDMA device: mlx5_5"temp", LID: 51, GID: (GID_Index 0) fe:80:00:00:00:00:00:00:a0:88:c2:03:00:2a:14:aa
I0513 11:04:17.473951 5508 rdma_context.cpp:411] Find best gid index: 0 on mlx5_4"temp"/
I0513 11:04:17.474507 5512 rdma_context.cpp:411] Find best gid index: 0 on mlx5_5"temp"/
I0513 11:04:17.474612 5508 rdma_context.cpp:125] RDMA device: mlx5_4"temp", LID: 66, GID: (GID_Index 0) fe:80:00:00:00:00:00:00:a0:88:c2:03:00:24:c9:26
I0513 11:04:17.475162 5512 rdma_context.cpp:125] RDMA device: mlx5_5"temp", LID: 51, GID: (GID_Index 0) fe:80:00:00:00:00:00:00:a0:88:c2:03:00:2a:14:aa
I0513 11:04:17.477639 5507 rdma_context.cpp:411] Find best gid index: 0 on mlx5_4"temp"/
I0513 11:04:17.478277 5507 rdma_context.cpp:125] RDMA device: mlx5_4"temp", LID: 66, GID: (GID_Index 0) fe:80:00:00:00:00:00:00:a0:88:c2:03:00:24:c9:26
I0513 11:04:17.479204 5505 rdma_context.cpp:411] Find best gid index: 0 on mlx5_5"temp"/
I0513 11:04:17.479861 5505 rdma_context.cpp:125] RDMA device: mlx5_5"temp", LID: 51, GID: (GID_Index 0) fe:80:00:00:00:00:00:00:a0:88:c2:03:00:2a:14:aa
I0513 11:04:17.480155 5509 rdma_context.cpp:411] Find best gid index: 0 on mlx5_3/
I0513 11:04:17.480209 5511 rdma_context.cpp:411] Find best gid index: 0 on mlx5_3/
I0513 11:04:17.481312 5509 rdma_context.cpp:125] RDMA device: mlx5_3, LID: 31, GID: (GID_Index 0) fe:80:00:00:00:00:00:00:a0:88:c2:03:00:2a:13:92
I0513 11:04:17.481333 5511 rdma_context.cpp:125] RDMA device: mlx5_3, LID: 31, GID: (GID_Index 0) fe:80:00:00:00:00:00:00:a0:88:c2:03:00:2a:13:92
I0513 11:04:17.484032 5508 rdma_context.cpp:411] Find best gid index: 0 on mlx5_5"temp"/
I0513 11:04:17.484501 5512 rdma_context.cpp:411] Find best gid index: 0 on mlx5_3/
I0513 11:04:17.484687 5508 rdma_context.cpp:125] RDMA device: mlx5_5"temp", LID: 51, GID: (GID_Index 0) fe:80:00:00:00:00:00:00:a0:88:c2:03:00:2a:14:aa
I0513 11:04:17.485165 5512 rdma_context.cpp:125] RDMA device: mlx5_3, LID: 31, GID: (GID_Index 0) fe:80:00:00:00:00:00:00:a0:88:c2:03:00:2a:13:92
I0513 11:04:17.488559 5507 rdma_context.cpp:411] Find best gid index: 0 on mlx5_5"temp"/
I0513 11:04:17.489305 5507 rdma_context.cpp:125] RDMA device: mlx5_5"temp", LID: 51, GID: (GID_Index 0) fe:80:00:00:00:00:00:00:a0:88:c2:03:00:2a:14:aa
I0513 11:04:17.489553 5505 rdma_context.cpp:411] Find best gid index: 0 on mlx5_3/
I0513 11:04:17.490327 5505 rdma_context.cpp:125] RDMA device: mlx5_3, LID: 31, GID: (GID_Index 0) fe:80:00:00:00:00:00:00:a0:88:c2:03:00:2a:13:92
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0513 11:04:17.491269 5506 transfer_engine.cpp:350] Metrics reporting is disabled (set MC_TE_METRIC=1 to enable)
I0513 11:04:17.491389 5506 transfer_engine.cpp:44] Transfer Engine starting. Server: 10.148.0.107, Metadata: P2PHANDSHAKE, ip_or_host_name: , rpc_port: 0
I0513 11:04:17.491463 5506 transfer_engine.cpp:100] Transfer Engine RPC using P2P handshake, listening on 10.148.0.107:16507
I0513 11:04:17.491551 5506 transfer_engine.cpp:112] Auto-discovering topology...
I0513 11:04:17.491820 5506 transfer_engine.cpp:127] Topology discovery complete. Found 3 HCAs.
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0513 11:04:17.493638 5510 transfer_engine.cpp:350] Metrics reporting is disabled (set MC_TE_METRIC=1 to enable)
I0513 11:04:17.493726 5510 transfer_engine.cpp:44] Transfer Engine starting. Server: 10.148.0.107, Metadata: P2PHANDSHAKE, ip_or_host_name: , rpc_port: 0
I0513 11:04:17.493782 5510 transfer_engine.cpp:100] Transfer Engine RPC using P2P handshake, listening on 10.148.0.107:16320
I0513 11:04:17.493878 5510 transfer_engine.cpp:112] Auto-discovering topology...
I0513 11:04:17.494138 5510 transfer_engine.cpp:127] Topology discovery complete. Found 3 HCAs.
I0513 11:04:17.496711 5508 rdma_context.cpp:411] Find best gid index: 0 on mlx5_3/
I0513 11:04:17.497254 5507 rdma_context.cpp:411] Find best gid index: 0 on mlx5_3/
I0513 11:04:17.497359 5508 rdma_context.cpp:125] RDMA device: mlx5_3, LID: 31, GID: (GID_Index 0) fe:80:00:00:00:00:00:00:a0:88:c2:03:00:2a:13:92
I0513 11:04:17.498096 5507 rdma_context.cpp:125] RDMA device: mlx5_3, LID: 31, GID: (GID_Index 0) fe:80:00:00:00:00:00:00:a0:88:c2:03:00:2a:13:92
I0513 11:04:17.502910 5506 rdma_context.cpp:411] Find best gid index: 0 on mlx5_4"temp"/
I0513 11:04:17.503547 5510 rdma_context.cpp:411] Find best gid index: 0 on mlx5_4"temp"/
I0513 11:04:17.503665 5506 rdma_context.cpp:125] RDMA device: mlx5_4"temp", LID: 66, GID: (GID_Index 0) fe:80:00:00:00:00:00:00:a0:88:c2:03:00:24:c9:26
I0513 11:04:17.504220 5510 rdma_context.cpp:125] RDMA device: mlx5_4"temp", LID: 66, GID: (GID_Index 0) fe:80:00:00:00:00:00:00:a0:88:c2:03:00:24:c9:26
I0513 11:04:17.512956 5506 rdma_context.cpp:411] Find best gid index: 0 on mlx5_5"temp"/
I0513 11:04:17.513473 5510 rdma_context.cpp:411] Find best gid index: 0 on mlx5_5"temp"/
I0513 11:04:17.513661 5506 rdma_context.cpp:125] RDMA device: mlx5_5"temp", LID: 51, GID: (GID_Index 0) fe:80:00:00:00:00:00:00:a0:88:c2:03:00:2a:14:aa
I0513 11:04:17.514109 5510 rdma_context.cpp:125] RDMA device: mlx5_5"temp", LID: 51, GID: (GID_Index 0) fe:80:00:00:00:00:00:00:a0:88:c2:03:00:2a:14:aa
I0513 11:04:17.522951 5506 rdma_context.cpp:411] Find best gid index: 0 on mlx5_3/
I0513 11:04:17.523634 5506 rdma_context.cpp:125] RDMA device: mlx5_3, LID: 31, GID: (GID_Index 0) fe:80:00:00:00:00:00:00:a0:88:c2:03:00:2a:13:92
I0513 11:04:17.524181 5510 rdma_context.cpp:411] Find best gid index: 0 on mlx5_3/
I0513 11:04:17.524910 5510 rdma_context.cpp:125] RDMA device: mlx5_3, LID: 31, GID: (GID_Index 0) fe:80:00:00:00:00:00:00:a0:88:c2:03:00:2a:13:92
E0513 11:04:18.122572 5509 rdma_context.cpp:198] Failed to register memory 0x7f7a4e000000: Bad address [14]
[2025-05-13 11:04:18 TP4] Mooncake memory registration failed.
[2025-05-13 11:04:18 TP4] Scheduler hit an exception: Traceback (most recent call last):
  File "/opt/app/python3.12/lib/python3.12/site-packages/sglang/srt/managers/scheduler.py", line 2372, in run_scheduler_process
    scheduler = Scheduler(server_args, port_args, gpu_id, tp_rank, pp_rank, dp_rank)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/app/python3.12/lib/python3.12/site-packages/sglang/srt/managers/scheduler.py", line 463, in __init__
    self.init_disaggregation()
  File "/opt/app/python3.12/lib/python3.12/site-packages/sglang/srt/managers/scheduler.py", line 620, in init_disaggregation
    self.disagg_prefill_bootstrap_queue = PrefillBootstrapQueue(
                                          ^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/app/python3.12/lib/python3.12/site-packages/sglang/srt/disaggregation/prefill.py", line 82, in __init__
    self.kv_manager = self._init_kv_manager()
                      ^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/app/python3.12/lib/python3.12/site-packages/sglang/srt/disaggregation/prefill.py", line 116, in _init_kv_manager
    kv_manager = kv_manager_class(
                 ^^^^^^^^^^^^^^^^^
  File "/opt/app/python3.12/lib/python3.12/site-packages/sglang/srt/disaggregation/mooncake/conn.py", line 146, in __init__
    self.register_buffer_to_engine()
  File "/opt/app/python3.12/lib/python3.12/site-packages/sglang/srt/disaggregation/mooncake/conn.py", line 173, in register_buffer_to_engine
    self.engine.register(kv_data_ptr, kv_data_len)
  File "/opt/app/python3.12/lib/python3.12/site-packages/sglang/srt/disaggregation/mooncake/transfer_engine.py", line 36, in register
    raise RuntimeError("Mooncake memory registration failed.")
RuntimeError: Mooncake memory registration failed.
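With eight processes interleaving their output, it can be hard to see which one actually failed. A rough sketch for pulling the error lines out of a log like the one above, grouped by the id in the third field. It assumes the standard glog line layout (severity letter, mmdd, time, thread id, file:line), and `summarize_errors` is a hypothetical helper, not part of sglang or Mooncake:

```python
import re
from collections import defaultdict

# glog layout: <severity><mmdd> <hh:mm:ss.uuuuuu> <thread id> <file>:<line>] <message>
GLOG_LINE = re.compile(
    r"^(?P<sev>[IWEF])(?P<date>\d{4}) (?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+"
    r"(?P<tid>\d+) (?P<file>[^:\s]+):(?P<line>\d+)\] (?P<msg>.*)$"
)


def summarize_errors(log_text):
    """Map each emitting thread/process id to its error (E) and fatal (F) messages."""
    errors = defaultdict(list)
    for raw in log_text.splitlines():
        match = GLOG_LINE.match(raw.strip())
        if match and match["sev"] in ("E", "F"):
            errors[int(match["tid"])].append(match["msg"])
    return dict(errors)


sample = """\
I0513 11:04:17.447331 5509 transfer_engine.cpp:127] Topology discovery complete. Found 3 HCAs.
I0513 11:04:17.524910 5510 rdma_context.cpp:125] RDMA device: mlx5_3, LID: 31, GID: (GID_Index 0) fe:80:00:00:00:00:00:00:a0:88:c2:03:00:2a:13:92
E0513 11:04:18.122572 5509 rdma_context.cpp:198] Failed to register memory 0x7f7a4e000000: Bad address [14]
"""
print(summarize_errors(sample))
```

In the log posted here, the only error line comes from id 5509, which lines up with the TP4 scheduler exception.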
Closed as resolved in https://github.com/kvcache-ai/Mooncake/issues/351.
I got the same error. Which version of Mooncake are you using?
Have you completely resolved this issue? @feng397
We also hit this problem and have no idea what the root cause is.
See https://github.com/pytorch/pytorch/issues/153688#issuecomment-2891704714 for reference, @mingxiao666