[Bug]: nvlink_transport_test error
Bug Report
I ran nvlink_transport_test on an 8*H800 server with NVLink. I am trying to use NVLink to transfer the KV cache between two GPUs for PD (prefill/decode) disaggregation.
root@hpc-05:/home/wq/Mooncake/build/mooncake-transfer-engine/tests# ./nvlink_transport_test -metadata_server 172.16.106.101:2379 -local_server_name 127.0.0.1:12345 -segment_id 127.0.0.1:123456 -gpu_id 0
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from NvlinkTransportTest
[ RUN ] NvlinkTransportTest.WriteAndRead
WARNING: Logging before InitGoogleLogging() is written to STDERR
I20251104 10:56:26.699167 140045959401472 transfer_engine.cpp:486] Metrics reporting is disabled (set MC_TE_METRIC=1 to enable)
I20251104 10:56:26.699298 140045959401472 transfer_engine.cpp:91] Transfer Engine parseHostNameWithPort. server_name: 127.0.0.1 port: 12345
I20251104 10:56:26.702242 140045959401472 transfer_metadata_plugin.cpp:1127] Found active interface ibs113 with IP 172.16.106.102
I20251104 10:56:26.702263 140045959401472 transfer_metadata_plugin.cpp:1127] Found active interface bond0 with IP 10.6.2.80
I20251104 10:56:26.702267 140045959401472 transfer_metadata_plugin.cpp:1116] Skipping interface docker0 (not UP or not RUNNING)
I20251104 10:56:26.702271 140045959401472 transfer_metadata_plugin.cpp:1127] Found active interface br-cd1a0b29f5cb with IP 172.18.0.1
I20251104 10:56:26.702303 140045959401472 transfer_engine.cpp:146] Transfer Engine RPC using new RPC mapping, listening on 172.16.106.102:16279
W20251104 10:56:48.522269 140045959401472 nvlink_transport.cpp:385] Memory region 0x7f5c66800000 is not allocated by cuMemCreate, but it can be used as local buffer
W20251104 10:56:48.524547 140045959401472 transfer_metadata.cpp:480] Failed to retrieve segment descriptor, name 127.0.0.1:123456
I20251104 10:56:48.524582 140045959401472 transfer_engine.cpp:486] Metrics reporting is disabled (set MC_TE_METRIC=1 to enable)
I20251104 10:56:48.524593 140045959401472 transfer_engine.cpp:91] Transfer Engine parseHostNameWithPort. server_name: cuda_client port: 12346
I20251104 10:56:48.526194 140045959401472 transfer_metadata_plugin.cpp:1127] Found active interface ibs113 with IP 172.16.106.102
I20251104 10:56:48.526205 140045959401472 transfer_metadata_plugin.cpp:1127] Found active interface bond0 with IP 10.6.2.80
I20251104 10:56:48.526210 140045959401472 transfer_metadata_plugin.cpp:1116] Skipping interface docker0 (not UP or not RUNNING)
I20251104 10:56:48.526214 140045959401472 transfer_metadata_plugin.cpp:1127] Found active interface br-cd1a0b29f5cb with IP 172.18.0.1
I20251104 10:56:48.526229 140045959401472 transfer_engine.cpp:146] Transfer Engine RPC using new RPC mapping, listening on 172.16.106.102:15015
W20251104 10:56:48.528450 140045959401472 nvlink_transport.cpp:385] Memory region 0x7f5c67000000 is not allocated by cuMemCreate, but it can be used as local buffer
/home/wq/Mooncake/mooncake-transfer-engine/tests/nvlink_transport_test.cpp:89: Failure
Value of: s.ok()
Actual: false
Expected: true
[ FAILED ] NvlinkTransportTest.WriteAndRead (21835 ms)
[----------] 1 test from NvlinkTransportTest (21835 ms total)
[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (21835 ms total)
[ PASSED ] 0 tests.
[ FAILED ] 1 test, listed below:
[ FAILED ] NvlinkTransportTest.WriteAndRead
1 FAILED TEST
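For reference on the warning at nvlink_transport.cpp:385 above ("not allocated by cuMemCreate"): below is a minimal standalone sketch, not Mooncake code, of what a cuMemCreate-backed allocation (CUDA virtual memory management API) looks like, as opposed to plain `cudaMalloc`. Whether NvlinkTransport strictly requires this allocation path for remote buffers is an assumption here; the warning only says the registered region was not allocated that way but can still be used as a local buffer.

```cpp
// Hypothetical sketch: allocate device memory via the CUDA VMM path
// (cuMemCreate + cuMemMap) instead of cudaMalloc.
#include <cuda.h>
#include <cstdio>
#include <cstdlib>

#define CHECK(call)                                              \
    do {                                                         \
        CUresult rc = (call);                                    \
        if (rc != CUDA_SUCCESS) {                                \
            const char *msg = nullptr;                           \
            cuGetErrorString(rc, &msg);                          \
            std::fprintf(stderr, "%s failed: %s\n", #call, msg); \
            std::exit(1);                                        \
        }                                                        \
    } while (0)

int main() {
    CHECK(cuInit(0));
    CUdevice dev;
    CUcontext ctx;
    CHECK(cuDeviceGet(&dev, 0));
    CHECK(cuCtxCreate(&ctx, 0, dev));

    // Describe a pinned device allocation on GPU 0.
    CUmemAllocationProp prop = {};
    prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
    prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    prop.location.id = 0;

    // Size must be a multiple of the allocation granularity.
    size_t granularity = 0;
    CHECK(cuMemGetAllocationGranularity(&granularity, &prop,
                                        CU_MEM_ALLOC_GRANULARITY_MINIMUM));
    size_t size = ((64ULL << 20) + granularity - 1) / granularity * granularity;

    // cuMemCreate returns a physical handle; map it into a reserved VA range.
    CUmemGenericAllocationHandle handle;
    CHECK(cuMemCreate(&handle, size, &prop, 0));
    CUdeviceptr ptr;
    CHECK(cuMemAddressReserve(&ptr, size, 0, 0, 0));
    CHECK(cuMemMap(ptr, size, 0, handle, 0));

    // Grant the local device read/write access to the mapping.
    CUmemAccessDesc access = {};
    access.location = prop.location;
    access.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
    CHECK(cuMemSetAccess(ptr, size, &access, 1));

    std::printf("cuMemCreate-backed buffer at %p, %zu bytes\n",
                (void *)ptr, size);

    // Teardown in reverse order.
    CHECK(cuMemUnmap(ptr, size));
    CHECK(cuMemAddressFree(ptr, size));
    CHECK(cuMemRelease(handle));
    CHECK(cuCtxDestroy(ctx));
    return 0;
}
```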
Before submitting...
- [x] Ensure you searched for relevant issues and read the [documentation]
The NVLink transport is currently available on multi-node NVLink clusters only. For regular NVLink, you can use the RDMA transport.
So a single-node server cannot use NVLink P2P to transmit the KV cache?
What are multi-node NVLink clusters? Can't the transfer be done over NVLink on a single machine (for example, with 8 cards)? @alogfans
NVL72 is a multi-node NVLink system.
@thqq479 Currently, NVLink on a single machine is not supported yet. Maybe try this branch: https://github.com/kvcache-ai/Mooncake/tree/tent.
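For anyone who wants to verify that their single node actually has GPU-to-GPU P2P (separate from whether the transport supports it), here is a minimal sketch using the CUDA runtime's peer-access query. This only probes the hardware/driver path; per the comments above, Mooncake's NVLink transport currently targets multi-node NVLink fabrics, so a positive result here does not by itself mean this test will pass.

```cpp
// Minimal sketch (not Mooncake code): query CUDA peer-to-peer reachability
// between every pair of visible GPUs on one node.
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int n = 0;
    if (cudaGetDeviceCount(&n) != cudaSuccess || n < 2) {
        std::printf("need at least two visible GPUs\n");
        return 1;
    }
    for (int src = 0; src < n; ++src) {
        for (int dst = 0; dst < n; ++dst) {
            if (src == dst) continue;
            int canAccess = 0;
            cudaDeviceCanAccessPeer(&canAccess, src, dst);
            std::printf("GPU %d -> GPU %d: P2P %s\n", src, dst,
                        canAccess ? "supported" : "not supported");
        }
    }
    return 0;
}
```

`nvidia-smi topo -m` gives a similar view of the NVLink/PCIe topology without writing any code.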