[Bug]: nvlink_transport_test error
Bug Report
I ran nvlink_transport_test on an 8*H800 server with NVLink. I am trying to use NVLink to transfer the KV cache between two GPUs for PD (prefill/decode) disaggregation.
root@hpc-05:/home/wq/Mooncake/build/mooncake-transfer-engine/tests# ./nvlink_transport_test -metadata_server 172.16.106.101:2379 -local_server_name 127.0.0.1:12345 -segment_id 127.0.0.1:123456 -gpu_id 0
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from NvlinkTransportTest
[ RUN ] NvlinkTransportTest.WriteAndRead
WARNING: Logging before InitGoogleLogging() is written to STDERR
I20251104 10:56:26.699167 140045959401472 transfer_engine.cpp:486] Metrics reporting is disabled (set MC_TE_METRIC=1 to enable)
I20251104 10:56:26.699298 140045959401472 transfer_engine.cpp:91] Transfer Engine parseHostNameWithPort. server_name: 127.0.0.1 port: 12345
I20251104 10:56:26.702242 140045959401472 transfer_metadata_plugin.cpp:1127] Found active interface ibs113 with IP 172.16.106.102
I20251104 10:56:26.702263 140045959401472 transfer_metadata_plugin.cpp:1127] Found active interface bond0 with IP 10.6.2.80
I20251104 10:56:26.702267 140045959401472 transfer_metadata_plugin.cpp:1116] Skipping interface docker0 (not UP or not RUNNING)
I20251104 10:56:26.702271 140045959401472 transfer_metadata_plugin.cpp:1127] Found active interface br-cd1a0b29f5cb with IP 172.18.0.1
I20251104 10:56:26.702303 140045959401472 transfer_engine.cpp:146] Transfer Engine RPC using new RPC mapping, listening on 172.16.106.102:16279
W20251104 10:56:48.522269 140045959401472 nvlink_transport.cpp:385] Memory region 0x7f5c66800000 is not allocated by cuMemCreate, but it can be used as local buffer
W20251104 10:56:48.524547 140045959401472 transfer_metadata.cpp:480] Failed to retrieve segment descriptor, name 127.0.0.1:123456
I20251104 10:56:48.524582 140045959401472 transfer_engine.cpp:486] Metrics reporting is disabled (set MC_TE_METRIC=1 to enable)
I20251104 10:56:48.524593 140045959401472 transfer_engine.cpp:91] Transfer Engine parseHostNameWithPort. server_name: cuda_client port: 12346
I20251104 10:56:48.526194 140045959401472 transfer_metadata_plugin.cpp:1127] Found active interface ibs113 with IP 172.16.106.102
I20251104 10:56:48.526205 140045959401472 transfer_metadata_plugin.cpp:1127] Found active interface bond0 with IP 10.6.2.80
I20251104 10:56:48.526210 140045959401472 transfer_metadata_plugin.cpp:1116] Skipping interface docker0 (not UP or not RUNNING)
I20251104 10:56:48.526214 140045959401472 transfer_metadata_plugin.cpp:1127] Found active interface br-cd1a0b29f5cb with IP 172.18.0.1
I20251104 10:56:48.526229 140045959401472 transfer_engine.cpp:146] Transfer Engine RPC using new RPC mapping, listening on 172.16.106.102:15015
W20251104 10:56:48.528450 140045959401472 nvlink_transport.cpp:385] Memory region 0x7f5c67000000 is not allocated by cuMemCreate, but it can be used as local buffer
/home/wq/Mooncake/mooncake-transfer-engine/tests/nvlink_transport_test.cpp:89: Failure
Value of: s.ok()
Actual: false
Expected: true
[ FAILED ] NvlinkTransportTest.WriteAndRead (21835 ms)
[----------] 1 test from NvlinkTransportTest (21835 ms total)
[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (21835 ms total)
[ PASSED ] 0 tests.
[ FAILED ] 1 test, listed below:
[ FAILED ] NvlinkTransportTest.WriteAndRead
1 FAILED TEST
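For reference on the warning at nvlink_transport.cpp:385 above ("not allocated by cuMemCreate"): below is a minimal standalone sketch, not Mooncake code, of what a cuMemCreate-backed allocation (CUDA virtual memory management API) looks like, as opposed to plain `cudaMalloc`. Whether NvlinkTransport strictly requires this allocation path for remote buffers is an assumption here; the warning only says the registered region was not allocated that way but can still be used as a local buffer.

```cpp
// Hypothetical sketch: allocate device memory via the CUDA VMM path
// (cuMemCreate + cuMemMap) instead of cudaMalloc.
#include <cuda.h>
#include <cstdio>
#include <cstdlib>

#define CHECK(call)                                              \
    do {                                                         \
        CUresult rc = (call);                                    \
        if (rc != CUDA_SUCCESS) {                                \
            const char *msg = nullptr;                           \
            cuGetErrorString(rc, &msg);                          \
            std::fprintf(stderr, "%s failed: %s\n", #call, msg); \
            std::exit(1);                                        \
        }                                                        \
    } while (0)

int main() {
    CHECK(cuInit(0));
    CUdevice dev;
    CUcontext ctx;
    CHECK(cuDeviceGet(&dev, 0));
    CHECK(cuCtxCreate(&ctx, 0, dev));

    // Describe a pinned device allocation on GPU 0.
    CUmemAllocationProp prop = {};
    prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
    prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    prop.location.id = 0;

    // Size must be a multiple of the allocation granularity.
    size_t granularity = 0;
    CHECK(cuMemGetAllocationGranularity(&granularity, &prop,
                                        CU_MEM_ALLOC_GRANULARITY_MINIMUM));
    size_t size = ((64ULL << 20) + granularity - 1) / granularity * granularity;

    // cuMemCreate returns a physical handle; map it into a reserved VA range.
    CUmemGenericAllocationHandle handle;
    CHECK(cuMemCreate(&handle, size, &prop, 0));
    CUdeviceptr ptr;
    CHECK(cuMemAddressReserve(&ptr, size, 0, 0, 0));
    CHECK(cuMemMap(ptr, size, 0, handle, 0));

    // Grant the local device read/write access to the mapping.
    CUmemAccessDesc access = {};
    access.location = prop.location;
    access.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
    CHECK(cuMemSetAccess(ptr, size, &access, 1));

    std::printf("cuMemCreate-backed buffer at %p, %zu bytes\n",
                (void *)ptr, size);

    // Teardown in reverse order.
    CHECK(cuMemUnmap(ptr, size));
    CHECK(cuMemAddressFree(ptr, size));
    CHECK(cuMemRelease(handle));
    CHECK(cuCtxDestroy(ctx));
    return 0;
}
```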
Before submitting...
- [x] Ensure you searched for relevant issues and read the [documentation]
The NVLink transport is currently available on multi-node NVLink clusters only. For regular NVLink, you can use the RDMA transport.
So a single-node server cannot use NVLink P2P to transmit the KV cache?
What are multi-node NVLink clusters? Can't the transfer be done over NVLink on a single machine (for example, with 8 cards)? @alogfans
NVL72 is a multi-node NVLink system.
@thqq479 Currently, NVLink on a single machine is not supported yet. Maybe try this branch: https://github.com/kvcache-ai/Mooncake/tree/tent.
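For anyone who wants to verify that their single node actually has GPU-to-GPU P2P (separate from whether the transport supports it), here is a minimal sketch using the CUDA runtime's peer-access query. This only probes the hardware/driver path; per the comments above, Mooncake's NVLink transport currently targets multi-node NVLink fabrics, so a positive result here does not by itself mean this test will pass.

```cpp
// Minimal sketch (not Mooncake code): query CUDA peer-to-peer reachability
// between every pair of visible GPUs on one node.
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int n = 0;
    if (cudaGetDeviceCount(&n) != cudaSuccess || n < 2) {
        std::printf("need at least two visible GPUs\n");
        return 1;
    }
    for (int src = 0; src < n; ++src) {
        for (int dst = 0; dst < n; ++dst) {
            if (src == dst) continue;
            int canAccess = 0;
            cudaDeviceCanAccessPeer(&canAccess, src, dst);
            std::printf("GPU %d -> GPU %d: P2P %s\n", src, dst,
                        canAccess ? "supported" : "not supported");
        }
    }
    return 0;
}
```

`nvidia-smi topo -m` gives a similar view of the NVLink/PCIe topology without writing any code.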