[Usage]: [Handshake] Failed to modify QP to INIT
Describe your usage question
I am using Intel E810 RDMA NICs.
PC1's IP is 100.100.100.1:
enp5s0f0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 4096
    inet 100.100.100.1 netmask 255.255.255.0 broadcast 0.0.0.0
    txqueuelen 1000 (Ethernet)
    RX packets 3748 bytes 289039 (289.0 KB)
    RX errors 0 dropped 0 overruns 0 frame 0
    TX packets 16847 bytes 66597484 (66.5 MB)
    TX errors 425 dropped 0 overruns 0 carrier 0 collisions 0
PC2's IP is 100.100.100.2:
enp3s0f1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 4096
    inet 100.100.100.2 netmask 255.255.255.0 broadcast 0.0.0.0
    txqueuelen 1000 (Ethernet)
    RX packets 16681 bytes 66580672 (66.5 MB)
    RX errors 0 dropped 0 overruns 0 frame 0
    TX packets 3938 bytes 310487 (310.4 KB)
    TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
PC1 NIC information (ibstat):
CA 'rocep5s0f0'
    CA type:
    Number of ports: 1
    Firmware version: 1.74
    Hardware version:
    Node GUID: 0x527c6ffffe3c3b56
    System image GUID: 0x527c6ffffe3c3b56
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 10 (FDR10)
        Base lid: 1
        LMC: 0
        SM lid: 0
        Capability mask: 0x00050000
        Port GUID: 0x527c6ffffe3c3b56
        Link layer: Ethernet
PC2 NIC information:
CA 'rocep3s0f1'
    CA type:
    Number of ports: 1
    Firmware version: 1.74
    Hardware version:
    Node GUID: 0x527c6ffffe1b5f2f
    System image GUID: 0x527c6ffffe1b5f2f
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 10 (FDR10)
        Base lid: 1
        LMC: 0
        SM lid: 0
        Capability mask: 0x00050000
        Port GUID: 0x527c6ffffe1b5f2f
        Link layer: Ethernet
Testing with perftest's ib_write_bw works fine in both directions:
ib_write_bw -d rocep3s0f1
ib_write_bw -d rocep5s0f0 -i 1 100.100.100.2 -n 1000 -s 65536
On PC2, I run the following commands:
export MC_IB_PORT=1
export MC_GID_INDEX=1
export MC_TE_METRIC=1
./transfer_engine_bench --mode=target \
    --metadata_server=P2PHANDSHAKE \
    --local_server_name=100.100.100.2:17999 \
    --device_name=rocep3s0f1
Output:
WARNING: Logging before InitGoogleLogging() is written to STDERR
I1201 14:12:43.102025 31475 transfer_engine.cpp:91] Transfer Engine parseHostNameWithPort. server_name: 100.100.100.2 port: 17999
I1201 14:12:43.102074 31476 transfer_engine.cpp:493] Metrics reporting thread started (interval: 5s)
I1201 14:12:43.102075 31475 transfer_engine.cpp:146] Transfer Engine RPC using P2P handshake, listening on 100.100.100.2:16470
I1201 14:12:43.102236 31475 rdma_context.cpp:77] Using SIEVE endpoint store
I1201 14:12:43.103399 31475 rdma_context.cpp:586] Using user-specified GID index: 1 on rocep3s0f1/ (with network device)
I1201 14:12:43.103634 31475 rdma_context.cpp:140] RDMA device: rocep3s0f1, LID: 1, GID: (GID_Index 1) 00:00:00:00:00:00:00:00:00:00:ff:ff:64:64:64:02
I1201 14:12:43.103649 31475 transfer_engine_bench.cpp:484] DRAM is used, numa node num: 1
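As a sanity check on the log above: the GID ends in ff:ff:64:64:64:02, which is the IPv4-mapped form of 100.100.100.2, so GID index 1 does appear to correspond to the interface address. The following is a minimal sketch of that decoding, assuming the colon-separated 16-byte format printed in the log. The helper names hex_value, gid_to_ipv4 and check_logged_gids are hypothetical and are not part of Mooncake.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Returns the value of one hex digit, or -1 if c is not a hex digit.
inline int hex_value(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

// Hypothetical helper: converts a GID printed as 16 colon-separated hex bytes
// into dotted IPv4 text. Returns an empty string if the text is malformed or
// the GID is not IPv4-mapped (ten zero bytes, then ff:ff, then the address).
inline std::string gid_to_ipv4(const std::string& gid_text) {
    const std::size_t kBytes = 16;
    if (gid_text.size() != kBytes * 3 - 1) return "";

    std::array<std::uint8_t, 16> gid{};
    for (std::size_t i = 0; i < kBytes; ++i) {
        const int hi = hex_value(gid_text[3 * i]);
        const int lo = hex_value(gid_text[3 * i + 1]);
        if (hi < 0 || lo < 0) return "";
        // Every byte except the last must be followed by a colon.
        if (i + 1 < kBytes && gid_text[3 * i + 2] != ':') return "";
        gid[i] = static_cast<std::uint8_t>(hi * 16 + lo);
    }

    for (std::size_t i = 0; i < 10; ++i) {
        if (gid[i] != 0) return "";
    }
    if (gid[10] != 0xff || gid[11] != 0xff) return "";

    return std::to_string(static_cast<int>(gid[12])) + "." +
           std::to_string(static_cast<int>(gid[13])) + "." +
           std::to_string(static_cast<int>(gid[14])) + "." +
           std::to_string(static_cast<int>(gid[15]));
}

// Checks the two GIDs that appear in the logs of this thread.
inline void check_logged_gids() {
    assert(gid_to_ipv4("00:00:00:00:00:00:00:00:00:00:ff:ff:64:64:64:02") ==
           "100.100.100.2");
    assert(gid_to_ipv4("00:00:00:00:00:00:00:00:00:00:ff:ff:64:64:64:01") ==
           "100.100.100.1");
}
```

A GID that decodes to the expected interface address suggests the GID index is not the source of the failure, although it does not rule out other configuration problems.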
Then, on PC1, I run the following commands:
export MC_IB_PORT=1
export MC_GID_INDEX=1
export MC_TE_METRIC=1
./transfer_engine_bench --metadata_server=P2PHANDSHAKE \
    --segment_id=100.100.100.2:15659 \
    --local_server_name=100.100.100.1:17510 \
    --device_name=rocep5s0f0
Output:
WARNING: Logging before InitGoogleLogging() is written to STDERR
I1201 14:13:27.688277 1782938 transfer_engine.cpp:91] Transfer Engine parseHostNameWithPort. server_name: 100.100.100.1 port: 17510
I1201 14:13:27.688314 1782938 transfer_engine.cpp:146] Transfer Engine RPC using P2P handshake, listening on 100.100.100.1:16204
I1201 14:13:27.688352 1782940 transfer_engine.cpp:493] Metrics reporting thread started (interval: 5s)
I1201 14:13:27.688405 1782938 rdma_context.cpp:77] Using SIEVE endpoint store
I1201 14:13:27.689312 1782938 rdma_context.cpp:586] Using user-specified GID index: 1 on rocep5s0f0/ (with network device)
I1201 14:13:27.689594 1782938 rdma_context.cpp:140] RDMA device: rocep5s0f0, LID: 1, GID: (GID_Index 1) 00:00:00:00:00:00:00:00:00:00:ff:ff:64:64:64:01
I1201 14:13:27.689607 1782938 transfer_engine_bench.cpp:359] DRAM is used, numa node num: 1
E1201 14:13:27.820377 1782948 rdma_endpoint.cpp:157] Invalid argument: received packet mismatch, local.local_nic_path: 100.100.100.1:16204@rocep5s0f0, local.peer_nic_path: 100.100.100.2:15659@rocep3s0f1, peer.local_nic_path: , peer.peer_nic_path:
E1201 14:13:27.820391 1782948 worker_pool.cpp:242] Worker: Cannot make connection for endpoint: 100.100.100.2:15659@rocep3s0f1, mark it inactive
I1201 14:13:27.826188 1783001 transfer_engine_bench.cpp:230] FAILED
I1201 14:13:27.826193 1783002 transfer_engine_bench.cpp:230] FAILED
W1201 14:13:27.826195 1783001 transport.h:235] detected slice leak: allocated 128 freed 0
W1201 14:13:27.826205 1783002 transport.h:235] detected slice leak: allocated 128 freed 0
PC2 shows:
E1201 14:13:27.800498 31634 rdma_endpoint.cpp:385] [Handshake] Failed to modify QP to INIT, check local context port num: Invalid argument [22]
If I instead pass segment_id=100.100.100.2:17999 on PC1, i.e.
./transfer_engine_bench --metadata_server=P2PHANDSHAKE \
    --segment_id=100.100.100.2:17999 \
    --local_server_name=100.100.100.1:17510 \
    --device_name=rocep5s0f0
PC1 shows:
WARNING: Logging before InitGoogleLogging() is written to STDERR
I1201 14:15:11.417858 1796146 transfer_engine.cpp:91] Transfer Engine parseHostNameWithPort. server_name: 100.100.100.1 port: 17510
I1201 14:15:11.417892 1796146 transfer_engine.cpp:146] Transfer Engine RPC using P2P handshake, listening on 100.100.100.1:15645
I1201 14:15:11.418186 1796146 rdma_context.cpp:77] Using SIEVE endpoint store
I1201 14:15:11.418303 1796148 transfer_engine.cpp:493] Metrics reporting thread started (interval: 5s)
I1201 14:15:11.419322 1796146 rdma_context.cpp:586] Using user-specified GID index: 1 on rocep5s0f0/ (with network device)
I1201 14:15:11.419592 1796146 rdma_context.cpp:140] RDMA device: rocep5s0f0, LID: 1, GID: (GID_Index 1) 00:00:00:00:00:00:00:00:00:00:ff:ff:64:64:64:01
I1201 14:15:11.419606 1796146 transfer_engine_bench.cpp:359] DRAM is used, numa node num: 1
E1201 14:15:11.545650 1796146 transfer_metadata_plugin.cpp:909] SocketHandShakePlugin: connect()100.100.100.2:17999: Connection refused [111]
E1201 14:15:11.545794 1796209 transfer_engine_bench.cpp:194] Unable to get target segment ID, please recheck
E1201 14:15:11.545843 1796211 transfer_engine_bench.cpp:194] Unable to get target segment ID, please recheck
E1201 14:15:11.545859 1796210 transfer_engine_bench.cpp:194] Unable to get target segment ID, please recheck
transfer_engine_bench also works fine when it is told to use the TCP protocol.
Before submitting a new issue...
- [x] Make sure you already searched for relevant issues and read the documentation
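In P2PHANDSHAKE mode, the target picks a random port, which is printed in the line I1201 14:13:27.688314 1782938 transfer_engine.cpp:146] Transfer Engine RPC using P2P handshake, listening on 100.100.100.1:16204 (that is, 100.100.100.1:16204). Use that value for --segment_id on the initiator side. Also, you do not need to give a port in --local_server_name. (If the two ends are not on the same machine, it is best not to specify this parameter.)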
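To make the advice above concrete, here is a minimal sketch that pulls the host:port out of the target's "listening on" log line and turns it into the initiator's --segment_id argument. It assumes the log wording stays exactly as shown in this thread. The helper names extract_listen_address, segment_id_flag and check_thread_example are hypothetical and are not part of Mooncake.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: returns the "host:port" that follows "listening on "
// in a log line, or an empty string if the line has no such address.
inline std::string extract_listen_address(const std::string& log_line) {
    const std::string marker = "listening on ";
    const std::size_t found = log_line.find(marker);
    if (found == std::string::npos) return "";

    // The address runs from the end of the marker to the next whitespace.
    const std::size_t begin = found + marker.size();
    const std::size_t end = log_line.find_first_of(" \t\r\n", begin);
    const std::string address = log_line.substr(
        begin, end == std::string::npos ? std::string::npos : end - begin);

    // Require a non-empty host, a colon, and an all-digit port.
    const std::size_t colon = address.rfind(':');
    if (colon == std::string::npos || colon == 0 ||
        colon + 1 == address.size()) {
        return "";
    }
    for (std::size_t i = colon + 1; i < address.size(); ++i) {
        if (!std::isdigit(static_cast<unsigned char>(address[i]))) return "";
    }
    return address;
}

// Builds the initiator-side argument from the extracted address.
inline std::string segment_id_flag(const std::string& address) {
    return "--segment_id=" + address;
}

// Uses a target log line quoted later in this thread.
inline void check_thread_example() {
    const std::string line =
        "I1202 09:07:40.743148 211549 transfer_engine.cpp:146] Transfer "
        "Engine RPC using P2P handshake, listening on 100.100.100.2:16973";
    assert(segment_id_flag(extract_listen_address(line)) ==
           "--segment_id=100.100.100.2:16973");
}
```

This is also consistent with the "Connection refused" seen when 100.100.100.2:17999 was used as the segment ID: 17999 was the port given in --local_server_name, whereas the target reported listening on a different port.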
On PC2:
./transfer_engine_bench --mode=target \
    --metadata_server=P2PHANDSHAKE \
    --local_server_name=100.100.100.2 \
    --device_name=rocep3s0f1
Then on PC1:
./transfer_engine_bench --metadata_server=P2PHANDSHAKE \
    --segment_id=100.100.100.2:16973 \
    --local_server_name=100.100.100.1 \
    --device_name=rocep5s0f0
PC2 shows:
WARNING: Logging before InitGoogleLogging() is written to STDERR
I1202 09:07:40.743103 211549 transfer_engine.cpp:91] Transfer Engine parseHostNameWithPort. server_name: 100.100.100.2 port: 12001
I1202 09:07:40.743147 211552 transfer_engine.cpp:493] Metrics reporting thread started (interval: 5s)
I1202 09:07:40.743148 211549 transfer_engine.cpp:146] Transfer Engine RPC using P2P handshake, listening on 100.100.100.2:16973
I1202 09:07:40.743278 211549 rdma_context.cpp:77] Using SIEVE endpoint store
I1202 09:07:40.744488 211549 rdma_context.cpp:586] Using user-specified GID index: 1 on rocep3s0f1/ (with network device)
I1202 09:07:40.744724 211549 rdma_context.cpp:140] RDMA device: rocep3s0f1, LID: 1, GID: (GID_Index 1) 00:00:00:00:00:00:00:00:00:00:ff:ff:64:64:64:02
I1202 09:07:40.744784 211549 transfer_engine_bench.cpp:484] DRAM is used, numa node num: 1
E1202 09:08:11.591005 211553 rdma_endpoint.cpp:385] [Handshake] Failed to modify QP to INIT, check local context port num: Invalid argument [22]
PC1 shows:
WARNING: Logging before InitGoogleLogging() is written to STDERR
I1202 09:08:11.468353 96190 transfer_engine.cpp:91] Transfer Engine parseHostNameWithPort. server_name: 100.100.100.1 port: 12001
I1202 09:08:11.468351 96192 transfer_engine.cpp:493] Metrics reporting thread started (interval: 5s)
I1202 09:08:11.468426 96190 transfer_engine.cpp:146] Transfer Engine RPC using P2P handshake, listening on 100.100.100.1:15682
I1202 09:08:11.468611 96190 rdma_context.cpp:77] Using SIEVE endpoint store
I1202 09:08:11.470599 96190 rdma_context.cpp:586] Using user-specified GID index: 1 on rocep5s0f0/ (with network device)
I1202 09:08:11.471163 96190 rdma_context.cpp:140] RDMA device: rocep5s0f0, LID: 1, GID: (GID_Index 1) 00:00:00:00:00:00:00:00:00:00:ff:ff:64:64:64:01
I1202 09:08:11.471196 96190 transfer_engine_bench.cpp:359] DRAM is used, numa node num: 1
E1202 09:08:11.602855 96199 rdma_endpoint.cpp:157] Invalid argument: received packet mismatch, local.local_nic_path: 100.100.100.1:15682@rocep5s0f0, local.peer_nic_path: 100.100.100.2:16973@rocep3s0f1, peer.local_nic_path: , peer.peer_nic_path:
E1202 09:08:11.602869 96199 worker_pool.cpp:242] Worker: Cannot make connection for endpoint: 100.100.100.2:16973@rocep3s0f1, mark it inactive
I1202 09:08:11.609903 96252 transfer_engine_bench.cpp:230] FAILED
I1202 09:08:11.609905 96253 transfer_engine_bench.cpp:230] FAILED
W1202 09:08:11.609911 96252 transport.h:235] detected slice leak: allocated 128 freed 0
I1202 09:08:11.609907 96254 transfer_engine_bench.cpp:230] FAILED
W1202 09:08:11.609915 96253 transport.h:235] detected slice leak: allocated 128 freed 0
W1202 09:08:11.609930 96254 transport.h:235] detected slice leak: allocated 128 freed 0
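The result seems to be the same when no port is given in --local_server_name.
I don't have a clear lead yet: the port ID being passed in looks correct, and the transition to INIT takes only a handful of parameters: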
memset(&attr, 0, sizeof(attr));
attr.qp_state = IBV_QPS_INIT;
attr.port_num = context_.portNum();
attr.pkey_index = 0;
attr.qp_access_flags = IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_READ |
                       IBV_ACCESS_REMOTE_WRITE | IBV_ACCESS_REMOTE_ATOMIC;
ret = ibv_modify_qp(
    qp, &attr,
    IBV_QP_STATE | IBV_QP_PKEY_INDEX | IBV_QP_PORT | IBV_QP_ACCESS_FLAGS);
if (ret) {
    std::string message = "Failed to modify QP to INIT, check local context port num";
    // ...
}
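I suggest comparing this against the information that ib_write_bw prints. Another possibility: could it be that your device does not support IBV_ACCESS_REMOTE_ATOMIC?
That may well be it: after swapping in a Mellanox NIC, everything works.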
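One way to guard against the suspicion above would be to request IBV_ACCESS_REMOTE_ATOMIC only when the device reports atomic support. The sketch below shows just that flag-selection logic. It mirrors the numeric values of the libibverbs constants so that it compiles without the library. In real code the capability would come from the atomic_cap field filled in by ibv_query_device, and the constants would come from <infiniband/verbs.h>. The namespace and function names are hypothetical, and the thread does not establish whether the E810 actually reports IBV_ATOMIC_NONE or whether this is the approach taken upstream.

```cpp
#include <cassert>

namespace sketch {

// Mirrors of enum ibv_access_flags values (assumption: taken from
// <infiniband/verbs.h>; use the real constants in production code).
const unsigned kAccessLocalWrite = 1u << 0;    // IBV_ACCESS_LOCAL_WRITE
const unsigned kAccessRemoteWrite = 1u << 1;   // IBV_ACCESS_REMOTE_WRITE
const unsigned kAccessRemoteRead = 1u << 2;    // IBV_ACCESS_REMOTE_READ
const unsigned kAccessRemoteAtomic = 1u << 3;  // IBV_ACCESS_REMOTE_ATOMIC

// Mirror of enum ibv_atomic_cap, as reported in ibv_device_attr::atomic_cap.
enum AtomicCap {
    kAtomicNone = 0,  // IBV_ATOMIC_NONE
    kAtomicHca = 1,   // IBV_ATOMIC_HCA
    kAtomicGlob = 2   // IBV_ATOMIC_GLOB
};

// Hypothetical helper: picks qp_access_flags for the INIT transition and
// adds the remote-atomic bit only if the device advertises atomics.
inline unsigned qp_access_flags_for(AtomicCap cap) {
    unsigned flags =
        kAccessLocalWrite | kAccessRemoteRead | kAccessRemoteWrite;
    if (cap != kAtomicNone) {
        flags |= kAccessRemoteAtomic;
    }
    return flags;
}

// Shows the two outcomes: without and with the remote-atomic bit.
inline void check_flag_selection() {
    assert((qp_access_flags_for(kAtomicNone) & kAccessRemoteAtomic) == 0u);
    assert((qp_access_flags_for(kAtomicHca) & kAccessRemoteAtomic) != 0u);
}

}  // namespace sketch
```

With a guard of this kind, a device without atomic support would still reach INIT with local-write, remote-read and remote-write access, at the cost of atomic operations being unavailable on that queue pair.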
The related issue: https://github.com/kvcache-ai/Mooncake/pull/722. FYI
@KaiK025 could you try to test this PR? If it works well, we can merge it.