[Handshake] Failed to modify QP to INIT
I am using Intel E810 RDMA NICs.
PC1's IP is 100.100.100.1:

    enp5s0f0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 4096
        inet 100.100.100.1 netmask 255.255.255.0 broadcast 0.0.0.0
        txqueuelen 1000 (Ethernet)
        RX packets 3748 bytes 289039 (289.0 KB)
        RX errors 0 dropped 0 overruns 0 frame 0
        TX packets 16847 bytes 66597484 (66.5 MB)
        TX errors 425 dropped 0 overruns 0 carrier 0 collisions 0

PC2's IP is 100.100.100.2:

    enp3s0f1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 4096
        inet 100.100.100.2 netmask 255.255.255.0 broadcast 0.0.0.0
        txqueuelen 1000 (Ethernet)
        RX packets 16681 bytes 66580672 (66.5 MB)
        RX errors 0 dropped 0 overruns 0 frame 0
        TX packets 3938 bytes 310487 (310.4 KB)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
PC1's NIC information (ibstat):

    CA 'rocep5s0f0'
        CA type:
        Number of ports: 1
        Firmware version: 1.74
        Hardware version:
        Node GUID: 0x527c6ffffe3c3b56
        System image GUID: 0x527c6ffffe3c3b56
        Port 1:
            State: Active
            Physical state: LinkUp
            Rate: 10 (FDR10)
            Base lid: 1
            LMC: 0
            SM lid: 0
            Capability mask: 0x00050000
            Port GUID: 0x527c6ffffe3c3b56
            Link layer: Ethernet

PC2's NIC information:

    CA 'rocep3s0f1'
        CA type:
        Number of ports: 1
        Firmware version: 1.74
        Hardware version:
        Node GUID: 0x527c6ffffe1b5f2f
        System image GUID: 0x527c6ffffe1b5f2f
        Port 1:
            State: Active
            Physical state: LinkUp
            Rate: 10 (FDR10)
            Base lid: 1
            LMC: 0
            SM lid: 0
            Capability mask: 0x00050000
            Port GUID: 0x527c6ffffe1b5f2f
            Link layer: Ethernet
Testing with the perftest ib_write_bw command works normally (the first command runs on PC2, the second on PC1):

    ib_write_bw -d rocep3s0f1
    ib_write_bw -d rocep5s0f0 -i 1 100.100.100.2 -n 1000 -s 65536
On PC2, I ran the following commands:

    export MC_IB_PORT=1
    export MC_GID_INDEX=1
    export MC_TE_METRIC=1
    ./transfer_engine_bench --mode=target \
        --metadata_server=P2PHANDSHAKE \
        --local_server_name=100.100.100.2:17999 \
        --device_name=rocep3s0f1
Output:

    WARNING: Logging before InitGoogleLogging() is written to STDERR
    I1201 14:12:43.102025 31475 transfer_engine.cpp:91] Transfer Engine parseHostNameWithPort. server_name: 100.100.100.2 port: 17999
    I1201 14:12:43.102074 31476 transfer_engine.cpp:493] Metrics reporting thread started (interval: 5s)
    I1201 14:12:43.102075 31475 transfer_engine.cpp:146] Transfer Engine RPC using P2P handshake, listening on 100.100.100.2:16470
    I1201 14:12:43.102236 31475 rdma_context.cpp:77] Using SIEVE endpoint store
    I1201 14:12:43.103399 31475 rdma_context.cpp:586] Using user-specified GID index: 1 on rocep3s0f1/ (with network device)
    I1201 14:12:43.103634 31475 rdma_context.cpp:140] RDMA device: rocep3s0f1, LID: 1, GID: (GID_Index 1) 00:00:00:00:00:00:00:00:00:00:ff:ff:64:64:64:02
    I1201 14:12:43.103649 31475 transfer_engine_bench.cpp:484] DRAM is used, numa node num: 1
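The GID reported at index 1 looks like an IPv4-mapped address: its last four bytes are the interface IP in hex (64:64:64:02 corresponds to 100.100.100.2). As a quick sanity check that the chosen GID index matches the NIC's IP, a minimal sketch follows. The helper name `gid_to_ipv4` is hypothetical and is not part of transfer_engine_bench or any RDMA tool.

```shell
# Hypothetical helper (not part of any tool): decode the last four bytes of a
# colon-separated 16-byte GID into a dotted IPv4 address.
# Assumes the GID is IPv4-mapped (bytes 11 and 12 are ff:ff).
gid_to_ipv4() (
  IFS=:
  # Split the GID into 16 positional parameters, one per byte.
  set -- $1
  # Drop the first 12 bytes, leaving the 4 address bytes.
  shift 12
  printf '%d.%d.%d.%d\n' "0x$1" "0x$2" "0x$3" "0x$4"
)

gid_to_ipv4 "00:00:00:00:00:00:00:00:00:00:ff:ff:64:64:64:02"
```

Running the same helper on the GID printed by PC1 should give that machine's IP if the GID index is the IPv4 entry on both sides.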
Then, on PC1, I ran the following commands:

    export MC_IB_PORT=1
    export MC_GID_INDEX=1
    export MC_TE_METRIC=1
    ./transfer_engine_bench --metadata_server=P2PHANDSHAKE \
        --segment_id=100.100.100.2:15659 \
        --local_server_name=100.100.100.1:17510 \
        --device_name=rocep5s0f0
Output:

    WARNING: Logging before InitGoogleLogging() is written to STDERR
    I1201 14:13:27.688277 1782938 transfer_engine.cpp:91] Transfer Engine parseHostNameWithPort. server_name: 100.100.100.1 port: 17510
    I1201 14:13:27.688314 1782938 transfer_engine.cpp:146] Transfer Engine RPC using P2P handshake, listening on 100.100.100.1:16204
    I1201 14:13:27.688352 1782940 transfer_engine.cpp:493] Metrics reporting thread started (interval: 5s)
    I1201 14:13:27.688405 1782938 rdma_context.cpp:77] Using SIEVE endpoint store
    I1201 14:13:27.689312 1782938 rdma_context.cpp:586] Using user-specified GID index: 1 on rocep5s0f0/ (with network device)
    I1201 14:13:27.689594 1782938 rdma_context.cpp:140] RDMA device: rocep5s0f0, LID: 1, GID: (GID_Index 1) 00:00:00:00:00:00:00:00:00:00:ff:ff:64:64:64:01
    I1201 14:13:27.689607 1782938 transfer_engine_bench.cpp:359] DRAM is used, numa node num: 1
    E1201 14:13:27.820377 1782948 rdma_endpoint.cpp:157] Invalid argument: received packet mismatch, local.local_nic_path: 100.100.100.1:16204@rocep5s0f0, local.peer_nic_path: 100.100.100.2:15659@rocep3s0f1, peer.local_nic_path: , peer.peer_nic_path:
    E1201 14:13:27.820391 1782948 worker_pool.cpp:242] Worker: Cannot make connection for endpoint: 100.100.100.2:15659@rocep3s0f1, mark it inactive
    I1201 14:13:27.826188 1783001 transfer_engine_bench.cpp:230] FAILED
    I1201 14:13:27.826193 1783002 transfer_engine_bench.cpp:230] FAILED
    W1201 14:13:27.826195 1783001 transport.h:235] detected slice leak: allocated 128 freed 0
    W1201 14:13:27.826205 1783002 transport.h:235] detected slice leak: allocated 128 freed 0
PC2 shows:

    E1201 14:13:27.800498 31634 rdma_endpoint.cpp:385] [Handshake] Failed to modify QP to INIT, check local context port num: Invalid argument [22]
If I instead use segment_id=100.100.100.2:17999 on PC1, i.e.

    ./transfer_engine_bench --metadata_server=P2PHANDSHAKE \
        --segment_id=100.100.100.2:17999 \
        --local_server_name=100.100.100.1:17510 \
        --device_name=rocep5s0f0
PC1 shows the following:

    WARNING: Logging before InitGoogleLogging() is written to STDERR
    I1201 14:15:11.417858 1796146 transfer_engine.cpp:91] Transfer Engine parseHostNameWithPort. server_name: 100.100.100.1 port: 17510
    I1201 14:15:11.417892 1796146 transfer_engine.cpp:146] Transfer Engine RPC using P2P handshake, listening on 100.100.100.1:15645
    I1201 14:15:11.418186 1796146 rdma_context.cpp:77] Using SIEVE endpoint store
    I1201 14:15:11.418303 1796148 transfer_engine.cpp:493] Metrics reporting thread started (interval: 5s)
    I1201 14:15:11.419322 1796146 rdma_context.cpp:586] Using user-specified GID index: 1 on rocep5s0f0/ (with network device)
    I1201 14:15:11.419592 1796146 rdma_context.cpp:140] RDMA device: rocep5s0f0, LID: 1, GID: (GID_Index 1) 00:00:00:00:00:00:00:00:00:00:ff:ff:64:64:64:01
    I1201 14:15:11.419606 1796146 transfer_engine_bench.cpp:359] DRAM is used, numa node num: 1
    E1201 14:15:11.545650 1796146 transfer_metadata_plugin.cpp:909] SocketHandShakePlugin: connect()100.100.100.2:17999: Connection refused [111]
    E1201 14:15:11.545794 1796209 transfer_engine_bench.cpp:194] Unable to get target segment ID, please recheck
    E1201 14:15:11.545843 1796211 transfer_engine_bench.cpp:194] Unable to get target segment ID, please recheck
    E1201 14:15:11.545859 1796210 transfer_engine_bench.cpp:194] Unable to get target segment ID, please recheck
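One detail visible in the logs: the target on PC2 reports "listening on 100.100.100.2:16470", which differs from both the 17999 passed in `--local_server_name` and the 15659 used as `--segment_id`. Assuming (this is not confirmed by anything quoted here) that in P2PHANDSHAKE mode the initiator's `--segment_id` should be the address from that log line, the following sketch pulls it out of the target's output. The helper name `listening_addr` is hypothetical.

```shell
# Hypothetical helper: read log text on stdin and print the first host:port
# that follows "P2P handshake, listening on".
listening_addr() {
  sed -n 's/.*P2P handshake, listening on \([0-9.]*:[0-9]*\).*/\1/p' | head -n 1
}

# Log line copied from the PC2 target output.
log='I1201 14:12:43.102075 31475 transfer_engine.cpp:146] Transfer Engine RPC using P2P handshake, listening on 100.100.100.2:16470'

addr=$(printf '%s\n' "$log" | listening_addr)
echo "$addr"
```

The resulting value could then be tried as `--segment_id` on PC1 while the same target process is still running, since the listening port changes between runs in the logs above.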
When transfer_engine_bench is told to use the TCP protocol, it also works normally.
Before submitting...
- [x] Ensure you searched for relevant issues and read the [documentation]