
[Usage]: [Handshake] Failed to modify QP to INIT

Open KaiK025 opened this issue 1 month ago • 6 comments

Describe your usage question

I am using Intel E810 RDMA NICs.

PC1's IP is 100.100.100.1:

    enp5s0f0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 4096
            inet 100.100.100.1  netmask 255.255.255.0  broadcast 0.0.0.0
            txqueuelen 1000  (Ethernet)
            RX packets 3748  bytes 289039 (289.0 KB)
            RX errors 0  dropped 0  overruns 0  frame 0
            TX packets 16847  bytes 66597484 (66.5 MB)
            TX errors 425  dropped 0  overruns 0  carrier 0  collisions 0

PC2's IP is 100.100.100.2:

    enp3s0f1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 4096
            inet 100.100.100.2  netmask 255.255.255.0  broadcast 0.0.0.0
            txqueuelen 1000  (Ethernet)
            RX packets 16681  bytes 66580672 (66.5 MB)
            RX errors 0  dropped 0  overruns 0  frame 0
            TX packets 3938  bytes 310487 (310.4 KB)
            TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

NIC information on PC1 (ibstat):

    CA 'rocep5s0f0'
        CA type:
        Number of ports: 1
        Firmware version: 1.74
        Hardware version:
        Node GUID: 0x527c6ffffe3c3b56
        System image GUID: 0x527c6ffffe3c3b56
        Port 1:
            State: Active
            Physical state: LinkUp
            Rate: 10 (FDR10)
            Base lid: 1
            LMC: 0
            SM lid: 0
            Capability mask: 0x00050000
            Port GUID: 0x527c6ffffe3c3b56
            Link layer: Ethernet

NIC information on PC2:

    CA 'rocep3s0f1'
        CA type:
        Number of ports: 1
        Firmware version: 1.74
        Hardware version:
        Node GUID: 0x527c6ffffe1b5f2f
        System image GUID: 0x527c6ffffe1b5f2f
        Port 1:
            State: Active
            Physical state: LinkUp
            Rate: 10 (FDR10)
            Base lid: 1
            LMC: 0
            SM lid: 0
            Capability mask: 0x00050000
            Port GUID: 0x527c6ffffe1b5f2f
            Link layer: Ethernet

Tests with perftest's ib_write_bw all run normally:

    ib_write_bw -d rocep3s0f1
    ib_write_bw -d rocep5s0f0 -i 1 100.100.100.2 -n 1000 -s 65536

On PC2, I ran the following commands:

    export MC_IB_PORT=1
    export MC_GID_INDEX=1
    export MC_TE_METRIC=1
    ./transfer_engine_bench --mode=target \
        --metadata_server=P2PHANDSHAKE \
        --local_server_name=100.100.100.2:17999 \
        --device_name=rocep3s0f1

Output:

    WARNING: Logging before InitGoogleLogging() is written to STDERR
    I1201 14:12:43.102025 31475 transfer_engine.cpp:91] Transfer Engine parseHostNameWithPort. server_name: 100.100.100.2 port: 17999
    I1201 14:12:43.102074 31476 transfer_engine.cpp:493] Metrics reporting thread started (interval: 5s)
    I1201 14:12:43.102075 31475 transfer_engine.cpp:146] Transfer Engine RPC using P2P handshake, listening on 100.100.100.2:16470
    I1201 14:12:43.102236 31475 rdma_context.cpp:77] Using SIEVE endpoint store
    I1201 14:12:43.103399 31475 rdma_context.cpp:586] Using user-specified GID index: 1 on rocep3s0f1/ (with network device)
    I1201 14:12:43.103634 31475 rdma_context.cpp:140] RDMA device: rocep3s0f1, LID: 1, GID: (GID_Index 1) 00:00:00:00:00:00:00:00:00:00:ff:ff:64:64:64:02
    I1201 14:12:43.103649 31475 transfer_engine_bench.cpp:484] DRAM is used, numa node num: 1
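A side check on the log above: the GID printed for GID index 1 has the shape of an IPv4-mapped address (ten zero bytes, then ff:ff, then four address bytes), and its last four bytes 64:64:64:02 are 100.100.100.2 in hex, which matches the interface address. The sketch below does that decoding; `gidToIPv4` is a hypothetical helper written for this thread, not a Mooncake or libibverbs function.

```cpp
#include <array>
#include <cassert>
#include <cctype>
#include <cstddef>
#include <optional>
#include <sstream>
#include <string>

// Hypothetical helper (not part of Mooncake or libibverbs).
// Parses a GID written as 16 colon-separated hex bytes, as in the log line,
// and returns the dotted IPv4 address if the GID is IPv4-mapped
// (bytes 0..9 are zero, bytes 10..11 are 0xff). Otherwise returns nullopt.
std::optional<std::string> gidToIPv4(const std::string& gid) {
    std::array<unsigned, 16> bytes{};
    std::istringstream in(gid);
    std::string token;
    std::size_t count = 0;
    while (std::getline(in, token, ':')) {
        const bool two_hex_digits =
            token.size() == 2 &&
            std::isxdigit(static_cast<unsigned char>(token[0])) &&
            std::isxdigit(static_cast<unsigned char>(token[1]));
        if (count >= bytes.size() || !two_hex_digits) return std::nullopt;
        bytes[count++] = static_cast<unsigned>(std::stoul(token, nullptr, 16));
    }
    if (count != bytes.size()) return std::nullopt;
    for (std::size_t i = 0; i < 10; ++i) {
        if (bytes[i] != 0) return std::nullopt;  // not IPv4-mapped
    }
    if (bytes[10] != 0xff || bytes[11] != 0xff) return std::nullopt;
    return std::to_string(bytes[12]) + "." + std::to_string(bytes[13]) + "." +
           std::to_string(bytes[14]) + "." + std::to_string(bytes[15]);
}
```

Calling `gidToIPv4("00:00:00:00:00:00:00:00:00:00:ff:ff:64:64:64:02")` yields `100.100.100.2`, so the GID selected by MC_GID_INDEX=1 is at least consistent with the configured IP. This only checks the address, not which RoCE version the GID entry uses.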

Then, on PC1, I ran the following commands:

    export MC_IB_PORT=1
    export MC_GID_INDEX=1
    export MC_TE_METRIC=1
    ./transfer_engine_bench --metadata_server=P2PHANDSHAKE \
        --segment_id=100.100.100.2:15659 \
        --local_server_name=100.100.100.1:17510 \
        --device_name=rocep5s0f0

Output:

    WARNING: Logging before InitGoogleLogging() is written to STDERR
    I1201 14:13:27.688277 1782938 transfer_engine.cpp:91] Transfer Engine parseHostNameWithPort. server_name: 100.100.100.1 port: 17510
    I1201 14:13:27.688314 1782938 transfer_engine.cpp:146] Transfer Engine RPC using P2P handshake, listening on 100.100.100.1:16204
    I1201 14:13:27.688352 1782940 transfer_engine.cpp:493] Metrics reporting thread started (interval: 5s)
    I1201 14:13:27.688405 1782938 rdma_context.cpp:77] Using SIEVE endpoint store
    I1201 14:13:27.689312 1782938 rdma_context.cpp:586] Using user-specified GID index: 1 on rocep5s0f0/ (with network device)
    I1201 14:13:27.689594 1782938 rdma_context.cpp:140] RDMA device: rocep5s0f0, LID: 1, GID: (GID_Index 1) 00:00:00:00:00:00:00:00:00:00:ff:ff:64:64:64:01
    I1201 14:13:27.689607 1782938 transfer_engine_bench.cpp:359] DRAM is used, numa node num: 1
    E1201 14:13:27.820377 1782948 rdma_endpoint.cpp:157] Invalid argument: received packet mismatch, local.local_nic_path: 100.100.100.1:16204@rocep5s0f0, local.peer_nic_path: 100.100.100.2:15659@rocep3s0f1, peer.local_nic_path: , peer.peer_nic_path:
    E1201 14:13:27.820391 1782948 worker_pool.cpp:242] Worker: Cannot make connection for endpoint: 100.100.100.2:15659@rocep3s0f1, mark it inactive
    I1201 14:13:27.826188 1783001 transfer_engine_bench.cpp:230] FAILED
    I1201 14:13:27.826193 1783002 transfer_engine_bench.cpp:230] FAILED
    W1201 14:13:27.826195 1783001 transport.h:235] detected slice leak: allocated 128 freed 0
    W1201 14:13:27.826205 1783002 transport.h:235] detected slice leak: allocated 128 freed 0

PC2 shows:

    E1201 14:13:27.800498 31634 rdma_endpoint.cpp:385] [Handshake] Failed to modify QP to INIT, check local context port num: Invalid argument [22]

If PC1 instead uses segment_id=100.100.100.2:17999, i.e.

    ./transfer_engine_bench --metadata_server=P2PHANDSHAKE \
        --segment_id=100.100.100.2:17999 \
        --local_server_name=100.100.100.1:17510 \
        --device_name=rocep5s0f0

PC1 shows the following:

    WARNING: Logging before InitGoogleLogging() is written to STDERR
    I1201 14:15:11.417858 1796146 transfer_engine.cpp:91] Transfer Engine parseHostNameWithPort. server_name: 100.100.100.1 port: 17510
    I1201 14:15:11.417892 1796146 transfer_engine.cpp:146] Transfer Engine RPC using P2P handshake, listening on 100.100.100.1:15645
    I1201 14:15:11.418186 1796146 rdma_context.cpp:77] Using SIEVE endpoint store
    I1201 14:15:11.418303 1796148 transfer_engine.cpp:493] Metrics reporting thread started (interval: 5s)
    I1201 14:15:11.419322 1796146 rdma_context.cpp:586] Using user-specified GID index: 1 on rocep5s0f0/ (with network device)
    I1201 14:15:11.419592 1796146 rdma_context.cpp:140] RDMA device: rocep5s0f0, LID: 1, GID: (GID_Index 1) 00:00:00:00:00:00:00:00:00:00:ff:ff:64:64:64:01
    I1201 14:15:11.419606 1796146 transfer_engine_bench.cpp:359] DRAM is used, numa node num: 1
    E1201 14:15:11.545650 1796146 transfer_metadata_plugin.cpp:909] SocketHandShakePlugin: connect()100.100.100.2:17999: Connection refused [111]
    E1201 14:15:11.545794 1796209 transfer_engine_bench.cpp:194] Unable to get target segment ID, please recheck
    E1201 14:15:11.545843 1796211 transfer_engine_bench.cpp:194] Unable to get target segment ID, please recheck
    E1201 14:15:11.545859 1796210 transfer_engine_bench.cpp:194] Unable to get target segment ID, please recheck

transfer_engine_bench also works normally when it is told to use the TCP protocol.

Before submitting a new issue...

  • [x] Make sure you already searched for relevant issues and read the documentation

KaiK025 · Dec 01 '25 07:12

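In P2PHANDSHAKE mode, the target is assigned a random port number, which is shown in the line `I1201 14:13:27.688314 1782938 transfer_engine.cpp:146] Transfer Engine RPC using P2P handshake, listening on 100.100.100.1:16204` (i.e. 100.100.100.1:16204). Use that address as the value of --segment_id on the initiator side. Also, you do not need to specify a port number in --local_server_name. (If the two sides are not on the same machine, it is best not to specify this parameter at all.)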
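To make the rule concrete: the address to pass as --segment_id is the one the target process prints after "listening on". In the first run, PC2 (the target) logged `listening on 100.100.100.2:16470`; the line quoted above, with 100.100.100.1:16204, comes from PC1's log. A minimal sketch of extracting that address from a log line follows. `listeningAddress` is a hypothetical helper for illustration, and the regular expression assumes the log wording shown in this thread.

```cpp
#include <cassert>
#include <optional>
#include <regex>
#include <string>

// Hypothetical helper (not part of Mooncake): returns the "ip:port" that
// follows "listening on " in a Transfer Engine log line, or nullopt if the
// line does not contain one. Assumes an IPv4 address, as in this thread.
std::optional<std::string> listeningAddress(const std::string& log_line) {
    static const std::regex pattern(
        R"(listening on (\d{1,3}(?:\.\d{1,3}){3}:\d+))");
    std::smatch match;
    if (std::regex_search(log_line, match, pattern)) {
        return match[1].str();
    }
    return std::nullopt;
}
```

Fed the target's line from the first run, `listeningAddress(...)` returns `100.100.100.2:16470`, which is what the initiator would pass as `--segment_id=100.100.100.2:16470`. The port changes on every start of the target, so it has to be read again after each restart.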

alogfans · Dec 01 '25 13:12

On PC2:

    ./transfer_engine_bench --mode=target \
        --metadata_server=P2PHANDSHAKE \
        --local_server_name=100.100.100.2 \
        --device_name=rocep3s0f1

Then on PC1:

    ./transfer_engine_bench --metadata_server=P2PHANDSHAKE \
        --segment_id=100.100.100.2:16973 \
        --local_server_name=100.100.100.1 \
        --device_name=rocep5s0f0

PC2 shows:

    WARNING: Logging before InitGoogleLogging() is written to STDERR
    I1202 09:07:40.743103 211549 transfer_engine.cpp:91] Transfer Engine parseHostNameWithPort. server_name: 100.100.100.2 port: 12001
    I1202 09:07:40.743147 211552 transfer_engine.cpp:493] Metrics reporting thread started (interval: 5s)
    I1202 09:07:40.743148 211549 transfer_engine.cpp:146] Transfer Engine RPC using P2P handshake, listening on 100.100.100.2:16973
    I1202 09:07:40.743278 211549 rdma_context.cpp:77] Using SIEVE endpoint store
    I1202 09:07:40.744488 211549 rdma_context.cpp:586] Using user-specified GID index: 1 on rocep3s0f1/ (with network device)
    I1202 09:07:40.744724 211549 rdma_context.cpp:140] RDMA device: rocep3s0f1, LID: 1, GID: (GID_Index 1) 00:00:00:00:00:00:00:00:00:00:ff:ff:64:64:64:02
    I1202 09:07:40.744784 211549 transfer_engine_bench.cpp:484] DRAM is used, numa node num: 1
    E1202 09:08:11.591005 211553 rdma_endpoint.cpp:385] [Handshake] Failed to modify QP to INIT, check local context port num: Invalid argument [22]

PC1 shows:

    WARNING: Logging before InitGoogleLogging() is written to STDERR
    I1202 09:08:11.468353 96190 transfer_engine.cpp:91] Transfer Engine parseHostNameWithPort. server_name: 100.100.100.1 port: 12001
    I1202 09:08:11.468351 96192 transfer_engine.cpp:493] Metrics reporting thread started (interval: 5s)
    I1202 09:08:11.468426 96190 transfer_engine.cpp:146] Transfer Engine RPC using P2P handshake, listening on 100.100.100.1:15682
    I1202 09:08:11.468611 96190 rdma_context.cpp:77] Using SIEVE endpoint store
    I1202 09:08:11.470599 96190 rdma_context.cpp:586] Using user-specified GID index: 1 on rocep5s0f0/ (with network device)
    I1202 09:08:11.471163 96190 rdma_context.cpp:140] RDMA device: rocep5s0f0, LID: 1, GID: (GID_Index 1) 00:00:00:00:00:00:00:00:00:00:ff:ff:64:64:64:01
    I1202 09:08:11.471196 96190 transfer_engine_bench.cpp:359] DRAM is used, numa node num: 1
    E1202 09:08:11.602855 96199 rdma_endpoint.cpp:157] Invalid argument: received packet mismatch, local.local_nic_path: 100.100.100.1:15682@rocep5s0f0, local.peer_nic_path: 100.100.100.2:16973@rocep3s0f1, peer.local_nic_path: , peer.peer_nic_path:
    E1202 09:08:11.602869 96199 worker_pool.cpp:242] Worker: Cannot make connection for endpoint: 100.100.100.2:16973@rocep3s0f1, mark it inactive
    I1202 09:08:11.609903 96252 transfer_engine_bench.cpp:230] FAILED
    I1202 09:08:11.609905 96253 transfer_engine_bench.cpp:230] FAILED
    W1202 09:08:11.609911 96252 transport.h:235] detected slice leak: allocated 128 freed 0
    I1202 09:08:11.609907 96254 transfer_engine_bench.cpp:230] FAILED
    W1202 09:08:11.609915 96253 transport.h:235] detected slice leak: allocated 128 freed 0
    W1202 09:08:11.609930 96254 transport.h:235] detected slice leak: allocated 128 freed 0

Leaving the port number out of --local_server_name seems to give the same result.

KaiK025 · Dec 02 '25 01:12

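I don't have a clear lead yet, because the port id being passed in appears to be correct, and the transition to the INIT state takes only a handful of parameters: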

    memset(&attr, 0, sizeof(attr));
    attr.qp_state = IBV_QPS_INIT;
    attr.port_num = context_.portNum();
    attr.pkey_index = 0;
    attr.qp_access_flags = IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_READ |
                           IBV_ACCESS_REMOTE_WRITE | IBV_ACCESS_REMOTE_ATOMIC;
    ret = ibv_modify_qp(
        qp, &attr,
        IBV_QP_STATE | IBV_QP_PKEY_INDEX | IBV_QP_PORT | IBV_QP_ACCESS_FLAGS);
    if (ret) {
        std::string message = "Failed to modify QP to INIT, check local context port num";
        // ...
    }

I suggest comparing these against the information that ib_write_bw prints. Another possibility: could it be that your device does not support IBV_ACCESS_REMOTE_ATOMIC?
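One way to test that hypothesis is to request IBV_ACCESS_REMOTE_ATOMIC only when the device reports atomic support. The sketch below shows just the flag selection. The constants mirror the values of `enum ibv_access_flags` in `<infiniband/verbs.h>` so that the snippet compiles without libibverbs, and `initAccessFlags` is a hypothetical helper, not Mooncake's code. Whether this is the approach taken by any Mooncake change is not stated in this thread.

```cpp
#include <cassert>

// Values mirrored from enum ibv_access_flags in <infiniband/verbs.h>.
// Real code should include that header and use the IBV_ACCESS_* names.
constexpr unsigned kAccessLocalWrite   = 1u << 0;  // IBV_ACCESS_LOCAL_WRITE
constexpr unsigned kAccessRemoteWrite  = 1u << 1;  // IBV_ACCESS_REMOTE_WRITE
constexpr unsigned kAccessRemoteRead   = 1u << 2;  // IBV_ACCESS_REMOTE_READ
constexpr unsigned kAccessRemoteAtomic = 1u << 3;  // IBV_ACCESS_REMOTE_ATOMIC

// Hypothetical helper: build qp_access_flags for the RESET -> INIT
// transition, adding the atomic bit only if the device supports atomics.
//
// With libibverbs, the boolean could be obtained roughly like this
// (assumption: ctx is the opened struct ibv_context*):
//   struct ibv_device_attr dev_attr;
//   bool atomics = ibv_query_device(ctx, &dev_attr) == 0 &&
//                  dev_attr.atomic_cap != IBV_ATOMIC_NONE;
constexpr unsigned initAccessFlags(bool device_supports_atomics) {
    unsigned flags = kAccessLocalWrite | kAccessRemoteRead | kAccessRemoteWrite;
    if (device_supports_atomics) {
        flags |= kAccessRemoteAtomic;
    }
    return flags;
}
```

In the snippet quoted above, this would replace the hard-coded right-hand side of `attr.qp_access_flags`. If the INIT transition then succeeds on the E810 with the atomic bit cleared, that points at the flag rather than at `port_num`.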

alogfans · Dec 02 '25 06:12

That may well be it: after switching to a Mellanox NIC, it works.

KaiK025 · Dec 02 '25 06:12

FYI, the related PR: https://github.com/kvcache-ai/Mooncake/pull/722

stmatengss · Dec 02 '25 16:12

@KaiK025 Could you test this PR? If it works well, we can merge it.

stmatengss · Dec 02 '25 16:12