Fix for incorrect RDMA GID index selection by BRPC applications in multi-container scenarios

Open sunce4t opened this issue 2 months ago • 5 comments

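Describe the bug

Currently, when BRPC selects a GID for an RDMA device, it first calls ibv_query_port to get the size of the GID table (gid_tbl_len), then calls ibv_query_gid to walk the table in reverse, from the highest index to the lowest, and picks the first usable GID.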

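The problem is that when a container needs RDMA capability, a new device has to be created with ip link add and given an IP address, with the ipvlan master set to the primary NIC eth0. The system then dynamically adds new entries to the GID table of the host's RDMA device. When multiple containers are deployed on a single host, every container adds its entries to the same GID table.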

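If the user has not explicitly specified gid_index, BRPC's generic scan walks the entire GID table. Because this logic has no awareness of container-level isolation, one container may wrongly select a GID that belongs to another container, which produces the following errors at initialization: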

W1017 15:55:46.241584 10770 4294969345 src/brpc/rdma/rdma_endpoint.cpp:1237 BringUpQp] Fail to modify QP from INIT to RTR: No such device
W1017 15:55:46.241602 10770 4294969345 src/brpc/rdma/rdma_endpoint.cpp:520 ProcessHandshakeAtClient] Fail to bringup QP, fallback to tcp:brpc::Socket

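The error is not limited to use inside containers: as long as at least one container exists on the host, using BRPC on the host itself hits the same problem.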

To Reproduce

Start multiple such containers on any host and run the test program under rdma_performance. As shown below, after starting the first container, the GID index in use is 5:

./client --servers=xxxxx:12000 --attachment_size=5000 --rdma_zerocopy_min_size=16 --rpc_timeout_ms=2000
I1017 15:52:13.284984 10684     0 src/brpc/rdma/rdma_helper.cpp:387 ReadRdmaDynamicLib] Successfully loaded libibverbs.so
I1017 15:52:13.291336 10684     0 src/brpc/rdma/rdma_helper.cpp:529 GlobalRdmaInitializeOrDieImpl] RDMA device: mlx5_0
I1017 15:52:13.291347 10684     0 src/brpc/rdma/rdma_helper.cpp:531 GlobalRdmaInitializeOrDieImpl] RDMA LID: 0
I1017 15:52:13.291438 10684     0 src/brpc/rdma/rdma_helper.cpp:536 GlobalRdmaInitializeOrDieImpl] RDMA GID Index: 5
I1017 15:52:13.292021 10684     0 src/brpc/rdma/block_pool.cpp:214 ExtendBlockPool] Start extend rdma memory 1024MB
I1017 15:52:14.061898 10684     0 src/brpc/server.cpp:1260 StartInternal] Server[DummyServerOf(./client)] is serving on port=8001.
[Threads: 1, Depth: 1, Attachment: 5000B, RDMA: yes, Echo: no]
Avg-Latency: 49, 90th-Latency: 52, 99th-Latency: 58, 99.9th-Latency: 63, Throughput: 94.1512MB/s, QPS: 19k, Server CPU-utilization: 53%, Client CPU-utilization: 21%

Start a second container on this machine and do nothing inside it, then run the test program again in the first container. The GID index in use is now 7:

./client --servers=xxxxx:12000 --attachment_size=5000 --rdma_zerocopy_min_size=16 --rpc_timeout_ms=2000
I1017 15:54:50.501079 10728     0 src/brpc/rdma/rdma_helper.cpp:387 ReadRdmaDynamicLib] Successfully loaded libibverbs.so
I1017 15:54:50.507048 10728     0 src/brpc/rdma/rdma_helper.cpp:529 GlobalRdmaInitializeOrDieImpl] RDMA device: mlx5_0
I1017 15:54:50.507060 10728     0 src/brpc/rdma/rdma_helper.cpp:531 GlobalRdmaInitializeOrDieImpl] RDMA LID: 0
I1017 15:54:50.507147 10728     0 src/brpc/rdma/rdma_helper.cpp:536 GlobalRdmaInitializeOrDieImpl] RDMA GID Index: 7
I1017 15:54:50.507764 10728     0 src/brpc/rdma/block_pool.cpp:214 ExtendBlockPool] Start extend rdma memory 1024MB
I1017 15:54:51.261788 10728     0 src/brpc/server.cpp:1260 StartInternal] Server[DummyServerOf(./client)] is serving on port=8001.
[Threads: 1, Depth: 1, Attachment: 5000B, RDMA: yes, Echo: no]
W1017 15:54:51.266231 10731 4294967297 src/brpc/rdma/rdma_endpoint.cpp:1237 BringUpQp] Fail to modify QP from INIT to RTR: No such device
W1017 15:54:51.266251 10731 4294967297 src/brpc/rdma/rdma_endpoint.cpp:520 ProcessHandshakeAtClient] Fail to bringup QP, fallback to tcp:brpc::Socket

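Our hosts have 4 GIDs by default, and each container start adds two more, so GID index 7 after starting the second container is as expected.

Start a third container on this machine and do nothing inside it, then run the test program again in the first container. The GID index in use is now 9: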

 ./client --servers=xxxxx:12000 --attachment_size=5000 --rdma_zerocopy_min_size=16 --rpc_timeout_ms=2000
I1017 15:55:45.475493 10759     0 src/brpc/rdma/rdma_helper.cpp:387 ReadRdmaDynamicLib] Successfully loaded libibverbs.so
I1017 15:55:45.481561 10759     0 src/brpc/rdma/rdma_helper.cpp:529 GlobalRdmaInitializeOrDieImpl] RDMA device: mlx5_0
I1017 15:55:45.481573 10759     0 src/brpc/rdma/rdma_helper.cpp:531 GlobalRdmaInitializeOrDieImpl] RDMA LID: 0
I1017 15:55:45.481661 10759     0 src/brpc/rdma/rdma_helper.cpp:536 GlobalRdmaInitializeOrDieImpl] RDMA GID Index: 9
I1017 15:55:45.482310 10759     0 src/brpc/rdma/block_pool.cpp:214 ExtendBlockPool] Start extend rdma memory 1024MB
I1017 15:55:46.237825 10759     0 src/brpc/server.cpp:1260 StartInternal] Server[DummyServerOf(./client)] is serving on port=8001.
[Threads: 1, Depth: 1, Attachment: 5000B, RDMA: yes, Echo: no]
W1017 15:55:46.241584 10770 4294969345 src/brpc/rdma/rdma_endpoint.cpp:1237 BringUpQp] Fail to modify QP from INIT to RTR: No such device
W1017 15:55:46.241602 10770 4294969345 src/brpc/rdma/rdma_endpoint.cpp:520 ProcessHandshakeAtClient] Fail to bringup QP, fallback to tcp:brpc::Socket
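
The index progression seen in the three runs above (5, 7, 9) can be reproduced with a small model of the GID table. This is only a sketch under stated assumptions: `GidEntry`, its `owner` field, `BuildTable` and `PickGidIndex` are hypothetical names, the table is modeled as exactly 4 host entries plus 2 per container (the counts reported in this issue), and the real table is a fixed-length array of `gid_tbl_len` slots with empty trailing entries, which the model omits.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Simplified stand-in for one GID table entry (not the libibverbs ibv_gid type).
struct GidEntry {
    uint64_t interface_id;  // 0 means the slot is empty
    int owner;              // hypothetical tag: 0 = host, N = Nth container
};

// Builds a table with 4 host entries plus 2 entries per container.
std::vector<GidEntry> BuildTable(int num_containers) {
    assert(num_containers >= 0);
    std::vector<GidEntry> table;
    for (int i = 0; i < 4; ++i) {
        table.push_back(GidEntry{static_cast<uint64_t>(i + 1), 0});
    }
    for (int c = 1; c <= num_containers; ++c) {
        for (int k = 0; k < 2; ++k) {
            table.push_back(GidEntry{static_cast<uint64_t>(100 * c + k), c});
        }
    }
    return table;
}

// Reverse scan: returns the highest index whose entry is non-empty, or -1.
int PickGidIndex(const std::vector<GidEntry>& table) {
    for (int i = static_cast<int>(table.size()) - 1; i >= 0; --i) {
        if (table[i].interface_id != 0) {
            return i;
        }
    }
    return -1;
}
```

In this model, `PickGidIndex(BuildTable(2))` lands on an entry whose `owner` is 2, even when the caller is the first container, which is the mismatch described in this report.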

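Expected behavior

When selecting a GID, BRPC should exclude GIDs that do not belong to the current container. Whether a GID belongs to the container can be checked with:

cat /sys/class/infiniband/{device_name}/ports/{port_num}/gids/{gid_index}

Unlike the GIDs returned by ibv_query_gid, this directory does not expose the GIDs of other containers.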
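
As a rough illustration of that sysfs lookup, the sketch below builds the path and reads the GID string from it. The helper names (`SysfsGidPath`, `ReadGidFile`, `IsZeroGid`) are hypothetical and not part of brpc. The sysfs root is passed in as a parameter, an assumption made here so the helpers can be exercised on a machine without RDMA hardware.

```cpp
#include <cassert>
#include <fstream>
#include <string>

// Builds <root>/<device>/ports/<port>/gids/<index>.
std::string SysfsGidPath(const std::string& root, const std::string& device,
                         int port, int index) {
    assert(port >= 0 && index >= 0);
    return root + "/" + device + "/ports/" + std::to_string(port) +
           "/gids/" + std::to_string(index);
}

// Reads the first line of a GID file into *gid.
// Returns false if the file cannot be opened or is empty.
bool ReadGidFile(const std::string& path, std::string* gid) {
    std::ifstream file(path.c_str());
    if (!file.is_open()) {
        return false;
    }
    return static_cast<bool>(std::getline(file, *gid));
}

// True for the all-zero GID string.
bool IsZeroGid(const std::string& gid) {
    return gid == "0000:0000:0000:0000:0000:0000:0000:0000";
}
```

A caller would combine the three, for example reading `SysfsGidPath("/sys/class/infiniband", "mlx5_0", 1, 5)` and skipping the index when the read fails or `IsZeroGid` returns true.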

We added a patch that filters out GIDs not belonging to the current container:

diff --git a/src/brpc/rdma/rdma_helper.cpp b/src/brpc/rdma/rdma_helper.cpp
index 9bad3375..2106eaf2 100644
--- a/src/brpc/rdma/rdma_helper.cpp
+++ b/src/brpc/rdma/rdma_helper.cpp
@@ -21,6 +21,7 @@
 #include <pthread.h>
 #include <stdlib.h>
 #include <vector>
+#include <fstream>
 #include <gflags/gflags.h>
 #include "butil/containers/flat_map.h"            // butil::FlatMap
 #include "butil/fd_guard.h"
@@ -216,6 +217,30 @@ static void FindRdmaLid() {
     return;
 }

+
+static int IsSelfGid(const std::string& device_name, int port_num, int gid_index) {
+    std::string path = "/sys/class/infiniband/" + device_name +
+                      "/ports/" + std::to_string(port_num) +
+                      "/gids/" + std::to_string(gid_index);
+
+    std::ifstream file(path);
+    if (!file.is_open()) {
+        return -1;
+    }
+
+    std::string line;
+    if (!std::getline(file, line)) {
+       return -2;
+    }
+
+    if (line == "0000:0000:0000:0000:0000:0000:0000:0000" ||
+        line == "::" ||
+        line == "0000:0000:0000:0000:0000:ffff:0000:0000" ) {
+        return 1;
+    }
+    return 0;
+}
+
 static bool FindRdmaGid(ibv_context* context) {
     bool found = false;
     for (int i = g_gid_tbl_len - 1; i >= 0; --i) {
@@ -223,14 +248,23 @@ static bool FindRdmaGid(ibv_context* context) {
         if (IbvQueryGid(context, g_port_num, i, &gid) != 0) {
             continue;
         }
+
         if (gid.global.interface_id == 0) {
             continue;
         }
+
         if (FLAGS_rdma_gid_index == i) {
             g_gid = gid;
             g_gid_index = i;
             return true;
         }
+
+       const char* device_name_cstr = IbvGetDeviceName(context->device);
+       std::string device_name(device_name_cstr);
+       if(IsSelfGid(device_name, g_port_num, i) != 0) {
+           continue;
+       }
+

Versions

OS: linux 6.1.52-9
Compiler: not compiler-related
brpc: commit id 7229c3608f8cb98b24a0a2e7f99bc01d357d9312
protobuf: not protobuf-related

Additional context/screenshots

sunce4t avatar Oct 17 '25 08:10 sunce4t

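Would specifying FLAGS_rdma_device solve this problem? If not, we could define an rdma_device_gid_filter function that lets users implement their own filtering rule. The IsSelfGid rule works fine in your environment, but it is unclear whether it applies to other environments.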

yanglimingcn avatar Oct 20 '25 02:10 yanglimingcn
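
One possible shape for such a user-supplied filter is sketched below. Everything here is an assumption for illustration: brpc does not define `GidFilter` or `PickFilteredGidIndex`, and the `non_empty` vector merely stands in for "ibv_query_gid succeeded and interface_id is non-zero" so the sketch runs without RDMA hardware.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Hypothetical filter signature: return true to accept the GID at gid_index.
typedef std::function<bool(const std::string& device, int port, int gid_index)>
    GidFilter;

// Reverse scan that returns the highest index which is non-empty and
// accepted by the filter, or -1. An unset filter accepts every index.
int PickFilteredGidIndex(const std::vector<bool>& non_empty,
                         const std::string& device, int port,
                         const GidFilter& filter) {
    assert(port > 0);
    for (int i = static_cast<int>(non_empty.size()) - 1; i >= 0; --i) {
        if (!non_empty[i]) {
            continue;  // empty slot
        }
        if (filter && !filter(device, port, i)) {
            continue;  // rejected by the user-supplied rule
        }
        return i;
    }
    return -1;
}
```

With this shape, a sysfs-based rule like IsSelfGid becomes one filter among others, and platforms without sysfs can leave the filter unset to keep the current behavior.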

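@yanglimingcn Hi, this can be solved just by specifying FLAGS_rdma_gid_index, but that requires running show_gids beforehand to look up the container's own GID, which is more of a hassle for the application side. IsSelfGid simply implements the logic of show_gids. The show_gids script is shown below; it also determines which GIDs exist by inspecting the subdirectories under /sys/class/infiniband, so I think the two are roughly equivalent.

What do you mean by "other environments"?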

cat /usr/sbin/show_gids
#!/bin/bash
#
# Copyright (c) 2016 Mellanox Technologies. All rights reserved.
#
# This Software is licensed under one of the following licenses:
#
# 1) under the terms of the "Common Public License 1.0" a copy of which is
#    available from the Open Source Initiative, see
#    http://www.opensource.org/licenses/cpl.php.
#
# 2) under the terms of the "The BSD License" a copy of which is
#    available from the Open Source Initiative, see
#    http://www.opensource.org/licenses/bsd-license.php.
#
# 3) under the terms of the "GNU General Public License (GPL) Version 2" a
#    copy of which is available from the Open Source Initiative, see
#    http://www.opensource.org/licenses/gpl-license.php.
#
# Licensee has the right to choose one of the above licenses.
#
# Redistributions of source code must retain the above copyright
# notice and one of the license notices.
#
# Redistributions in binary form must reproduce both the above copyright
# notice, one of the license notices in the documentation
# and/or other materials provided with the distribution.
#
# Author: Moni Shoua <[email protected]>
#

black='\E[30;50m'
red='\E[31;50m'
green='\E[32;50m'
yellow='\E[33;50m'
blue='\E[34;50m'
magenta='\E[35;50m'
cyan='\E[36;50m'
white='\E[37;50m'

bold='\033[1m'

gid_count=0

# cecho (color echo) prints text in color.
# first parameter should be the desired color followed by text
function cecho ()
{
 echo -en $1
 shift
 echo -n $*
 tput sgr0
}

# becho (color echo) prints text in bold.
becho ()
{
 echo -en $bold
 echo -n $*
 tput sgr0
}

function print_gids()
{
 dev=$1
 port=$2
 for gf in /sys/class/infiniband/$dev/ports/$port/gids/* ; do
      gid=$(cat $gf);
      if [ $gid = 0000:0000:0000:0000:0000:0000:0000:0000 ] ; then
       continue
      fi
      echo -e $(basename $gf) "\t" $gid
 done
}

echo -e "DEV\tPORT\tINDEX\tGID\t\t\t\t\tIPv4  \t\tVER\tDEV"
echo -e "---\t----\t-----\t---\t\t\t\t\t------------  \t---\t---"
DEVS=$1
if [ -z "$DEVS" ] ; then
 DEVS=$(ls /sys/class/infiniband/)
fi
for d in $DEVS ; do
 for p in $(ls /sys/class/infiniband/$d/ports/) ; do
      for g in $(ls /sys/class/infiniband/$d/ports/$p/gids/) ; do
       gid=$(cat /sys/class/infiniband/$d/ports/$p/gids/$g);
       if [ $gid = 0000:0000:0000:0000:0000:0000:0000:0000 ] ; then
        continue
       fi
       if [ $gid = fe80:0000:0000:0000:0000:0000:0000:0000 ] ; then
        continue
       fi
       _ndev=$(cat /sys/class/infiniband/$d/ports/$p/gid_attrs/ndevs/$g 2>/dev/null)
       __type=$(cat /sys/class/infiniband/$d/ports/$p/gid_attrs/types/$g 2>/dev/null)
       _type=$(echo $__type| grep -o "[Vv].*")
       if [ $(echo $gid | cut -d ":" -f -1) = "0000" ] ; then
        ipv4=$(printf "%d.%d.%d.%d" 0x${gid:30:2} 0x${gid:32:2} 0x${gid:35:2} 0x${gid:37:2})
        echo -e "$d\t$p\t$g\t$gid\t$ipv4  \t$_type\t$_ndev"
       else
        echo -e "$d\t$p\t$g\t$gid\t\t\t$_type\t$_ndev"
       fi
       gid_count=$(expr 1 + $gid_count)
      done #g (gid)
 done #p (port)
done #d (dev)

echo n_gids_found=$gid_count

sunce4t avatar Oct 20 '25 03:10 sunce4t
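
For readers unfamiliar with the substring offsets in show_gids (`${gid:30:2}`, `${gid:32:2}`, `${gid:35:2}`, `${gid:37:2}`), the sketch below mirrors that IPv4 decoding in C++. The function names are hypothetical. Like the script, it treats a GID whose first group is "0000" as IPv4-derived, and it assumes the fully expanded 39-character form that sysfs prints.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Converts the two hex characters starting at `pos` into a number in [0, 255].
static int HexByte(const std::string& s, size_t pos) {
    assert(pos + 2 <= s.size());
    return static_cast<int>(std::strtol(s.substr(pos, 2).c_str(), NULL, 16));
}

// Returns the dotted IPv4 address embedded in a 39-character GID string,
// or an empty string when the length is wrong or the first group is not
// "0000" (the same test show_gids applies before printing an IPv4 column).
std::string GidToIpv4(const std::string& gid) {
    if (gid.size() != 39 || gid.compare(0, 4, "0000") != 0) {
        return std::string();
    }
    // The last two groups start at offsets 30 and 35; each holds two bytes.
    return std::to_string(HexByte(gid, 30)) + "." +
           std::to_string(HexByte(gid, 32)) + "." +
           std::to_string(HexByte(gid, 35)) + "." +
           std::to_string(HexByte(gid, 37));
}
```

For example, the GID string `0000:0000:0000:0000:0000:ffff:c0a8:0101` decodes to 192.168.1.1.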

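OK, I think putting it in a startup script would also work. By other environments I mean non-Linux environments such as macOS or Windows. There are also many kinds of Linux environments, and I am not sure this logic is compatible with all of them, so I think it is better to provide a more general, platform-independent approach.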

yanglimingcn avatar Oct 21 '25 06:10 yanglimingcn

That is a fair point.

sunce4t avatar Oct 21 '25 14:10 sunce4t

#include <fstream>

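// Returns 0 if the sysfs entry for this GID index holds a usable GID,
// 1 if it is all-zero or a bare link-local prefix (treated as not ours),
// and a negative value if the sysfs file cannot be read.
static int IsOthersGid(const std::string& device_name, int port_num, int gid_index) {
#ifdef __linux__
    std::string path = "/sys/class/infiniband/" + device_name +
                      "/ports/" + std::to_string(port_num) +
                      "/gids/" + std::to_string(gid_index);
    std::ifstream file(path);
    if (!file.is_open()) {
        return -1;
    }

    std::string line;
    if (!std::getline(file, line)) {
        return -2;
    }

    // Same two values that show_gids skips.
    if (line == "0000:0000:0000:0000:0000:0000:0000:0000" ||
        line == "fe80:0000:0000:0000:0000:0000:0000:0000") {
        return 1;
    }
    return 0;
#else
    // No sysfs outside Linux: accept every GID, as before the fix.
    LOG(INFO) << "Fix of FindRdmaGid currently only supports linux";
    return 0;
#endif
}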

static bool FindRdmaGid(ibv_context* context) {
    bool found = false;
    const char* device_name_cstr = IbvGetDeviceName(context->device);
    std::string device_name(device_name_cstr);
    for (int i = g_gid_tbl_len - 1; i >= 0; --i) {
        ibv_gid gid;
        if (IbvQueryGid(context, g_port_num, i, &gid) != 0) {
            continue;
        }
        if (gid.global.interface_id == 0) {
            continue;
        }
        if (FLAGS_rdma_gid_index == i) {
            g_gid = gid;
            g_gid_index = i;
            return true;
        }
        if (IsOthersGid(device_name, g_port_num, i) != 0) {
            continue;
        }
        // For infiniband, there is only one GID for each port.
        // For RoCE, there are 2 GIDs for each MAC and 2 GIDs for each IP.
        // Generally, the last GID is a RoCEv2-type GID generated by IP.
        if (!found) {
            g_gid = gid;
            g_gid_index = i;
            found = true;
        }
    }
    if (FLAGS_rdma_gid_index >= 0 && g_gid_index != FLAGS_rdma_gid_index) {
        found = false;
    }
    return found;
}

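@yanglimingcn In the end this is what we settled on. Our environment is fairly uniform: everything runs the linux 6.1.52-9 kernel with Mellanox NICs, and we have no other environments available, so we cannot do any further testing. One thing worth adding is the error messages. The side that selected the wrong GID reports:

fail to modify QP from INIT to RTR and fail to bringup QP ....

The other side of the connection reports: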

fail to negotiate with xxx

sunce4t avatar Oct 24 '25 05:10 sunce4t