Fix for incorrect RDMA GID index selection by BRPC applications in multi-container scenarios
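Describe the bug
Currently, when BRPC selects the GID of an RDMA device, it first calls ibv_query_port to get the size of the GID table (gid_tbl_len), then walks the table in reverse with ibv_query_gid, from the highest index to the lowest, and picks the first usable GID.
The problem is that when a container needs RDMA capability, a new device has to be created with ip link add and given an IP address, with the ipvlan master set to the primary NIC eth0. The system then dynamically adds new entries to the GID table of the host's RDMA device. When multiple containers are deployed on a single host, every container adds its entries to the same GID table.
In that situation, if the user has not explicitly specified gid_index, BRPC's generic traversal scans the whole GID table. Because this logic has no awareness of container-level isolation, one container may pick a GID that belongs to another container, which produces the following errors during initialization: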
W1017 15:55:46.241584 10770 4294969345 src/brpc/rdma/rdma_endpoint.cpp:1237 BringUpQp] Fail to modify QP from INIT to RTR: No such device
W1017 15:55:46.241602 10770 4294969345 src/brpc/rdma/rdma_endpoint.cpp:520 ProcessHandshakeAtClient] Fail to bringup QP, fallback to tcp:brpc::Socket
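The error is not limited to use inside containers: as long as at least one container exists on the host, using BRPC on the host itself hits the same problem.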
To Reproduce
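Start multiple containers that use RDMA on any host, then run the test program under rdma_performance.
As shown below, after the first container is started, the GID index in use is 5: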
./client --servers=xxxxx:12000 --attachment_size=5000 --rdma_zerocopy_min_size=16 --rpc_timeout_ms=2000
I1017 15:52:13.284984 10684 0 src/brpc/rdma/rdma_helper.cpp:387 ReadRdmaDynamicLib] Successfully loaded libibverbs.so
I1017 15:52:13.291336 10684 0 src/brpc/rdma/rdma_helper.cpp:529 GlobalRdmaInitializeOrDieImpl] RDMA device: mlx5_0
I1017 15:52:13.291347 10684 0 src/brpc/rdma/rdma_helper.cpp:531 GlobalRdmaInitializeOrDieImpl] RDMA LID: 0
I1017 15:52:13.291438 10684 0 src/brpc/rdma/rdma_helper.cpp:536 GlobalRdmaInitializeOrDieImpl] RDMA GID Index: 5
I1017 15:52:13.292021 10684 0 src/brpc/rdma/block_pool.cpp:214 ExtendBlockPool] Start extend rdma memory 1024MB
I1017 15:52:14.061898 10684 0 src/brpc/server.cpp:1260 StartInternal] Server[DummyServerOf(./client)] is serving on port=8001.
[Threads: 1, Depth: 1, Attachment: 5000B, RDMA: yes, Echo: no]
Avg-Latency: 49, 90th-Latency: 52, 99th-Latency: 58, 99.9th-Latency: 63, Throughput: 94.1512MB/s, QPS: 19k, Server CPU-utilization: 53%, Client CPU-utilization: 21%
Start a second container on this machine and do nothing inside it, then run the test program again in the first container. The GID index in use is now 7:
./client --servers=xxxxx:12000 --attachment_size=5000 --rdma_zerocopy_min_size=16 --rpc_timeout_ms=2000
I1017 15:54:50.501079 10728 0 src/brpc/rdma/rdma_helper.cpp:387 ReadRdmaDynamicLib] Successfully loaded libibverbs.so
I1017 15:54:50.507048 10728 0 src/brpc/rdma/rdma_helper.cpp:529 GlobalRdmaInitializeOrDieImpl] RDMA device: mlx5_0
I1017 15:54:50.507060 10728 0 src/brpc/rdma/rdma_helper.cpp:531 GlobalRdmaInitializeOrDieImpl] RDMA LID: 0
I1017 15:54:50.507147 10728 0 src/brpc/rdma/rdma_helper.cpp:536 GlobalRdmaInitializeOrDieImpl] RDMA GID Index: 7
I1017 15:54:50.507764 10728 0 src/brpc/rdma/block_pool.cpp:214 ExtendBlockPool] Start extend rdma memory 1024MB
I1017 15:54:51.261788 10728 0 src/brpc/server.cpp:1260 StartInternal] Server[DummyServerOf(./client)] is serving on port=8001.
[Threads: 1, Depth: 1, Attachment: 5000B, RDMA: yes, Echo: no]
W1017 15:54:51.266231 10731 4294967297 src/brpc/rdma/rdma_endpoint.cpp:1237 BringUpQp] Fail to modify QP from INIT to RTR: No such device
W1017 15:54:51.266251 10731 4294967297 src/brpc/rdma/rdma_endpoint.cpp:520 ProcessHandshakeAtClient] Fail to bringup QP, fallback to tcp:brpc::Socket
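Our hosts have 4 GIDs by default, and each container that is started adds two more, so GID index=7 after starting the second container is as expected.
Start a third container on this machine and do nothing inside it, then run the test program again in the first container. The GID index in use is now 9: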
./client --servers=xxxxx:12000 --attachment_size=5000 --rdma_zerocopy_min_size=16 --rpc_timeout_ms=2000
I1017 15:55:45.475493 10759 0 src/brpc/rdma/rdma_helper.cpp:387 ReadRdmaDynamicLib] Successfully loaded libibverbs.so
I1017 15:55:45.481561 10759 0 src/brpc/rdma/rdma_helper.cpp:529 GlobalRdmaInitializeOrDieImpl] RDMA device: mlx5_0
I1017 15:55:45.481573 10759 0 src/brpc/rdma/rdma_helper.cpp:531 GlobalRdmaInitializeOrDieImpl] RDMA LID: 0
I1017 15:55:45.481661 10759 0 src/brpc/rdma/rdma_helper.cpp:536 GlobalRdmaInitializeOrDieImpl] RDMA GID Index: 9
I1017 15:55:45.482310 10759 0 src/brpc/rdma/block_pool.cpp:214 ExtendBlockPool] Start extend rdma memory 1024MB
I1017 15:55:46.237825 10759 0 src/brpc/server.cpp:1260 StartInternal] Server[DummyServerOf(./client)] is serving on port=8001.
[Threads: 1, Depth: 1, Attachment: 5000B, RDMA: yes, Echo: no]
W1017 15:55:46.241584 10770 4294969345 src/brpc/rdma/rdma_endpoint.cpp:1237 BringUpQp] Fail to modify QP from INIT to RTR: No such device
W1017 15:55:46.241602 10770 4294969345 src/brpc/rdma/rdma_endpoint.cpp:520 ProcessHandshakeAtClient] Fail to bringup QP, fallback to tcp:brpc::Socket
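The index progression in the three runs above (5, 7, 9) is what the reverse traversal would produce if it always lands on the highest populated entry. A minimal sketch of that arithmetic, assuming the figures reported for this host (4 default GIDs, 2 more per started container) and assuming new entries are appended contiguously at the end of the table. `HighestGidIndex` is a hypothetical helper for illustration, not part of BRPC:

```cpp
// Hypothetical helper, not part of BRPC: index of the last populated GID entry,
// assuming entries are appended contiguously at the end of the table.
int HighestGidIndex(int host_gids, int gids_per_container, int containers) {
    int total = host_gids + gids_per_container * containers;
    return total - 1;  // Indexes start at 0; -1 means the table is empty.
}
```

With 4 host GIDs and 2 per container, one, two, and three containers give indexes 5, 7, and 9, matching the logs. This is consistent with the first run succeeding only because the highest entry happened to belong to the first container.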
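Expected behavior
When selecting a GID, BRPC should exclude GIDs that do not belong to the current container. This can be determined with
cat /sys/class/infiniband/{device_name}/ports/{port_num}/gids/{gid_index}
Unlike querying GIDs through ibv_query_gid, GIDs of other containers do not appear under this directory (in the patch below, such entries are detected by reading back an all-zero GID).
We added a patch that filters out GIDs not belonging to the current container, as follows: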
diff --git a/src/brpc/rdma/rdma_helper.cpp b/src/brpc/rdma/rdma_helper.cpp
index 9bad3375..2106eaf2 100644
--- a/src/brpc/rdma/rdma_helper.cpp
+++ b/src/brpc/rdma/rdma_helper.cpp
@@ -21,6 +21,7 @@
#include <pthread.h>
#include <stdlib.h>
#include <vector>
+#include <fstream>
#include <gflags/gflags.h>
#include "butil/containers/flat_map.h" // butil::FlatMap
#include "butil/fd_guard.h"
@@ -216,6 +217,30 @@ static void FindRdmaLid() {
return;
}
+
+static int IsSelfGid(const std::string& device_name, int port_num, int gid_index) {
+ std::string path = "/sys/class/infiniband/" + device_name +
+ "/ports/" + std::to_string(port_num) +
+ "/gids/" + std::to_string(gid_index);
+
+ std::ifstream file(path);
+ if (!file.is_open()) {
+ return -1;
+ }
+
+ std::string line;
+ if (!std::getline(file, line)) {
+ return -2;
+ }
+
+ if (line == "0000:0000:0000:0000:0000:0000:0000:0000" ||
+ line == "::" ||
+ line == "0000:0000:0000:0000:0000:ffff:0000:0000" ) {
+ return 1;
+ }
+ return 0;
+}
+
static bool FindRdmaGid(ibv_context* context) {
bool found = false;
for (int i = g_gid_tbl_len - 1; i >= 0; --i) {
@@ -223,14 +248,23 @@ static bool FindRdmaGid(ibv_context* context) {
if (IbvQueryGid(context, g_port_num, i, &gid) != 0) {
continue;
}
+
if (gid.global.interface_id == 0) {
continue;
}
+
if (FLAGS_rdma_gid_index == i) {
g_gid = gid;
g_gid_index = i;
return true;
}
+
+ const char* device_name_cstr = IbvGetDeviceName(context->device);
+ std::string device_name(device_name_cstr);
+ if(IsSelfGid(device_name, g_port_num, i) != 0) {
+ continue;
+ }
+
Versions
OS: linux 6.1.52-9
Compiler: not compiler-related
brpc: commit id 7229c3608f8cb98b24a0a2e7f99bc01d357d9312
protobuf: not protobuf-related
Additional context/screenshots
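Would specifying FLAGS_rdma_device solve this problem? If not, we could define an rdma_device_gid_filter function that lets users implement their own filtering rule. The IsSelfGid rule works in your environment, but it is unclear whether it applies to other environments.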
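One way the suggested user-defined filter could look is sketched below. Everything here is an assumption for illustration: `RdmaGidFilter`, `SetRdmaGidFilter`, `GlobalGidFilter`, and `PickGidIndex` are hypothetical names, and BRPC does not currently expose such a hook. The sketch only shows the shape of the idea: the traversal stays platform-independent, and the platform-specific rule (for example a sysfs check on Linux) is supplied by the user.

```cpp
#include <functional>
#include <string>
#include <utility>

// Hypothetical hook, not an existing BRPC API: returns true if the GID at
// (device_name, port_num, gid_index) may be selected.
using RdmaGidFilter =
    std::function<bool(const std::string& device_name, int port_num, int gid_index)>;

// Holds the active filter. The default accepts every GID, i.e. no filtering.
inline RdmaGidFilter& GlobalGidFilter() {
    static RdmaGidFilter filter = [](const std::string&, int, int) { return true; };
    return filter;
}

// Installs a user-defined rule; passing an empty function restores accept-all.
inline void SetRdmaGidFilter(RdmaGidFilter f) {
    if (f) {
        GlobalGidFilter() = std::move(f);
    } else {
        GlobalGidFilter() = [](const std::string&, int, int) { return true; };
    }
}

// Walks indexes from high to low and returns the first one the filter accepts,
// or -1 if none is accepted.
inline int PickGidIndex(const std::string& device_name, int port_num, int gid_tbl_len) {
    for (int i = gid_tbl_len - 1; i >= 0; --i) {
        if (GlobalGidFilter()(device_name, port_num, i)) {
            return i;
        }
    }
    return -1;
}
```

A deployment that knows its own rule could install it once at startup, before RDMA initialization, and leave the default in place everywhere else.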
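@yanglimingcn hi, this can be solved simply by specifying FLAGS_rdma_gid_index, but that requires running show_gids beforehand to look up the GID of the current container, which is more of a hassle for business users. IsSelfGid implements the same logic as show_gids. Below is the show_gids script; it also determines which GIDs exist by looking at the subdirectories under /sys/class/infiniband, so I think the two are roughly equivalent.
Could you clarify what you mean by "other environments"?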
cat /usr/sbin/show_gids
#!/bin/bash
#
# Copyright (c) 2016 Mellanox Technologies. All rights reserved.
#
# This Software is licensed under one of the following licenses:
#
# 1) under the terms of the "Common Public License 1.0" a copy of which is
# available from the Open Source Initiative, see
# http://www.opensource.org/licenses/cpl.php.
#
# 2) under the terms of the "The BSD License" a copy of which is
# available from the Open Source Initiative, see
# http://www.opensource.org/licenses/bsd-license.php.
#
# 3) under the terms of the "GNU General Public License (GPL) Version 2" a
# copy of which is available from the Open Source Initiative, see
# http://www.opensource.org/licenses/gpl-license.php.
#
# Licensee has the right to choose one of the above licenses.
#
# Redistributions of source code must retain the above copyright
# notice and one of the license notices.
#
# Redistributions in binary form must reproduce both the above copyright
# notice, one of the license notices in the documentation
# and/or other materials provided with the distribution.
#
# Author: Moni Shoua <[email protected]>
#
black='\E[30;50m'
red='\E[31;50m'
green='\E[32;50m'
yellow='\E[33;50m'
blue='\E[34;50m'
magenta='\E[35;50m'
cyan='\E[36;50m'
white='\E[37;50m'
bold='\033[1m'
gid_count=0
# cecho (color echo) prints text in color.
# first parameter should be the desired color followed by text
function cecho ()
{
echo -en $1
shift
echo -n $*
tput sgr0
}
# becho (color echo) prints text in bold.
becho ()
{
echo -en $bold
echo -n $*
tput sgr0
}
function print_gids()
{
dev=$1
port=$2
for gf in /sys/class/infiniband/$dev/ports/$port/gids/* ; do
gid=$(cat $gf);
if [ $gid = 0000:0000:0000:0000:0000:0000:0000:0000 ] ; then
continue
fi
echo -e $(basename $gf) "\t" $gid
done
}
echo -e "DEV\tPORT\tINDEX\tGID\t\t\t\t\tIPv4 \t\tVER\tDEV"
echo -e "---\t----\t-----\t---\t\t\t\t\t------------ \t---\t---"
DEVS=$1
if [ -z "$DEVS" ] ; then
DEVS=$(ls /sys/class/infiniband/)
fi
for d in $DEVS ; do
for p in $(ls /sys/class/infiniband/$d/ports/) ; do
for g in $(ls /sys/class/infiniband/$d/ports/$p/gids/) ; do
gid=$(cat /sys/class/infiniband/$d/ports/$p/gids/$g);
if [ $gid = 0000:0000:0000:0000:0000:0000:0000:0000 ] ; then
continue
fi
if [ $gid = fe80:0000:0000:0000:0000:0000:0000:0000 ] ; then
continue
fi
_ndev=$(cat /sys/class/infiniband/$d/ports/$p/gid_attrs/ndevs/$g 2>/dev/null)
__type=$(cat /sys/class/infiniband/$d/ports/$p/gid_attrs/types/$g 2>/dev/null)
_type=$(echo $__type| grep -o "[Vv].*")
if [ $(echo $gid | cut -d ":" -f -1) = "0000" ] ; then
ipv4=$(printf "%d.%d.%d.%d" 0x${gid:30:2} 0x${gid:32:2} 0x${gid:35:2} 0x${gid:37:2})
echo -e "$d\t$p\t$g\t$gid\t$ipv4 \t$_type\t$_ndev"
else
echo -e "$d\t$p\t$g\t$gid\t\t\t$_type\t$_ndev"
fi
gid_count=$(expr 1 + $gid_count)
done #g (gid)
done #p (port)
done #d (dev)
echo n_gids_found=$gid_count
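For readers less familiar with bash substring syntax, the IPv4 column in the script above comes from the last two 16-bit groups of the GID text (offsets 30, 32, 35, and 37). A minimal C++ sketch of the same decoding follows. `GidToIpv4` is a hypothetical helper written for illustration, and it assumes the full 39-character sysfs form of the GID:

```cpp
#include <cstdio>
#include <cstdlib>
#include <string>

// Hypothetical helper mirroring the substring offsets used by show_gids:
// decodes the last two 16-bit groups of a sysfs GID string as a dotted IPv4
// address. Returns an empty string unless the text has the full 39-character
// form and its first group is "0000" (the same condition the script checks).
std::string GidToIpv4(const std::string& gid) {
    if (gid.size() != 39 || gid.compare(0, 4, "0000") != 0) {
        return "";
    }
    // Offsets of the four hex byte pairs inside the last two groups.
    const int offsets[4] = {30, 32, 35, 37};
    int bytes[4];
    for (int k = 0; k < 4; ++k) {
        bytes[k] = static_cast<int>(
            std::strtol(gid.substr(offsets[k], 2).c_str(), nullptr, 16));
    }
    char buf[16];  // "255.255.255.255" plus the terminating NUL
    std::snprintf(buf, sizeof(buf), "%d.%d.%d.%d",
                  bytes[0], bytes[1], bytes[2], bytes[3]);
    return buf;
}
```

For example, `GidToIpv4("0000:0000:0000:0000:0000:ffff:c0a8:0164")` yields `192.168.1.100`, while a link-local GID starting with `fe80` yields an empty string.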
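OK, I think putting this into a startup script would also work. By "other environments" I mean non-Linux environments such as macOS and Windows. There are also many kinds of Linux environments, and I am not sure this logic is compatible with all of them, so I think it would be better to provide a more general, platform-independent approach.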
That is a fair point.
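#include <fstream>
#include <string>

// Checks the sysfs entry for gid_index as seen from the current container.
// Returns 0 if the entry holds a usable GID, 1 if it reads back as all zeros
// or as the bare fe80 prefix (treated as not belonging to this container),
// and a negative value if the entry cannot be opened or read.
// The caller skips the index on any non-zero return.
static int IsOthersGid(const std::string& device_name, int port_num, int gid_index) {
#ifdef __linux__
    std::string path = "/sys/class/infiniband/" + device_name +
                       "/ports/" + std::to_string(port_num) +
                       "/gids/" + std::to_string(gid_index);
    std::ifstream file(path);
    if (!file.is_open()) {
        return -1;
    }
    std::string line;
    if (!std::getline(file, line)) {
        return -2;
    }
    if (line == "0000:0000:0000:0000:0000:0000:0000:0000" ||
        line == "fe80:0000:0000:0000:0000:0000:0000:0000") {
        return 1;
    }
    return 0;
#else
    // No sysfs on other platforms: accept every GID, as before the fix.
    LOG(INFO) << "Fix of FindRdmaGid currently only supports linux";
    return 0;
#endif
}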
static bool FindRdmaGid(ibv_context* context) {
    bool found = false;
    const char* device_name_cstr = IbvGetDeviceName(context->device);
    std::string device_name(device_name_cstr);
    for (int i = g_gid_tbl_len - 1; i >= 0; --i) {
        ibv_gid gid;
        if (IbvQueryGid(context, g_port_num, i, &gid) != 0) {
            continue;
        }
        if (gid.global.interface_id == 0) {
            continue;
        }
        if (FLAGS_rdma_gid_index == i) {
            g_gid = gid;
            g_gid_index = i;
            return true;
        }
        if (IsOthersGid(device_name, g_port_num, i) != 0) {
            continue;
        }
        // For infiniband, there is only one GID for each port.
        // For RoCE, there are 2 GIDs for each MAC and 2 GIDs for each IP.
        // Generally, the last GID is a RoCEv2-type GID generated by IP.
        if (!found) {
            g_gid = gid;
            g_gid_index = i;
            found = true;
        }
    }
    if (FLAGS_rdma_gid_index >= 0 && g_gid_index != FLAGS_rdma_gid_index) {
        found = false;
    }
    return found;
}
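To make the effect of the filter easier to check without RDMA hardware, the selection logic of FindRdmaGid above can be mirrored on a plain in-memory table. This is only a sketch under stated assumptions: `GidEntry`, `LooksLikeOthersGid`, and `SelectGidIndex` are hypothetical names, the table stands in for what ibv_query_gid and sysfs would report, and `forced_index` plays the role of FLAGS_rdma_gid_index (negative meaning "not set").

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for one GID table slot.
struct GidEntry {
    uint64_t interface_id;   // as reported by ibv_query_gid; 0 means empty slot
    std::string sysfs_text;  // content of .../gids/<index> seen in this container
};

// Mirrors the sysfs check: true if the entry should be skipped.
bool LooksLikeOthersGid(const std::string& text) {
    return text == "0000:0000:0000:0000:0000:0000:0000:0000" ||
           text == "fe80:0000:0000:0000:0000:0000:0000:0000";
}

// Returns the chosen index, or -1 if nothing qualifies.
int SelectGidIndex(const std::vector<GidEntry>& table, int forced_index) {
    int chosen = -1;
    for (int i = static_cast<int>(table.size()) - 1; i >= 0; --i) {
        if (table[i].interface_id == 0) {
            continue;  // empty slot
        }
        if (forced_index == i) {
            return i;  // explicit index wins and bypasses the sysfs check
        }
        if (LooksLikeOthersGid(table[i].sysfs_text)) {
            continue;  // not visible to this container
        }
        if (chosen < 0) {
            chosen = i;  // highest remaining index
        }
    }
    if (forced_index >= 0 && chosen != forced_index) {
        return -1;  // an explicit index was requested but is not usable
    }
    return chosen;
}
```

With a 10-entry table in which indexes 6 to 9 read back as zeros in sysfs, the function selects index 5 instead of 9, which corresponds to the first container in the reproduction picking its own GID again.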
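@yanglimingcn In the end we settled on this approach. Our environment is fairly uniform: every machine runs the linux 6.1.52-9 kernel with Mellanox NICs, and we have no other environments available, so we cannot test further. One thing worth adding is the error messages. The side that picks the wrong GID reports:
fail to modify QP from INIT to RTR and fail to bringup QP ....
The other side of the connection reports: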
fail to negotiate with xxx