brpc [E1008]Reached timeout=10000ms

Describe the bug: After the brpc client's connection has been idle for a while (about 1.5 h), the next request fails with [E1008]Reached timeout=10000ms. About 15 minutes later, the problem stops occurring.

To Reproduce: Leave the brpc client's connection idle for a while (about 1.5 h), then send a request. It fails with [E1008]Reached timeout=10000ms.

Expected behavior: The request succeeds without an E1008 error.

Versions: OS: centos7.9; Compiler: gcc 9.3.1; brpc: -; protobuf: -

Additional context/screenshots

Client::Client() {
  auto config = ConfigManager::GetConfigManager();

  brpc::ChannelOptions options;

  options.protocol = config->GetConfigParam<std::string>(
      "risk_control", "protocol", "baidu_std");
  options.connection_type = config->GetConfigParam<std::string>(
      "risk_control", "connection_type", "");
  options.timeout_ms =
      config->GetConfigParam<int>("risk_control", "timeout_ms", "100");
  options.max_retry =
      config->GetConfigParam<int>("risk_control", "max_retry", "3");

  const std::string server = config->GetConfigParam<std::string>(
      "risk_control", "server", "0.0.0.0:8000");
  const std::string load_balancer =
      config->GetConfigParam<std::string>("risk_control", "load_balancer", "");

  if (channel_.Init(server.c_str(), load_balancer.c_str(), &options) != 0) {
    SLOG_ERROR("", "RiskControl", "Client Init failed", 0, 0);
  }
  stub_.reset(new rc::AsyncRcService_Stub(&channel_));
}

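My preliminary analysis is that the TCP connection went half-dead (it still looks open but no longer carries traffic) without the application layer noticing, which triggered TCP timeout retransmission.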
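The roughly 15 minutes until the error clears is at least consistent with Linux's default retransmission limit: with net.ipv4.tcp_retries2 = 15, the kernel documentation quotes a hypothetical timeout of 924.6 s. Below is a rough sketch of that arithmetic, assuming an initial RTO of 200 ms that doubles up to a 120 s cap (Linux defaults). The real RTO depends on the measured RTT, and the function name is made up for illustration.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical helper: models how long a Linux sender keeps retransmitting
// unacknowledged data before it gives up on the connection.
// Assumption: the RTO starts at rto_min_ms, doubles after every unanswered
// transmission, and is capped at rto_max_ms.
std::int64_t HypotheticalTcpTimeoutMs(int retries,
                                      std::int64_t rto_min_ms = 200,
                                      std::int64_t rto_max_ms = 120000) {
  std::int64_t total_ms = 0;
  std::int64_t rto_ms = rto_min_ms;
  // `retries` retransmissions plus the final wait after the last one
  // gives retries + 1 intervals in total.
  for (int i = 0; i <= retries; ++i) {
    total_ms += rto_ms;
    rto_ms = std::min(rto_ms * 2, rto_max_ms);
  }
  return total_ms;
}
```

Calling HypotheticalTcpTimeoutMs(15) returns 924600 ms, about 15.4 minutes. If this is what happens here, the connection is only torn down and re-established after the kernel exhausts its retransmissions, which would explain why requests start succeeding again.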

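lsof shows the TCP connections on both the client and the server in the ESTABLISHED state, so I am puzzled:

In what scenarios does a TCP connection go half-dead?
How can one recover from a half-dead TCP connection quickly?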

Jul 09 '24 05:07 clee01

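Has this been resolved? I am seeing the same error. The scenario:

brpc server on node A
brpc client on node B, periodically sending requests to node A and receiving replies
Steps:
On node B, run ifconfig ethX down (ethX is the NIC used by the brpc connection)
Wait about 30 s, then run ifconfig ethX up

Expected result: brpc communication works normally within about 1 s after the NIC comes up.
Actual result: about half the time, "[E1008]Reached timeout=" appears 8-15 s after the NIC comes up, even though the brpc server is healthy.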

Sep 07 '24 07:09 cfycyf

You could try the circuit breaker strategy.

Sep 13 '24 02:09 chenBright