oneflow icon indicating copy to clipboard operation
oneflow copied to clipboard

add master broadcast perf tool

Open Pang-GJ opened this issue 1 year ago • 1 comments

构造一个 CtrlClient 同步数据的模拟工具,用于评估master向worker广播的性能。

update: 以下问题已解决

开启3个进程测试功能: 发现仍存在问题待解决:

W20230327 06:15:53.597208 2091708 rank_info_bootstrap_server.cpp:103] BootstrapServer not ready, rpc server on some rank have not been created successfully. Failed at 2 times, total ranks(world_size): 3, ready ranks: 2

尝试解决:

  1. 多次执行发现个别尝试能够3个进程都 ready,但大多数时候不行,考虑重试次数太少
DEFINE_ENV_INTEGER(ONEFLOW_RPC_BOOTSTRAP_SERVER_MAX_RETRY_TIMES, 3);
DEFINE_ENV_INTEGER(ONEFLOW_RPC_CLIENT_MAX_RETRY_TIMES, 6);

将上两个重试次数更改为1000次,依旧失败.

  1. 在链路加上大量log,排查原因 上述问题出现在 Boostrap 阶段
  CHECK_JUST(Singleton<RpcManager>::Get()->CreateServer());
  LOG(WARNING) << "rank " << rank << " create server success";
  CHECK_JUST(Singleton<RpcManager>::Get()->Bootstrap());
  LOG(WARNING) << "rank " << rank << " bootstrap success";
  CHECK_JUST(Singleton<RpcManager>::Get()->CreateClient());
  LOG(WARNING) << "rank " << rank << " create client success";

进一步查找是在 LoadServer 阶段: 于是在 RPC server中加上log:

Add([this](CtrlCall<CtrlMethod::kLoadServer>* call) {
    LOG(WARNING) << "RPC Server receive KLoadServer call"; // 增加log
    OnLoadServer(call);
  });

在 OnLoadServer中加入log:

void RankInfoBootstrapServer::OnLoadServer(CtrlCall<CtrlMethod::kLoadServer>* call) {
  int64_t rank = call->request().rank();
  CHECK_GE(rank, 0);
  CHECK_LT(rank, world_size_);
  if (!rank2host_) { rank2host_ = std::make_shared<std::vector<std::string>>(world_size_); }
  std::lock_guard<std::mutex> lock(lock_);
  rank2host_->at(rank) = GetHostFromUri(call->server_ctx().peer());
  LOG(WARNING) << "rank " << rank << " 's url: " << call->server_ctx().peer(); // 增加log,表示rank load ready
  call->SendResponse();
  EnqueueRequest<CtrlMethod::kLoadServer>();
}

再在RPC client中加入log:

void RpcClient::LoadServer(const LoadServerRequest& request, CtrlService::Stub* stub) {
  int32_t retry_idx = 0;
  int32_t skip_warning_times = 3;
  for (; retry_idx < rpc_client_max_retry_times(); ++retry_idx) {
    grpc::ClientContext client_ctx;
    LoadServerResponse response;
    LOG(WARNING) << "rank " << request.rank() << " call LoadServer to " << request.addr()
                 << " at retry " << retry_idx;  // 加入log
    grpc::Status st = stub->CallMethod<CtrlMethod::kLoadServer>(&client_ctx, request, &response);
    if (st.error_code() == grpc::StatusCode::OK) {
      VLOG(3) << "LoadServer " << request.addr() << " Successful at " << retry_idx + 1 << " times";
      break;
    } else if (st.error_code() == grpc::StatusCode::UNAVAILABLE) {
      if (retry_idx >= skip_warning_times) {
        LOG(WARNING) << "LoadServer " << request.addr() << " Failed at " << retry_idx + 1
                     << " times"
                     << " error_code: " << st.error_code()
                     << " error_message: " << st.error_message();
      }
      std::this_thread::sleep_for(std::chrono::seconds(rpc_client_sleep_seconds()));
      continue;
    } else {
      LOG(FATAL) << st.error_message();
    }
  }
  LOG(WARNING) << "rank " << request.rank() << " ends LoadServer loop";  // 加入log,表示退出重试循环
  CHECK_LT(retry_idx, rpc_client_max_retry_times());
}

运行程序得到log 在master中:

W20230327 06:15:13.594259 2091639 env_global_objects_scope.cpp:176] rank 0 create server success
W20230327 06:15:13.595381 2091639 rpc_client.cpp:196] rank 0 call LoadServer to 127.0.0.1 at retry 0
W20230327 06:15:13.596731 2091707 rpc_server.cpp:67] RPC Server receive KLoadServer call from rank 0
W20230327 06:15:13.596812 2091707 rank_info_bootstrap_server.cpp:130] rank 0 's url: ipv4:127.0.0.1:34044
W20230327 06:15:13.597214 2091639 rpc_client.cpp:216] rank 0 ends LoadServer loop
W20230327 06:15:14.802946 2091707 rpc_server.cpp:67] RPC Server receive KLoadServer call from rank 2
W20230327 06:15:14.803052 2091707 rank_info_bootstrap_server.cpp:130] rank 2 's url: ipv4:127.0.0.1:57382
W20230327 06:15:15.993507 2091706 rpc_server.cpp:67] RPC Server receive KLoadServer call from rank 1
W20230327 06:15:53.597208 2091708 rank_info_bootstrap_server.cpp:103] BootstrapServer not ready, rpc server on some rank have not been created successfully. Failed at 2 times, total ranks(world_size): 3, ready ranks: 2
W20230327 06:16:13.597550 2091708 rank_info_bootstrap_server.cpp:103] BootstrapServer not ready, rpc server on some rank have not been created successfully. Failed at 3 times, total ranks(world_size): 3, ready ranks: 2

在worker1中:

W20230327 06:15:15.991744 2091981 env_global_objects_scope.cpp:176] rank 1 create server success
W20230327 06:15:15.992609 2091981 rpc_client.cpp:196] rank 1 call LoadServer to 127.0.0.1 at retry 0
W20230327 06:15:15.993794 2091981 rpc_client.cpp:216] rank 1 ends LoadServer loop

在worker2中:

W20230327 06:15:14.799631 2091909 env_global_objects_scope.cpp:176] rank 2 create server success
W20230327 06:15:14.801512 2091909 rpc_client.cpp:196] rank 2 call LoadServer to 127.0.0.1 at retry 0
W20230327 06:15:14.803706 2091909 rpc_client.cpp:216] rank 2 ends LoadServer loop

分析log内容: 首选在master中,可以看到master收到了3次RPC Server receive KLoadServer call,但只执行了2次,即只load了rank0和rank2,rank1并没有被load, 再看worker中,rank1和rank2都结束了LoadServer loop ,也就是grpc返回了OK 这样看的话,应该不是客户端中出现的问题,而是server收到了请求,但并没有处理

Pang-GJ avatar Mar 27 '23 06:03 Pang-GJ