oneflow
oneflow copied to clipboard
add master broadcast perf tool
构造一个 CtrlClient 同步数据的模拟工具,用于评估master向worker广播的性能。
update: 以下问题已解决
开启3个进程测试功能: 发现仍存在问题待解决:
W20230327 06:15:53.597208 2091708 rank_info_bootstrap_server.cpp:103] BootstrapServer not ready, rpc server on some rank have not been created successfully. Failed at 2 times, total ranks(world_size): 3, ready ranks: 2
尝试解决:
- 多次执行发现个别尝试能够3个进程都 ready,但大多数时候不行,考虑重试次数太少
DEFINE_ENV_INTEGER(ONEFLOW_RPC_BOOTSTRAP_SERVER_MAX_RETRY_TIMES, 3);
DEFINE_ENV_INTEGER(ONEFLOW_RPC_CLIENT_MAX_RETRY_TIMES, 6);
将上两个重试次数更改为1000次,依旧失败.
- 在链路加上大量log,排查原因 上述问题出现在 Boostrap 阶段
CHECK_JUST(Singleton<RpcManager>::Get()->CreateServer());
LOG(WARNING) << "rank " << rank << " create server success";
CHECK_JUST(Singleton<RpcManager>::Get()->Bootstrap());
LOG(WARNING) << "rank " << rank << " bootstrap success";
CHECK_JUST(Singleton<RpcManager>::Get()->CreateClient());
LOG(WARNING) << "rank " << rank << " create client success";
进一步查找是在 LoadServer 阶段: 于是在 RPC server中加上log:
Add([this](CtrlCall<CtrlMethod::kLoadServer>* call) {
LOG(WARNING) << "RPC Server receive KLoadServer call"; // 增加log
OnLoadServer(call);
});
在 OnLoadServer中加入log:
void RankInfoBootstrapServer::OnLoadServer(CtrlCall<CtrlMethod::kLoadServer>* call) {
int64_t rank = call->request().rank();
CHECK_GE(rank, 0);
CHECK_LT(rank, world_size_);
if (!rank2host_) { rank2host_ = std::make_shared<std::vector<std::string>>(world_size_); }
std::lock_guard<std::mutex> lock(lock_);
rank2host_->at(rank) = GetHostFromUri(call->server_ctx().peer());
LOG(WARNING) << "rank " << rank << " 's url: " << call->server_ctx().peer(); // 增加log,表示rank load ready
call->SendResponse();
EnqueueRequest<CtrlMethod::kLoadServer>();
}
再在RPC client中加入log:
void RpcClient::LoadServer(const LoadServerRequest& request, CtrlService::Stub* stub) {
int32_t retry_idx = 0;
int32_t skip_warning_times = 3;
for (; retry_idx < rpc_client_max_retry_times(); ++retry_idx) {
grpc::ClientContext client_ctx;
LoadServerResponse response;
LOG(WARNING) << "rank " << request.rank() << " call LoadServer to " << request.addr()
<< " at retry " << retry_idx; // 加入log
grpc::Status st = stub->CallMethod<CtrlMethod::kLoadServer>(&client_ctx, request, &response);
if (st.error_code() == grpc::StatusCode::OK) {
VLOG(3) << "LoadServer " << request.addr() << " Successful at " << retry_idx + 1 << " times";
break;
} else if (st.error_code() == grpc::StatusCode::UNAVAILABLE) {
if (retry_idx >= skip_warning_times) {
LOG(WARNING) << "LoadServer " << request.addr() << " Failed at " << retry_idx + 1
<< " times"
<< " error_code: " << st.error_code()
<< " error_message: " << st.error_message();
}
std::this_thread::sleep_for(std::chrono::seconds(rpc_client_sleep_seconds()));
continue;
} else {
LOG(FATAL) << st.error_message();
}
}
LOG(WARNING) << "rank " << request.rank() << " ends LoadServer loop"; // 加入log,表示退出重试循环
CHECK_LT(retry_idx, rpc_client_max_retry_times());
}
运行程序得到log 在master中:
W20230327 06:15:13.594259 2091639 env_global_objects_scope.cpp:176] rank 0 create server success
W20230327 06:15:13.595381 2091639 rpc_client.cpp:196] rank 0 call LoadServer to 127.0.0.1 at retry 0
W20230327 06:15:13.596731 2091707 rpc_server.cpp:67] RPC Server receive KLoadServer call from rank 0
W20230327 06:15:13.596812 2091707 rank_info_bootstrap_server.cpp:130] rank 0 's url: ipv4:127.0.0.1:34044
W20230327 06:15:13.597214 2091639 rpc_client.cpp:216] rank 0 ends LoadServer loop
W20230327 06:15:14.802946 2091707 rpc_server.cpp:67] RPC Server receive KLoadServer call from rank 2
W20230327 06:15:14.803052 2091707 rank_info_bootstrap_server.cpp:130] rank 2 's url: ipv4:127.0.0.1:57382
W20230327 06:15:15.993507 2091706 rpc_server.cpp:67] RPC Server receive KLoadServer call from rank 1
W20230327 06:15:53.597208 2091708 rank_info_bootstrap_server.cpp:103] BootstrapServer not ready, rpc server on some rank have not been created successfully. Failed at 2 times, total ranks(world_size): 3, ready ranks: 2
W20230327 06:16:13.597550 2091708 rank_info_bootstrap_server.cpp:103] BootstrapServer not ready, rpc server on some rank have not been created successfully. Failed at 3 times, total ranks(world_size): 3, ready ranks: 2
在worker1中:
W20230327 06:15:15.991744 2091981 env_global_objects_scope.cpp:176] rank 1 create server success
W20230327 06:15:15.992609 2091981 rpc_client.cpp:196] rank 1 call LoadServer to 127.0.0.1 at retry 0
W20230327 06:15:15.993794 2091981 rpc_client.cpp:216] rank 1 ends LoadServer loop
在worker2中:
W20230327 06:15:14.799631 2091909 env_global_objects_scope.cpp:176] rank 2 create server success
W20230327 06:15:14.801512 2091909 rpc_client.cpp:196] rank 2 call LoadServer to 127.0.0.1 at retry 0
W20230327 06:15:14.803706 2091909 rpc_client.cpp:216] rank 2 ends LoadServer loop
分析log内容:
首选在master中,可以看到master收到了3次RPC Server receive KLoadServer call
,但只执行了2次,即只load了rank0和rank2,rank1并没有被load,
再看worker中,rank1和rank2都结束了LoadServer loop
,也就是grpc返回了OK
这样看的话,应该不是客户端中出现的问题,而是server收到了请求,但并没有处理