oneflow
RankTaskGraph
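Split the TaskGraph logic into BoxingTaskGraph and RankTaskGraph. BoxingTaskGraph builds the boxing-related subgraph of the task graph and serializes it to BoxingTaskGraphProto. RankTaskGraph has two responsibilities: 1) build the CompTaskNodes for a specified rank; 2) restore the boxing part of the subgraph from BoxingTaskGraphProto.
The overall distributed compilation process will be:
- On the main thread (or master process), build the BoxingTaskGraph from the OpGraph and serialize it to BoxingTaskGraphProto;
- On each worker thread in the thread pool (or each worker process), build that rank's RankTaskGraph from the OpGraph, the BoxingTaskGraphProto, and the rank, then generate that rank's plan.
This PR implements an intermediate version of separate compilation: BoxingTaskGraph is built on the main thread, while RankTaskGraph is built in the thread pool. Follow-up PRs will implement fully separate compilation, where BoxingTaskGraph is built on the master process and RankTaskGraph is built on the worker processes.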
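The two-phase flow described here can be illustrated with a toy Python sketch. It is not OneFlow code: the op table, the JSON "proto", and every function name are hypothetical stand-ins, and "boxing" is simplified to "an edge whose endpoints have different placements". It only shows the shape of the idea: the main thread serializes the boxing part once, and each rank then builds its own compute nodes and restores the boxing edges it takes part in.

```python
import json
from concurrent.futures import ThreadPoolExecutor

# Hypothetical toy op graph: op name -> ranks it is placed on, plus edges.
OPS = {"embed": [0, 1], "matmul": [0, 1], "loss": [2, 3]}
EDGES = [("embed", "matmul"), ("matmul", "loss")]


def build_boxing_proto(ops, edges):
    """Main thread: collect edges whose endpoints differ in placement
    (a simplified stand-in for boxing) and serialize them once."""
    boxing = [
        {"src": s, "dst": d, "ranks": sorted(set(ops[s]) | set(ops[d]))}
        for s, d in edges
        if ops[s] != ops[d]
    ]
    return json.dumps({"boxing_edges": boxing})


def build_rank_plan(rank, ops, boxing_proto):
    """Worker: build this rank's compute nodes, then restore only the
    boxing edges that involve this rank from the serialized form."""
    comp_nodes = sorted(name for name, ranks in ops.items() if rank in ranks)
    restored = json.loads(boxing_proto)["boxing_edges"]
    boxing_edges = [(e["src"], e["dst"]) for e in restored if rank in e["ranks"]]
    return {"rank": rank, "comp_nodes": comp_nodes, "boxing_edges": boxing_edges}


def compile_all(ops, edges, world_size):
    """Intermediate version: boxing on the main thread, ranks in a thread pool."""
    proto = build_boxing_proto(ops, edges)
    with ThreadPoolExecutor(max_workers=world_size) as pool:
        return list(pool.map(lambda r: build_rank_plan(r, ops, proto), range(world_size)))


plans = compile_all(OPS, EDGES, world_size=4)
```

Because each worker needs only the op graph, the serialized boxing data, and its rank, the same `build_rank_plan` call could in principle run in a separate process, which is the direction the follow-up PRs describe.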
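An environment variable switches between compile modes:
- ONEFLOW_LAZY_COMPILE_MODE=naive: the old compile path, which compiles for all ranks.
- ONEFLOW_LAZY_COMPILE_MODE=rank_per_thread: multi-threaded separate compilation; each rank is compiled in its own thread.
- ONEFLOW_LAZY_COMPILE_MODE=rank_per_iter: single-threaded separate compilation; each rank is compiled in one iteration of a loop on the main thread.
If multi-threaded separate compilation hits a bug, fall back to single-threaded separate compilation and run again.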
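As a rough illustration of how such a switch might be dispatched, the Python sketch below maps the three mode names to a per-rank loop. Only the variable name and its three values come from this PR. The helper names, the callbacks, and the `naive` default for an unset variable are assumptions made for the example; the real dispatch lives in OneFlow's C++ code and may differ.

```python
import os
from concurrent.futures import ThreadPoolExecutor

ENV_VAR = "ONEFLOW_LAZY_COMPILE_MODE"


def single_thread_loop(n, fn):
    """rank_per_iter: call fn(rank) once per loop iteration on the calling thread."""
    return [fn(rank) for rank in range(n)]


def multi_thread_loop(n, fn):
    """rank_per_thread: call fn(rank) for each rank on its own pool thread."""
    with ThreadPoolExecutor(max_workers=max(n, 1)) as pool:
        return list(pool.map(fn, range(n)))


def compile_with_mode(world_size, compile_rank, compile_all_ranks, env=None):
    """Pick a compile strategy from the environment.

    compile_rank(rank) and compile_all_ranks(world_size) are hypothetical
    callbacks. Defaulting to "naive" when the variable is unset is an
    assumption of this sketch, not something the PR states.
    """
    env = os.environ if env is None else env
    mode = env.get(ENV_VAR, "naive")
    if mode == "naive":
        return compile_all_ranks(world_size)
    if mode == "rank_per_thread":
        return multi_thread_loop(world_size, compile_rank)
    if mode == "rank_per_iter":
        return single_thread_loop(world_size, compile_rank)
    raise ValueError(f"unknown {ENV_VAR} value: {mode!r}")


def demo_compile_rank(rank):
    return f"plan-{rank}"


def demo_compile_all(world_size):
    return [f"plan-{rank}" for rank in range(world_size)]


result = compile_with_mode(
    4, demo_compile_rank, demo_compile_all, env={ENV_VAR: "rank_per_iter"}
)
```

Keeping the per-rank function identical across the two loops is what makes the fallback useful for debugging: if a failure disappears under `rank_per_iter`, the per-rank logic is probably sound and the problem is more likely a threading issue.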
- https://github.com/Oneflow-Inc/oneflow/pull/9108/commits/fa49459c99f2df912f68b8c7eabcad7bca40388b
- https://github.com/Oneflow-Inc/libai/commit/6273c06b15f5499d881d45da4ec93218ba34b6f6
- oneflow-25 & oneflow-28
- t5 3D-parallel test case
t5_nl12_nah12_hs768_fp16_actrue_mp2_pp2_mb8_gb128_2n4g
- export ONEFLOW_LAZY_COMPILE_MODE=rank_per_iter
Fails with the following error log:
F20221010 03:57:00.982542 1494408 task_graph.cpp:1204] Check failed: src->parallel_desc_sym() == dst->parallel_desc_sym()
*** Check failure stack trace: ***
@ 0x7f07f13a813a google::LogMessage::Fail()
@ 0x7f07f13a8422 google::LogMessage::SendToLog()
@ 0x7f07f13a7ca7 google::LogMessage::Flush()
@ 0x7f07f13aa819 google::LogMessageFatal::~LogMessageFatal()
@ 0x7f07e973280d _ZNSt17_Function_handlerIFvPKN7oneflow6OpNodeES3_EZNS0_13RankTaskGraph4InitERKSt13unordered_setISsSt4hashISsESt8equal_toISsESaISsEEbEUlS3_S3_E8_E9_M_invokeERKSt9_Any_dataOS3_SK_
@ 0x7f07e972aa75 _ZZN7oneflow12_GLOBAL__N_131ForEachOpGraphNecessaryCtrlEdgeIXadL_ZNKS_7OpGraph30cached_predicator_is_reachableEvEEEEvPKS2_RKSt8functionIFvPKNS_6OpNodeES8_EEENKUlPS6_E_clESD_
@ 0x7f07e973d6d4 oneflow::RankTaskGraph::Init()
@ 0x7f07e98a8549 oneflow::RankCompiler::Compile()
@ 0x7f07e9005fb4 _ZZZZN7oneflow7NNGraph17MasterRankCompileIXadL_ZNS_16SingleThreadLoopEmRKSt8functionIFvmEEEEEENS_5MaybeIvvEEvENKUlvE_clEvENKUlmE0_clEmENKUlPKcE0_clESC_
@ 0x7f07e90096bb _ZZZN7oneflow7NNGraph17MasterRankCompileIXadL_ZNS_16SingleThreadLoopEmRKSt8functionIFvmEEEEEENS_5MaybeIvvEEvENKUlvE_clEvENKUlmE0_clEm
@ 0x7f07ea7a1a99 oneflow::SingleThreadLoop()
@ 0x7f07e90059f4 _ZZN7oneflow7NNGraph17MasterRankCompileIXadL_ZNS_16SingleThreadLoopEmRKSt8functionIFvmEEEEEENS_5MaybeIvvEEvENKUlvE_clEv
@ 0x7f07e9005f29 _ZNSt17_Function_handlerIFN7oneflow5MaybeIvvEEvEZNS0_7NNGraph17MasterRankCompileIXadL_ZNS0_16SingleThreadLoopEmRKSt8functionIFvmEEEEEES2_vEUlvE_E9_M_invokeERKSt9_Any_data
@ 0x7f07e96f10f7 oneflow::OpGraph::WithSingleton()
@ 0x7f07e900651e oneflow::NNGraph::MasterRankCompile<>()
@ 0x7f07e8ffb284 oneflow::NNGraph::CompileAndInitRuntime()
@ 0x7f08afbedc3c (unknown)
@ 0x7f08afb39e69 (unknown)
@ 0x56140b77500e cfunction_call_varargs
@ 0x56140b76a13f _PyObject_MakeTpCall
@ 0x56140b79fca0 method_vectorcall
@ 0x56140b814923 _PyEval_EvalFrameDefault
@ 0x56140b8067e7 _PyFunction_Vectorcall
@ 0x56140b79fb2e method_vectorcall
@ 0x56140b814923 _PyEval_EvalFrameDefault
@ 0x56140b805600 _PyEval_EvalCodeWithName
@ 0x56140b806bc4 _PyFunction_Vectorcall
@ 0x56140b79fbf8 method_vectorcall
@ 0x56140b7704a9 PyObject_Call
@ 0x56140b8118a7 _PyEval_EvalFrameDefault
@ 0x56140b8060ff _PyEval_EvalCodeWithName
@ 0x56140b806bc4 _PyFunction_Vectorcall
F20221010 03:57:01.748999 1494545 ctrl_client.cpp:54] Check failed: rpc_client_.GetStubAt(i)->CallMethod<CtrlMethod::kLoadServer>( &client_ctx, request, &response).error_code() == grpc::StatusCode::OK (14 vs. 0) Machine 0 lost
*** Check failure stack trace: ***
@ 0x7f6487caf13a google::LogMessage::Fail()
@ 0x7f6487caf422 google::LogMessage::SendToLog()
@ 0x7f6487caeca7 google::LogMessage::Flush()
@ 0x7f6487cb1819 google::LogMessageFatal::~LogMessageFatal()
@ 0x7f647c975be5 _ZZN7oneflow14GrpcCtrlClientC4ERKNS_10ProcessCtxEENKUlvE_clEv
@ 0x7f6487cc3b7f execute_native_thread_routine
@ 0x7f65589c1609 start_thread
@ 0x7f65588e6133 clone
- export ONEFLOW_LAZY_COMPILE_MODE=naive
Runs normally.
Code got formatted by CI. Please request CI again if you still want to have this PR merged. If the PR is from a forked repo, please download the patch files from the GitHub Actions web page and apply them locally.