RankTaskGraph

Open lixinqi opened this issue 2 years ago • 4 comments

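Split the TaskGraph logic into BoxingTaskGraph and RankTaskGraph. BoxingTaskGraph builds the boxing-related subgraph of the task graph and serializes it to BoxingTaskGraphProto. RankTaskGraph has two responsibilities: 1) build the CompTaskNodes for a given rank; 2) restore the boxing part of the subgraph from BoxingTaskGraphProto. Distributed compilation will broadly proceed as follows:

  1. On the main thread (or master process), build BoxingTaskGraph from OpGraph and serialize it to BoxingTaskGraphProto;
  2. On each worker thread in the thread pool (or each worker process), build the RankTaskGraph for that rank from OpGraph, BoxingTaskGraphProto, the rank, and related information, then generate that rank's plan.

This PR implements an intermediate version of separate compilation: BoxingTaskGraph is built on the main thread, while RankTaskGraph is built in the thread pool. A follow-up PR will implement fully separate compilation, in which BoxingTaskGraph is built on the master process and RankTaskGraph is built on the worker processes.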
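
The two-step flow can be illustrated with a toy model. This is only a sketch under loud assumptions: `build_boxing_task_graph`, `build_rank_plan`, `compile_plans`, and the dict-based "op graph" are hypothetical stand-ins invented for illustration, not OneFlow APIs, and JSON stands in for BoxingTaskGraphProto. "Boxing" is approximated here as any edge whose endpoints live on different ranks.

```python
import json
from concurrent.futures import ThreadPoolExecutor

# NOTE: every name below is hypothetical; none of this is OneFlow code.

def build_boxing_task_graph(op_graph):
    """Step 1 (main thread): keep only cross-rank edges and serialize them."""
    boxing_edges = [
        (src, dst)
        for src, dst in op_graph["edges"]
        if op_graph["rank_of"][src] != op_graph["rank_of"][dst]
    ]
    # JSON string stands in for the serialized BoxingTaskGraphProto.
    return json.dumps({"boxing_edges": boxing_edges})

def build_rank_plan(op_graph, boxing_proto, rank):
    """Step 2 (worker thread): build this rank's compute nodes and
    restore the part of the boxing subgraph that touches this rank."""
    rank_of = op_graph["rank_of"]
    comp_nodes = sorted(op for op, r in rank_of.items() if r == rank)
    boxing_edges = [
        tuple(edge)
        for edge in json.loads(boxing_proto)["boxing_edges"]
        if rank in (rank_of[edge[0]], rank_of[edge[1]])
    ]
    return {"rank": rank, "comp_nodes": comp_nodes, "boxing_edges": boxing_edges}

def compile_plans(op_graph, world_size):
    boxing_proto = build_boxing_task_graph(op_graph)
    with ThreadPoolExecutor(max_workers=world_size) as pool:
        # One task per rank; pool.map returns results in rank order.
        return list(
            pool.map(lambda r: build_rank_plan(op_graph, boxing_proto, r),
                     range(world_size))
        )

toy_op_graph = {
    "rank_of": {"a": 0, "b": 0, "c": 1},
    "edges": [("a", "b"), ("b", "c")],
}
plans = compile_plans(toy_op_graph, world_size=2)
```

In this toy model the serialized boxing data is the only state shared between the master step and the per-rank step, which is what would let the per-rank step move to a separate process later.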

lixinqi avatar Sep 19 '22 05:09 lixinqi

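An environment variable is provided for switching between compile modes:

  1. ONEFLOW_LAZY_COMPILE_MODE=naive: the old compilation path, which compiles all ranks.
  2. ONEFLOW_LAZY_COMPILE_MODE=rank_per_thread: multi-threaded separate compilation, with each rank compiled in its own thread.
  3. ONEFLOW_LAZY_COMPILE_MODE=rank_per_iter: single-threaded separate compilation, with each rank compiled in one iteration of a loop on the main thread.

If multi-threaded separate compilation hits a bug, rerun with single-threaded separate compilation.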
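
The difference between the two separate-compilation modes can be sketched as follows. This is an illustration only: `run_per_rank` and the `compile_rank` callback are hypothetical names, not part of OneFlow, and the `naive` mode is deliberately left out because it does not compile rank by rank.

```python
from concurrent.futures import ThreadPoolExecutor

def run_per_rank(compile_rank, world_size, mode):
    """Hypothetical dispatcher: run compile_rank(rank) for every rank.

    mode mirrors the ONEFLOW_LAZY_COMPILE_MODE values for the two
    separate-compilation modes; anything else is rejected in this sketch.
    """
    if mode == "rank_per_thread":
        # Each rank is compiled in its own worker thread.
        with ThreadPoolExecutor(max_workers=world_size) as pool:
            return list(pool.map(compile_rank, range(world_size)))
    if mode == "rank_per_iter":
        # Each rank is compiled in one iteration of a loop on the calling thread.
        return [compile_rank(rank) for rank in range(world_size)]
    raise ValueError(f"mode not covered by this sketch: {mode!r}")

threaded = run_per_rank(lambda rank: f"plan-{rank}", 4, "rank_per_thread")
looped = run_per_rank(lambda rank: f"plan-{rank}", 4, "rank_per_iter")
```

Because both branches call the same per-rank function, a failure that shows up only under `rank_per_thread` points at a threading problem, while a failure under both points at the per-rank logic itself. That is the reason for rerunning in single-threaded mode when debugging.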

lixinqi avatar Sep 23 '22 12:09 lixinqi

  • https://github.com/Oneflow-Inc/oneflow/pull/9108/commits/fa49459c99f2df912f68b8c7eabcad7bca40388b
  • https://github.com/Oneflow-Inc/libai/commit/6273c06b15f5499d881d45da4ec93218ba34b6f6
  • oneflow-25 & oneflow-28
  • T5 3D-parallel test case: t5_nl12_nah12_hs768_fp16_actrue_mp2_pp2_mb8_gb128_2n4g
  • export ONEFLOW_LAZY_COMPILE_MODE=rank_per_iter fails with an error. Log:
F20221010 03:57:00.982542 1494408 task_graph.cpp:1204] Check failed: src->parallel_desc_sym() == dst->parallel_desc_sym()
*** Check failure stack trace: ***
    @     0x7f07f13a813a  google::LogMessage::Fail()
    @     0x7f07f13a8422  google::LogMessage::SendToLog()
    @     0x7f07f13a7ca7  google::LogMessage::Flush()
    @     0x7f07f13aa819  google::LogMessageFatal::~LogMessageFatal()
    @     0x7f07e973280d  _ZNSt17_Function_handlerIFvPKN7oneflow6OpNodeES3_EZNS0_13RankTaskGraph4InitERKSt13unordered_setISsSt4hashISsESt8equal_toISsESaISsEEbEUlS3_S3_E8_E9_M_invokeERKSt9_Any_dataOS3_SK_
    @     0x7f07e972aa75  _ZZN7oneflow12_GLOBAL__N_131ForEachOpGraphNecessaryCtrlEdgeIXadL_ZNKS_7OpGraph30cached_predicator_is_reachableEvEEEEvPKS2_RKSt8functionIFvPKNS_6OpNodeES8_EEENKUlPS6_E_clESD_
    @     0x7f07e973d6d4  oneflow::RankTaskGraph::Init()
    @     0x7f07e98a8549  oneflow::RankCompiler::Compile()
    @     0x7f07e9005fb4  _ZZZZN7oneflow7NNGraph17MasterRankCompileIXadL_ZNS_16SingleThreadLoopEmRKSt8functionIFvmEEEEEENS_5MaybeIvvEEvENKUlvE_clEvENKUlmE0_clEmENKUlPKcE0_clESC_
    @     0x7f07e90096bb  _ZZZN7oneflow7NNGraph17MasterRankCompileIXadL_ZNS_16SingleThreadLoopEmRKSt8functionIFvmEEEEEENS_5MaybeIvvEEvENKUlvE_clEvENKUlmE0_clEm
    @     0x7f07ea7a1a99  oneflow::SingleThreadLoop()
    @     0x7f07e90059f4  _ZZN7oneflow7NNGraph17MasterRankCompileIXadL_ZNS_16SingleThreadLoopEmRKSt8functionIFvmEEEEEENS_5MaybeIvvEEvENKUlvE_clEv
    @     0x7f07e9005f29  _ZNSt17_Function_handlerIFN7oneflow5MaybeIvvEEvEZNS0_7NNGraph17MasterRankCompileIXadL_ZNS0_16SingleThreadLoopEmRKSt8functionIFvmEEEEEES2_vEUlvE_E9_M_invokeERKSt9_Any_data
    @     0x7f07e96f10f7  oneflow::OpGraph::WithSingleton()
    @     0x7f07e900651e  oneflow::NNGraph::MasterRankCompile<>()
    @     0x7f07e8ffb284  oneflow::NNGraph::CompileAndInitRuntime()
    @     0x7f08afbedc3c  (unknown)
    @     0x7f08afb39e69  (unknown)
    @     0x56140b77500e  cfunction_call_varargs
    @     0x56140b76a13f  _PyObject_MakeTpCall
    @     0x56140b79fca0  method_vectorcall
    @     0x56140b814923  _PyEval_EvalFrameDefault
    @     0x56140b8067e7  _PyFunction_Vectorcall
    @     0x56140b79fb2e  method_vectorcall
    @     0x56140b814923  _PyEval_EvalFrameDefault
    @     0x56140b805600  _PyEval_EvalCodeWithName
    @     0x56140b806bc4  _PyFunction_Vectorcall
    @     0x56140b79fbf8  method_vectorcall
    @     0x56140b7704a9  PyObject_Call
    @     0x56140b8118a7  _PyEval_EvalFrameDefault
    @     0x56140b8060ff  _PyEval_EvalCodeWithName
    @     0x56140b806bc4  _PyFunction_Vectorcall
F20221010 03:57:01.748999 1494545 ctrl_client.cpp:54] Check failed: rpc_client_.GetStubAt(i)->CallMethod<CtrlMethod::kLoadServer>( &client_ctx, request, &response).error_code() == grpc::StatusCode::OK (14 vs. 0) Machine 0 lost
*** Check failure stack trace: ***
    @     0x7f6487caf13a  google::LogMessage::Fail()
    @     0x7f6487caf422  google::LogMessage::SendToLog()
    @     0x7f6487caeca7  google::LogMessage::Flush()
    @     0x7f6487cb1819  google::LogMessageFatal::~LogMessageFatal()
    @     0x7f647c975be5  _ZZN7oneflow14GrpcCtrlClientC4ERKNS_10ProcessCtxEENKUlvE_clEv
    @     0x7f6487cc3b7f  execute_native_thread_routine
    @     0x7f65589c1609  start_thread
    @     0x7f65588e6133  clone
  • export ONEFLOW_LAZY_COMPILE_MODE=naive runs normally

xyn1201 avatar Oct 10 '22 04:10 xyn1201

Code got formatted by CI. Please request CI again if you still want to have this PR merged. If the PR is from a forked repo, please download the patch files from the GitHub Actions web page and apply them locally.

github-actions[bot] avatar Nov 22 '22 08:11 github-actions[bot]
