OpGraph init cost too much

Open strint opened this issue 3 years ago • 0 comments

OpGraph init is the main cost of a pass.

See: https://github.com/Oneflow-Inc/libai/issues/407#issuecomment-1286776427

The cost of each step inside it is as follows:

Maybe<void> OpGraph::Init(const Job& job) {
  auto cost_ct = std::make_unique<TimeCounter<std::chrono::milliseconds>>(true, true);
  InitNodes(job);
  cost_ct->Count("OpGraph0", 1);
  op_name2op_node_.reserve(job.net().op_size());
  ForEachNode([&](OpNode* node) {
    CHECK(op_name2op_node_.emplace(node->op().op_name(), node).second)
        << "op_name: " << node->op().op_name();
  });
  cost_ct->Count("OpGraph1", 1);
  InitEdges();
  cost_ct->Count("OpGraph2", 1);
  InitProducerOpName2CtrlConsumerOpNames(job);
  cost_ct->Count("OpGraph3", 1);
  CheckIsDAG();
  cost_ct->Count("OpGraph4", 1);
  ForEachNode([](OpNode* node) { node->InitLbi2SourceNode(); });
  cost_ct->Count("OpGraph5", 1);
  InferBlobLastUsed();
  cost_ct->Count("OpGraph6", 1);
  InferTimeShape();
  cost_ct->Count("OpGraph7", 1);
  {
    LazyMode::Guard enable_lazy_mode_guard(true);
    JUST(InferLogicalBlobDesc(job));
  }
  cost_ct->Count("OpGraph8", 1);
  return Maybe<void>::Ok();
}
I20221022 00:52:34.335309 450187 time_util.h:97] [count log]{"loc":"OpGraph0","mem_rss":"4431.000000 MB","mem_vm":"26211.000000 MB","time_cost":"689 milliseconds"}
I20221022 00:52:34.343580 450187 time_util.h:97] [count log]{"loc":"OpGraph1","mem_rss":"4431.000000 MB","mem_vm":"26211.000000 MB","time_cost":"10 milliseconds"}
I20221022 00:52:34.520028 450187 time_util.h:97] [count log]{"loc":"OpGraph2","mem_rss":"4431.000000 MB","mem_vm":"26211.000000 MB","time_cost":"176 milliseconds"}
I20221022 00:52:34.521185 450187 time_util.h:97] [count log]{"loc":"OpGraph3","mem_rss":"4431.000000 MB","mem_vm":"26211.000000 MB","time_cost":"1 milliseconds"}
I20221022 00:52:34.653026 450187 time_util.h:97] [count log]{"loc":"OpGraph4","mem_rss":"4431.000000 MB","mem_vm":"26211.000000 MB","time_cost":"131 milliseconds"}
I20221022 00:52:34.668092 450187 time_util.h:97] [count log]{"loc":"OpGraph5","mem_rss":"4431.000000 MB","mem_vm":"26211.000000 MB","time_cost":"15 milliseconds"}
I20221022 00:52:34.722708 450187 time_util.h:97] [count log]{"loc":"OpGraph6","mem_rss":"4431.000000 MB","mem_vm":"26211.000000 MB","time_cost":"54 milliseconds"}
I20221022 00:52:34.758741 450187 time_util.h:97] [count log]{"loc":"OpGraph7","mem_rss":"4431.000000 MB","mem_vm":"26211.000000 MB","time_cost":"36 milliseconds"}
I20221022 00:52:37.151204 450187 time_util.h:97] [count log]{"loc":"OpGraph8","mem_rss":"4431.000000 MB","mem_vm":"26211.000000 MB","time_cost":"2392 milliseconds"}
I20221022 00:52:37.151316 450187 time_util.h:97] [count log]{"loc":"init op graph","mem_rss":"4431.000000 MB","mem_vm":"26211.000000 MB","time_cost":"3507 milliseconds"}
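
To make the breakdown easier to read, the small sketch below computes each stage's share of the total. The numbers are copied from the log lines above; the helper names (`LoggedStageCosts`, `SumMs`, `SharePercent`, `PrintBreakdown`) are made up for this illustration and are not part of OneFlow. Per the listing, `OpGraph0` is `InitNodes` and `OpGraph8` is `InferLogicalBlobDesc`.

```cpp
#include <cstdio>
#include <string>
#include <utility>
#include <vector>

// Per-stage costs in milliseconds, copied from the [count log] lines above.
using StageCosts = std::vector<std::pair<std::string, int>>;

inline StageCosts LoggedStageCosts() {
  return {{"OpGraph0", 689}, {"OpGraph1", 10},  {"OpGraph2", 176},
          {"OpGraph3", 1},   {"OpGraph4", 131}, {"OpGraph5", 15},
          {"OpGraph6", 54},  {"OpGraph7", 36},  {"OpGraph8", 2392}};
}

// Sum of all per-stage costs.
inline int SumMs(const StageCosts& costs) {
  int total = 0;
  for (const auto& stage : costs) { total += stage.second; }
  return total;
}

// Share of one stage relative to a total, in percent.
inline double SharePercent(int stage_ms, int total_ms) {
  return 100.0 * stage_ms / total_ms;
}

// Prints one line per stage with its share of total_ms.
inline void PrintBreakdown(const StageCosts& costs, int total_ms) {
  for (const auto& stage : costs) {
    std::printf("%-9s %5d ms  %5.1f%%\n", stage.first.c_str(), stage.second,
                SharePercent(stage.second, total_ms));
  }
}
```

Calling `PrintBreakdown(LoggedStageCosts(), 3507)` uses the logged "init op graph" total as the denominator. With these numbers, `InferLogicalBlobDesc` accounts for roughly two thirds of the total and `InitNodes` for roughly one fifth.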

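InferLogicalBlobDesc is the main cost (2392 ms of the 3507 ms total). Internally it traverses the ops and runs inference on each one; the cost is spread evenly across the inference steps, so there is no obvious single point to optimize. The ops also cannot be processed in parallel: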

    JUST(InferLogicalBlobDesc(job))

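Some optimization leads

  • InferLogicalBlobDesc is not needed by every pass
  • Let all passes reuse the same op graph: when a pass modifies the job, it also modifies the op graph, so every pass can keep reusing that op graph
    • OpGraph records only the graph structure (Node and Edge), so creating it is cheap each time
    • Other detailed information stays in the job and is inferred on demand
  • Overall idea
    • After a pass finishes its changes, the job is correct, so in principle the op graph can also be updated to stay correct
    • A change made by one pass should not, in principle, force all information to be re-inferred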
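
The leads above can be sketched as a toy data structure: a graph that stores only nodes and edges, infers details on demand, and drops cached results only for the op a pass touched and the ops downstream of it. Everything here is hypothetical (`LightOpGraph`, `InferredInfo`, `UpdateNode`, the integer `value` standing in for real inferred details such as blob descriptions); it is not OneFlow's `OpGraph` API, only an illustration of the idea.

```cpp
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Hypothetical stand-in for whatever a real pass would infer (e.g. a blob desc).
struct InferredInfo {
  int value = 0;
};

// Hypothetical structure-only graph with a lazily filled inference cache.
class LightOpGraph {
 public:
  void AddNode(const std::string& op_name, int local_value) {
    nodes_[op_name].local_value = local_value;
  }

  void AddEdge(const std::string& producer, const std::string& consumer) {
    nodes_[producer].consumers.push_back(consumer);
    nodes_[consumer].producers.push_back(producer);
    Invalidate(consumer);
  }

  // Called by a pass after it rewrites one op in the job.
  void UpdateNode(const std::string& op_name, int local_value) {
    nodes_.at(op_name).local_value = local_value;
    Invalidate(op_name);
  }

  // Infers on demand: visits only this op and its not-yet-cached producers.
  const InferredInfo& Infer(const std::string& op_name) {
    Node& node = nodes_.at(op_name);
    if (!node.cached) {
      InferredInfo info;
      info.value = node.local_value;
      for (const std::string& p : node.producers) { info.value += Infer(p).value; }
      node.info = info;
      node.cached = true;
      ++infer_count_;
    }
    return node.info;
  }

  // Number of per-op inference steps actually executed so far.
  int infer_count() const { return infer_count_; }

 private:
  struct Node {
    int local_value = 0;
    std::vector<std::string> producers;
    std::vector<std::string> consumers;
    bool cached = false;
    InferredInfo info;
  };

  // Drops cached results for op_name and every op reachable downstream of it.
  void Invalidate(const std::string& op_name) {
    std::vector<std::string> stack{op_name};
    std::unordered_set<std::string> seen;
    while (!stack.empty()) {
      const std::string cur = stack.back();
      stack.pop_back();
      if (!seen.insert(cur).second) { continue; }
      Node& node = nodes_.at(cur);
      node.cached = false;
      for (const std::string& c : node.consumers) { stack.push_back(c); }
    }
  }

  std::unordered_map<std::string, Node> nodes_;
  int infer_count_ = 0;
};
```

For a chain `a -> b -> c` plus an unrelated op `d`, inferring `c` and `d` once runs four inference steps. If a pass then calls `UpdateNode("b", ...)`, a later `Infer("c")` re-runs only `b` and `c`, while the cached results for `a` and `d` are reused. A pass that never asks for inferred details pays nothing beyond building the structure.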

strint, Oct 22 '22 09:10