OpGraph init costs too much
OpGraph init is the main cost of a pass.
See: https://github.com/Oneflow-Inc/libai/issues/407#issuecomment-1286776427
The cost of each internal step is as follows:
Maybe<void> OpGraph::Init(const Job& job) {
  auto cost_ct = std::make_unique<TimeCounter<std::chrono::milliseconds>>(true, true);
  InitNodes(job);
  cost_ct->Count("OpGraph0", 1);
  op_name2op_node_.reserve(job.net().op_size());
  ForEachNode([&](OpNode* node) {
    CHECK(op_name2op_node_.emplace(node->op().op_name(), node).second)
        << "op_name: " << node->op().op_name();
  });
  cost_ct->Count("OpGraph1", 1);
  InitEdges();
  cost_ct->Count("OpGraph2", 1);
  InitProducerOpName2CtrlConsumerOpNames(job);
  cost_ct->Count("OpGraph3", 1);
  CheckIsDAG();
  cost_ct->Count("OpGraph4", 1);
  ForEachNode([](OpNode* node) { node->InitLbi2SourceNode(); });
  cost_ct->Count("OpGraph5", 1);
  InferBlobLastUsed();
  cost_ct->Count("OpGraph6", 1);
  InferTimeShape();
  cost_ct->Count("OpGraph7", 1);
  {
    LazyMode::Guard enable_lazy_mode_guard(true);
    JUST(InferLogicalBlobDesc(job));
  }
  cost_ct->Count("OpGraph8", 1);
  return Maybe<void>::Ok();
}
I20221022 00:52:34.335309 450187 time_util.h:97] [count log]{"loc":"OpGraph0","mem_rss":"4431.000000 MB","mem_vm":"26211.000000 MB","time_cost":"689 milliseconds"}
I20221022 00:52:34.343580 450187 time_util.h:97] [count log]{"loc":"OpGraph1","mem_rss":"4431.000000 MB","mem_vm":"26211.000000 MB","time_cost":"10 milliseconds"}
I20221022 00:52:34.520028 450187 time_util.h:97] [count log]{"loc":"OpGraph2","mem_rss":"4431.000000 MB","mem_vm":"26211.000000 MB","time_cost":"176 milliseconds"}
I20221022 00:52:34.521185 450187 time_util.h:97] [count log]{"loc":"OpGraph3","mem_rss":"4431.000000 MB","mem_vm":"26211.000000 MB","time_cost":"1 milliseconds"}
I20221022 00:52:34.653026 450187 time_util.h:97] [count log]{"loc":"OpGraph4","mem_rss":"4431.000000 MB","mem_vm":"26211.000000 MB","time_cost":"131 milliseconds"}
I20221022 00:52:34.668092 450187 time_util.h:97] [count log]{"loc":"OpGraph5","mem_rss":"4431.000000 MB","mem_vm":"26211.000000 MB","time_cost":"15 milliseconds"}
I20221022 00:52:34.722708 450187 time_util.h:97] [count log]{"loc":"OpGraph6","mem_rss":"4431.000000 MB","mem_vm":"26211.000000 MB","time_cost":"54 milliseconds"}
I20221022 00:52:34.758741 450187 time_util.h:97] [count log]{"loc":"OpGraph7","mem_rss":"4431.000000 MB","mem_vm":"26211.000000 MB","time_cost":"36 milliseconds"}
I20221022 00:52:37.151204 450187 time_util.h:97] [count log]{"loc":"OpGraph8","mem_rss":"4431.000000 MB","mem_vm":"26211.000000 MB","time_cost":"2392 milliseconds"}
I20221022 00:52:37.151316 450187 time_util.h:97] [count log]{"loc":"init op graph","mem_rss":"4431.000000 MB","mem_vm":"26211.000000 MB","time_cost":"3507 milliseconds"}
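To make the breakdown easier to read, the per-step numbers from the log can be turned into shares of the total. The snippet below is only a small arithmetic sketch: the values are copied from the `time_cost` fields above, and the helper names (`LoggedStepCostsMs`, `TotalCostMs`, `SharePercent`, `MostExpensiveStep`) are made up for illustration and are not part of OneFlow.

```cpp
#include <string>
#include <utility>
#include <vector>

using StepCost = std::pair<std::string, int>;  // (log label, milliseconds)

// Per-step costs copied from the "time_cost" fields of the log above.
inline std::vector<StepCost> LoggedStepCostsMs() {
  return {{"OpGraph0", 689}, {"OpGraph1", 10},  {"OpGraph2", 176},
          {"OpGraph3", 1},   {"OpGraph4", 131}, {"OpGraph5", 15},
          {"OpGraph6", 54},  {"OpGraph7", 36},  {"OpGraph8", 2392}};
}

// Sum of all step costs in milliseconds.
inline int TotalCostMs(const std::vector<StepCost>& steps) {
  int total = 0;
  for (const StepCost& step : steps) { total += step.second; }
  return total;
}

// Share of one step as a percentage of the summed step costs.
inline double SharePercent(int step_ms, int total_ms) {
  return 100.0 * step_ms / total_ms;
}

// Label of the most expensive step, or an empty string for an empty list.
inline std::string MostExpensiveStep(const std::vector<StepCost>& steps) {
  std::string label;
  int max_ms = -1;
  for (const StepCost& step : steps) {
    if (step.second > max_ms) {
      max_ms = step.second;
      label = step.first;
    }
  }
  return label;
}
```

The nine steps sum to 3504 ms, slightly below the 3507 ms logged for "init op graph". OpGraph8 (the `InferLogicalBlobDesc` block) accounts for roughly 68% of that sum, and OpGraph0 (`InitNodes`) for roughly 20%.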
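InferLogicalBlobDesc is the main cost: 2392 ms of the 3507 ms total in the log above. Internally it traverses the ops and runs inference for each one. The cost is spread evenly across the individual inference steps, so there is no single good optimization point, and the ops cannot be inferred in parallel either.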
JUST(InferLogicalBlobDesc(job))
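To illustrate what "traverse the ops and infer each one" can look like, here is a toy sketch of inference in topological order. It is not OneFlow's implementation: `ToyNode`, `InferInTopoOrder`, and the `int` that stands in for a blob description are all hypothetical. It also rests on an assumption the issue does not state, namely that a consumer's inference needs the results of its producers, which would be one reason the per-op work cannot simply be run in parallel.

```cpp
#include <cstddef>
#include <functional>
#include <queue>
#include <vector>

// Hypothetical node: the producers it depends on plus one inferred value.
struct ToyNode {
  std::vector<std::size_t> producers;  // indices of nodes this node consumes
  int inferred = 0;                    // stand-in for an inferred blob description
};

// Visits nodes in topological order (Kahn's algorithm) and infers each node
// from the already inferred values of its producers.
// Returns the number of nodes inferred; a result smaller than nodes.size()
// means the graph contains a cycle.
inline std::size_t InferInTopoOrder(
    std::vector<ToyNode>& nodes,
    const std::function<int(const std::vector<int>&)>& infer) {
  const std::size_t n = nodes.size();
  std::vector<std::size_t> pending(n, 0);          // producers not yet inferred
  std::vector<std::vector<std::size_t>> consumers(n);
  for (std::size_t i = 0; i < n; ++i) {
    pending[i] = nodes[i].producers.size();
    for (std::size_t p : nodes[i].producers) { consumers[p].push_back(i); }
  }

  std::queue<std::size_t> ready;
  for (std::size_t i = 0; i < n; ++i) {
    if (pending[i] == 0) { ready.push(i); }
  }

  std::size_t done = 0;
  while (!ready.empty()) {
    const std::size_t cur = ready.front();
    ready.pop();
    std::vector<int> inputs;
    for (std::size_t p : nodes[cur].producers) { inputs.push_back(nodes[p].inferred); }
    nodes[cur].inferred = infer(inputs);  // one inference step per op
    ++done;
    for (std::size_t c : consumers[cur]) {
      if (--pending[c] == 0) { ready.push(c); }
    }
  }
  return done;
}
```

In this toy model every node is inferred exactly once and only after all of its producers, so the total cost grows with the number of ops and no single step stands out.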
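Some optimization leads
- InferLogicalBlobDesc is not needed by every pass
- Let all passes share one op graph: when a pass modifies the job, it also updates the op graph, so every pass can reuse it
- Let OpGraph record only the graph structure (Node and Edge), so that each construction is cheap
- Other detailed information stays in the job and is inferred on demand
- Overall idea
  - After a pass finishes its changes the job is correct, so in principle the op graph can be updated to stay correct as well
  - A change made by one pass should not, in principle, force all information to be re-inferred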
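The leads above can be sketched as a graph that stores only structure, infers values on demand, and drops cached results only downstream of a change. This is a minimal illustration under stated assumptions, not a proposal for OneFlow's actual `OpGraph` API: the class `LightGraph`, its methods, and the `int` used in place of a blob description are all hypothetical, and the graph is assumed to be acyclic.

```cpp
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical structure-only graph with lazily cached inference results.
class LightGraph {
 public:
  using InferFn = std::function<int(const std::vector<int>&)>;

  explicit LightGraph(InferFn infer) : infer_(std::move(infer)) {}

  // Adds a node and returns its id. Cheap: no inference runs here.
  std::size_t AddNode() {
    producers_.emplace_back();
    consumers_.emplace_back();
    value_.push_back(0);
    valid_.push_back(false);
    return producers_.size() - 1;
  }

  // Adds an edge producer -> consumer, as a pass might do, and invalidates
  // only the consumer and the nodes downstream of it.
  void AddEdge(std::size_t producer, std::size_t consumer) {
    producers_[consumer].push_back(producer);
    consumers_[producer].push_back(consumer);
    Invalidate(consumer);
  }

  // Infers a node on demand, reusing cached results of its producers.
  // Recursive, so this sketch is only suitable for shallow acyclic graphs.
  int Get(std::size_t node) {
    if (valid_[node]) { return value_[node]; }
    std::vector<int> inputs;
    for (std::size_t p : producers_[node]) { inputs.push_back(Get(p)); }
    ++infer_calls_;
    value_[node] = infer_(inputs);
    valid_[node] = true;
    return value_[node];
  }

  // Drops the cached result of `node` and of every node reachable from it.
  void Invalidate(std::size_t node) {
    std::vector<bool> seen(valid_.size(), false);
    std::vector<std::size_t> stack{node};
    while (!stack.empty()) {
      const std::size_t cur = stack.back();
      stack.pop_back();
      if (seen[cur]) { continue; }
      seen[cur] = true;
      valid_[cur] = false;
      for (std::size_t c : consumers_[cur]) { stack.push_back(c); }
    }
  }

  // Number of times the inference function has actually been invoked.
  std::size_t infer_calls() const { return infer_calls_; }

 private:
  InferFn infer_;
  std::vector<std::vector<std::size_t>> producers_;
  std::vector<std::vector<std::size_t>> consumers_;
  std::vector<int> value_;
  std::vector<bool> valid_;
  std::size_t infer_calls_ = 0;
};
```

In this sketch, building the graph triggers no inference at all, a pass that never calls `Get` pays nothing for it, and a pass that edits one node re-infers only that node and its consumers on the next `Get`. Whether the real blob-description inference can be split this way depends on details the issue does not cover.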