oneflow
Compile mode rank per process
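Separate compilation across multiple processes.
Core idea: 1) distribute the job rather than the plan; 2) assign a task_id to every task on the master, then distribute these to the workers, which use the task_ids directly when compiling; 3) allocate regst_desc_id/mem_block_id/chunk_id in per-rank segments, so that plans on different ranks can never have conflicting ids.
This multi-process scheme still cannot solve the excessive compile time caused by the large number of comm_net ops under non-uniform partitioning.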
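Points 2 and 3 above can be illustrated with a minimal sketch. This is not OneFlow's implementation: `IDS_PER_RANK`, `RankIdAllocator`, and `assign_task_ids` are hypothetical names chosen for illustration, and the segment width is an arbitrary assumption.

```python
from dataclasses import dataclass
from itertools import count

# Hypothetical segment width: each rank may allocate at most this many ids.
IDS_PER_RANK = 1 << 20


@dataclass
class RankIdAllocator:
    """Hands out ids from the half-open segment owned by a single rank."""

    rank: int
    next_offset: int = 0

    def allocate(self) -> int:
        if self.next_offset >= IDS_PER_RANK:
            raise OverflowError(f"rank {self.rank} exhausted its id segment")
        new_id = self.rank * IDS_PER_RANK + self.next_offset
        self.next_offset += 1
        return new_id


def assign_task_ids(task_names_by_rank):
    """Master side: give every task a globally unique task_id up front.

    Returns {rank: {task_name: task_id}}; each worker would receive its own
    entry and use those ids as-is instead of generating new ones.
    """
    counter = count()
    return {
        rank: {name: next(counter) for name in names}
        for rank, names in sorted(task_names_by_rank.items())
    }


# Master assigns task ids once, before any worker compiles.
task_ids = assign_task_ids({0: ["a", "b"], 1: ["c"]})

# Each worker allocates from its own segment, with no communication needed.
rank0_ids = [RankIdAllocator(0).allocate()]
rank1_ids = [RankIdAllocator(1).allocate()]
```

In such a scheme each worker would hold one allocator per id kind (regst_desc_id, mem_block_id, chunk_id). Because the segments are disjoint, the per-rank plans can be built independently without any id colliding.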
LibAI BERT & GPT correctness verification
oneflow: https://github.com/Oneflow-Inc/oneflow/commit/bc2c2cba7bd831deb999104d6704562309081203
libai: https://github.com/Oneflow-Inc/libai/commit/e9ca4087cb35b3ad268534ee60456db689e36063
oneflow-25 & oneflow-28
- 1n1g
  - LibAI_bert_nl24_nah16_hs1024_FP16_actrue_mp1_pp1_mb32_gb128_1n1g
- 1n4g
  - LibAI_bert_nl24_nah16_hs1024_FP16_actrue_mp2_pp1_mb32_gb256_1n4g
  - LibAI_gpt2_nl24_nah16_hs1024_FP16_actrue_mp2_pp1_mb8_gb64_1n4g

The single-node loss curves essentially coincide, so correctness is confirmed.

- 2n4g
  - LibAI_bert_nl24_nah16_hs1024_FP16_actrue_mp2_pp2_mb64_gb512_2n4g
This run fails with the following error (the stack frames of several processes appear interleaved in the log):

F20221026 03:20:24.009537 321847 task_graph.cpp:1272] Check failed: src->parallel_desc_sym() == dst->parallel_desc_sym()
*** Check failure stack trace: ***
    @ 0x7f0cdb323fba google::LogMessage::Fail()
    @ 0x7f6f21ae7fba google::LogMessage::Fail()
    @ 0x7f0cdb3242a2 google::LogMessage::SendToLog()
    @ 0x7f6fef6ddfba google::LogMessage::Fail()
    @ 0x7f6f21ae82a2 google::LogMessage::SendToLog()
    @ 0x7f0cdb323b27 google::LogMessage::Flush()
    @ 0x7f6fef6de2a2 google::LogMessage::SendToLog()
    @ 0x7f6f21ae7b27 google::LogMessage::Flush()
    @ 0x7f0cdb326699 google::LogMessageFatal::~LogMessageFatal()
    @ 0x7f0cd36ace0d _ZNSt17_Function_handlerIFvPKN7oneflow6OpNodeES3_EZNS0_13RankTaskGraph4InitERKSt13unordered_setISsSt4hashISsESt8equal_toISsESaISsEEbEUlS3_S3_E8_E9_M_invokeERKSt9_Any_dataOS3_SK_
    @ 0x7f6f21aea699 google::LogMessageFatal::~LogMessageFatal()
    @ 0x7f6fef6ddb27 google::LogMessage::Flush()
    @ 0x7f0cd36a4675 _ZZN7oneflow12_GLOBAL__N_131ForEachOpGraphNecessaryCtrlEdgeIXadL_ZNKS_7OpGraph30cached_predicator_is_reachableEvEEEEvPKS2_RKSt8functionIFvPKNS_6OpNodeES8_EEENKUlPS6_E_clESD_
    @ 0x7f6f19e70e0d _ZNSt17_Function_handlerIFvPKN7oneflow6OpNodeES3_EZNS0_13RankTaskGraph4InitERKSt13unordered_setISsSt4hashISsESt8equal_toISsESaISsEEbEUlS3_S3_E8_E9_M_invokeERKSt9_Any_dataOS3_SK_
    @ 0x7f6fef6e0699 google::LogMessageFatal::~LogMessageFatal()
    @ 0x7f6f19e68675 _ZZN7oneflow12_GLOBAL__N_131ForEachOpGraphNecessaryCtrlEdgeIXadL_ZNKS_7OpGraph30cached_predicator_is_reachableEvEEEEvPKS2_RKSt8functionIFvPKNS_6OpNodeES8_EEENKUlPS6_E_clESD_
    @ 0x7f6fe7a66e0d _ZNSt17_Function_handlerIFvPKN7oneflow6OpNodeES3_EZNS0_13RankTaskGraph4InitERKSt13unordered_setISsSt4hashISsESt8equal_toISsESaISsEEbEUlS3_S3_E8_E9_M_invokeERKSt9_Any_dataOS3_SK_
    @ 0x7f0cd36b7f04 oneflow::RankTaskGraph::Init()
    @ 0x7f6fe7a5e675 _ZZN7oneflow12_GLOBAL__N_131ForEachOpGraphNecessaryCtrlEdgeIXadL_ZNKS_7OpGraph30cached_predicator_is_reachableEvEEEEvPKS2_RKSt8functionIFvPKNS_6OpNodeES8_EEENKUlPS6_E_clESD_
Two-node correctness test
Correctness is fine, @lixinqi
Are there any benchmark results comparing compile speed here? @strint @lixinqi
https://github.com/Oneflow-Inc/OneTeam/issues/1679