Compile mode rank per process

Open lixinqi opened this issue 2 years ago • 5 comments

Separate compilation across multiple processes.

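Core ideas: 1) distribute the job rather than the plan; 2) the master assigns a task_id to every task and sends these to the workers, and each worker process uses those task_ids directly when compiling; 3) regst_desc_id/mem_block_id/chunk_id are allocated from per-rank segments, which guarantees that plans on different ranks can never have id conflicts.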
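
The id scheme in points 2) and 3) can be illustrated with a minimal Python sketch. It is not taken from the OneFlow code in this PR: the class `RankIdAllocator`, the function `assign_task_ids`, and the segment width `IDS_PER_RANK` are hypothetical names chosen for illustration. It only shows why handing each rank its own half-open id range makes collisions between ranks impossible without any communication.

```python
# Hypothetical segment width: how many ids a single rank may allocate.
IDS_PER_RANK = 1 << 20


class RankIdAllocator:
    """Hands out ids from the half-open range owned by one rank.

    Rank r owns [r * ids_per_rank, (r + 1) * ids_per_rank), so two ranks
    can allocate independently and never produce the same id.
    """

    def __init__(self, rank, ids_per_rank=IDS_PER_RANK):
        self.rank = rank
        self.limit = (rank + 1) * ids_per_rank  # exclusive upper bound
        self.next_id = rank * ids_per_rank      # first id of the segment

    def allocate(self):
        if self.next_id >= self.limit:
            raise OverflowError(f"rank {self.rank} exhausted its id segment")
        new_id = self.next_id
        self.next_id += 1
        return new_id


def assign_task_ids(task_names):
    """Master side: fix a unique task_id for every task before distribution."""
    return {name: task_id for task_id, name in enumerate(task_names)}


if __name__ == "__main__":
    # The master decides task ids once; workers would receive this mapping.
    print(assign_task_ids(["task_a", "task_b", "task_c"]))

    # Each rank allocates e.g. regst_desc ids locally from its own segment.
    rank0 = RankIdAllocator(rank=0, ids_per_rank=4)
    rank1 = RankIdAllocator(rank=1, ids_per_rank=4)
    print([rank0.allocate() for _ in range(3)])
    print([rank1.allocate() for _ in range(3)])
```

The trade-off of a fixed segment width is that a rank fails loudly once it exhausts its range, so the width has to be chosen generously relative to the largest expected plan.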

lixinqi avatar Oct 12 '22 16:10 lixinqi

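This multi-process scheme still cannot handle the excessive compile time caused by the large number of comm_net ops that arise under non-uniform partitioning.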

lixinqi avatar Oct 12 '22 17:10 lixinqi

LibAI BERT & GPT correctness verification

https://github.com/Oneflow-Inc/oneflow/commit/bc2c2cba7bd831deb999104d6704562309081203 https://github.com/Oneflow-Inc/libai/commit/e9ca4087cb35b3ad268534ee60456db689e36063 oneflow-25 & oneflow-28

  • 1n1g

    • LibAI_bert_nl24_nah16_hs1024_FP16_actrue_mp1_pp1_mb32_gb128_1n1g bert_1n1g_img
  • 1n4g

    • LibAI_bert_nl24_nah16_hs1024_FP16_actrue_mp2_pp1_mb32_gb256_1n4g bert_1n4g_img
    • LibAI_gpt2_nl24_nah16_hs1024_FP16_actrue_mp2_pp1_mb8_gb64_1n4g gpt_1n4g_img

    On a single node the loss curves essentially coincide, so correctness is confirmed.

  • 2n4g LibAI_bert_nl24_nah16_hs1024_FP16_actrue_mp2_pp2_mb64_gb512_2n4g fails with the following error:

    F20221026 03:20:24.009537 321847 task_graph.cpp:1272] Check failed: src->parallel_desc_sym() == dst->parallel_desc_sym()
    *** Check failure stack trace: ***
        @     0x7f0cdb323fba  google::LogMessage::Fail()
        @     0x7f6f21ae7fba  google::LogMessage::Fail()
        @     0x7f0cdb3242a2  google::LogMessage::SendToLog()
        @     0x7f6fef6ddfba  google::LogMessage::Fail()
        @     0x7f6f21ae82a2  google::LogMessage::SendToLog()
        @     0x7f0cdb323b27  google::LogMessage::Flush()
        @     0x7f6fef6de2a2  google::LogMessage::SendToLog()
        @     0x7f6f21ae7b27  google::LogMessage::Flush()
        @     0x7f0cdb326699  google::LogMessageFatal::~LogMessageFatal()
        @     0x7f0cd36ace0d  _ZNSt17_Function_handlerIFvPKN7oneflow6OpNodeES3_EZNS0_13RankTaskGraph4InitERKSt13unordered_setISsSt4hashISsESt8equal_toISsESaISsEEbEUlS3_S3_E8_E9_M_invokeERKSt9_Any_dataOS3_SK_
        @     0x7f6f21aea699  google::LogMessageFatal::~LogMessageFatal()
        @     0x7f6fef6ddb27  google::LogMessage::Flush()
        @     0x7f0cd36a4675  _ZZN7oneflow12_GLOBAL__N_131ForEachOpGraphNecessaryCtrlEdgeIXadL_ZNKS_7OpGraph30cached_predicator_is_reachableEvEEEEvPKS2_RKSt8functionIFvPKNS_6OpNodeES8_EEENKUlPS6_E_clESD_
        @     0x7f6f19e70e0d  _ZNSt17_Function_handlerIFvPKN7oneflow6OpNodeES3_EZNS0_13RankTaskGraph4InitERKSt13unordered_setISsSt4hashISsESt8equal_toISsESaISsEEbEUlS3_S3_E8_E9_M_invokeERKSt9_Any_dataOS3_SK_
        @     0x7f6fef6e0699  google::LogMessageFatal::~LogMessageFatal()
        @     0x7f6f19e68675  _ZZN7oneflow12_GLOBAL__N_131ForEachOpGraphNecessaryCtrlEdgeIXadL_ZNKS_7OpGraph30cached_predicator_is_reachableEvEEEEvPKS2_RKSt8functionIFvPKNS_6OpNodeES8_EEENKUlPS6_E_clESD_
        @     0x7f6fe7a66e0d  _ZNSt17_Function_handlerIFvPKN7oneflow6OpNodeES3_EZNS0_13RankTaskGraph4InitERKSt13unordered_setISsSt4hashISsESt8equal_toISsESaISsEEbEUlS3_S3_E8_E9_M_invokeERKSt9_Any_dataOS3_SK_
        @     0x7f0cd36b7f04  oneflow::RankTaskGraph::Init()
        @     0x7f6fe7a5e675  _ZZN7oneflow12_GLOBAL__N_131ForEachOpGraphNecessaryCtrlEdgeIXadL_ZNKS_7OpGraph30cached_predicator_is_reachableEvEEEEvPKS2_RKSt8functionIFvPKNS_6OpNodeES8_EEENKUlPS6_E_clESD_
    

    Console log

xyn1201 avatar Oct 26 '22 03:10 xyn1201

Two-node correctness test

loss_curve

Correctness is fine, @lixinqi

ouyangyu avatar Oct 29 '22 03:10 ouyangyu

Are there benchmark comparison results for the compile speed here? @strint @lixinqi

chengtbf avatar Nov 17 '22 14:11 chengtbf

Are there benchmark comparison results for the compile speed here? @strint @lixinqi

https://github.com/Oneflow-Inc/OneTeam/issues/1679

strint avatar Nov 17 '22 16:11 strint