Regression test of the OneFlow ZeRO fix in libai

Open strint opened this issue 3 years ago • 8 comments

PR: https://github.com/Oneflow-Inc/oneflow/pull/7557

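The approach for this issue has been settled. Please try it out with this branch (it has already passed verification on my side).

Verification points:

  1. ZeRO runs, which is what this issue is about; @CPFLAME
  2. libai hybrid parallelism performs normally (this change touches a basic SBP inference restriction, so I want to confirm it has no negative impact on performance); @L1aoXingyu

Once verification passes, this PR will be merged.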

strint • Mar 01 '22 07:03

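This branch was tested against libai's master.

Configurations that currently run:

  • data parallel=2 + model parallel=2
  • data parallel=2 + model parallel=2 + pipeline parallel=2
  • data parallel=2 + pipeline parallel=2 + ZeRO
  • data parallel=4 + ZeRO
  • model parallel=2 + ZeRO

2D SBP + ZeRO fails, while 1D SBP + ZeRO seems to run in every case.

Configuration that currently fails:

  • data parallel=2 + model parallel=2 + ZeRO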
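
The pattern in the two lists above can be written down as a small sketch. It assumes, for illustration only, that a setup uses 2D SBP when both the data-parallel and model-parallel sizes are greater than 1 and 1D SBP otherwise. `ParallelConfig` and `matches_failure_pattern` are hypothetical names, not libai or OneFlow APIs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParallelConfig:
    """Hypothetical record of one tested setup (not a libai class)."""
    data: int = 1
    model: int = 1
    pipeline: int = 1
    zero: bool = False

    @property
    def sbp_ndim(self) -> int:
        # Assumption: data x model parallelism forms a 2D device mesh,
        # so SBP is 2D only when both sizes are greater than 1.
        return 2 if self.data > 1 and self.model > 1 else 1


def matches_failure_pattern(cfg: ParallelConfig) -> bool:
    """True for the combination reported as failing: 2D SBP + ZeRO."""
    return cfg.zero and cfg.sbp_ndim == 2


# Configurations reported in this comment, with the observed outcome.
reported = [
    (ParallelConfig(data=2, model=2), "ok"),
    (ParallelConfig(data=2, model=2, pipeline=2), "ok"),
    (ParallelConfig(data=2, pipeline=2, zero=True), "ok"),
    (ParallelConfig(data=4, zero=True), "ok"),
    (ParallelConfig(model=2, zero=True), "ok"),
    (ParallelConfig(data=2, model=2, zero=True), "error"),
]

for cfg, observed in reported:
    predicted = "error" if matches_failure_pattern(cfg) else "ok"
    print(cfg, "observed:", observed, "pattern:", predicted)
```

Under that assumption, the only reported setup that matches the pattern is the one that fails, which is consistent with the "2D SBP + ZeRO" observation.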

Command to reproduce the error:

sh tools/train.sh configs/t5_pp_pretrain.py 4 train.dist.tensor_parallel_size=2 train.dist.pipeline_parallel_size=1 train.dist.data_parallel_size=2 train.zero_optimization.enabled=True train.zero_optimization.stage=3 train.log_period=1 train.train_micro_batch_size=32
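
As a quick sanity check on the command above, the sketch below parses its `key=value` overrides and confirms that the three parallel sizes multiply to the GPU count. It assumes that the `4` after the config path is the number of GPUs; `parse_overrides` is a hypothetical helper written for this note, not part of libai.

```python
import shlex
from math import prod

command = (
    "sh tools/train.sh configs/t5_pp_pretrain.py 4 "
    "train.dist.tensor_parallel_size=2 "
    "train.dist.pipeline_parallel_size=1 "
    "train.dist.data_parallel_size=2 "
    "train.zero_optimization.enabled=True "
    "train.zero_optimization.stage=3 "
    "train.log_period=1 "
    "train.train_micro_batch_size=32"
)


def parse_overrides(cmd: str) -> dict:
    """Collect the key=value tokens of a command line into a dict of strings."""
    return dict(tok.split("=", 1) for tok in shlex.split(cmd) if "=" in tok)


overrides = parse_overrides(command)

# Assumption: the fourth token of the command is the GPU count.
num_gpus = int(shlex.split(command)[3])

sizes = {
    name: int(overrides[f"train.dist.{name}_parallel_size"])
    for name in ("data", "tensor", "pipeline")
}

# The parallel sizes should account for every GPU exactly once.
assert prod(sizes.values()) == num_gpus
print(sizes, "->", num_gpus, "GPUs")
```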

Error message:

F20220301 16:50:26.966737 38104 op_graph.cpp:36] 
  File "/home/chengpeng/data/oneflow/oneflow/core/graph/op_graph.cpp", line 36, in SbpParallel4BnInOp
    op().SbpParallel4BnInOp(bn_in_op)
  File "/home/chengpeng/data/oneflow/oneflow/core/operator/operator.cpp", line 938, in SbpParallel4BnInOp
    Check failed: sbp_signature_ sbp signature not infered
*** Check failure stack trace: ***
    @     0x7f10f5ab364d  google::LogMessage::Fail()
    @     0x7f10f5ab584c  google::LogMessage::SendToLog()
    @     0x7f10f5ab30ea  google::LogMessage::Flush()
    @     0x7f10f5ab6229  google::LogMessageFatal::~LogMessageFatal()
    @     0x7f10fac89e67  oneflow::OpNode::SbpParallel4BnInOp()
    @     0x7f10faedabe6  _ZNSt17_Function_handlerIFvPN7oneflow6OpNodeEEZNS0_12_GLOBAL__N_131ForEachDataParallelNodeSequenceERKNS0_7OpGraphERKSt8functionIFbPKS1_EES8_IFvOSt10shared_ptrIKNS4_24DataParallelNodeSequenceEEEEEUlSA_E_E9_M_invokeERKSt9_Any_dataOS2_
    @     0x7f10fac963b4  oneflow::Graph<>::ForEachNode()
    @     0x7f10faedbc1f  oneflow::(anonymous namespace)::ForEachParallelSortedNodeSequence()
    @     0x7f10faedea7c  oneflow::(anonymous namespace)::OptimizerPlacementOptimizationPass::Apply()
    @     0x7f10fad7ad63  _ZZN7oneflow23LazyJobBuildAndInferCtx8CompleteEvENKUlRKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEiE1_clES8_i
    @     0x7f10fad7c680  oneflow::LazyJobBuildAndInferCtx::Complete()
    @     0x7f111bac78f3  CurJobBuildAndInferCtx_Complete()
F20220301 16:50:27.006513 38101 op_graph.cpp:36] 
  File "/home/chengpeng/data/oneflow/oneflow/core/graph/op_graph.cpp", line 36, in SbpParallel4BnInOp
    op().SbpParallel4BnInOp(bn_in_op)
  File "/home/chengpeng/data/oneflow/oneflow/core/operator/operator.cpp", line 938, in SbpParallel4BnInOp
    Check failed: sbp_signature_ sbp signature not infered
    @     0x7f111b9a40a3  _ZZN8pybind1112cpp_function10initializeIRPFvvEvJEJNS_4nameENS_5scopeENS_7siblingENS_10call_guardIJNS_18gil_scoped_releaseEEEEEEEvOT_PFT0_DpT1_EDpRKT2_ENUlRNS_6detail13function_callEE1_4_FUNESO_
*** Check failure stack trace: ***
    @     0x7f111b34df97  pybind11::cpp_function::dispatcher()
    @     0x7f3f1f67064d  google::LogMessage::Fail()
    @     0x557f26f6d8b4  _PyMethodDef_RawFastCallKeywords
    @     0x557f26f6d9d1  _PyCFunction_FastCallKeywords
    @     0x557f26fd9e5a  _PyEval_EvalFrameDefault
    @     0x557f26f1cd09  _PyEval_EvalCodeWithName
    @     0x557f26f1e01f  _PyFunction_FastCallDict
    @     0x557f26f3c7a3  _PyObject_Call_Prepend
    @     0x557f26f2f3ae  PyObject_Call
    @     0x7f3f1f67284c  google::LogMessage::SendToLog()
    @     0x557f26fd6e97  _PyEval_EvalFrameDefault
    @     0x557f26f1cd09  _PyEval_EvalCodeWithName
    @     0x557f26f1e01f  _PyFunction_FastCallDict
    @     0x557f26f3c7a3  _PyObject_Call_Prepend
    @     0x557f26f2f3ae  PyObject_Call
    @     0x557f26fd6e97  _PyEval_EvalFrameDefault
    @     0x557f26f1cd09  _PyEval_EvalCodeWithName
    @     0x7f3f1f6700ea  google::LogMessage::Flush()
    @     0x557f26f1e01f  _PyFunction_FastCallDict
    @     0x557f26f3c7a3  _PyObject_Call_Prepend
    @     0x557f26f73dea  slot_tp_call
    @     0x557f26f2f3ae  PyObject_Call
    @     0x7f3f1f673229  google::LogMessageFatal::~LogMessageFatal()
    @     0x7f3f24846e67  oneflow::OpNode::SbpParallel4BnInOp()
    @     0x7f3f24a97be6  _ZNSt17_Function_handlerIFvPN7oneflow6OpNodeEEZNS0_12_GLOBAL__N_131ForEachDataParallelNodeSequenceERKNS0_7OpGraphERKSt8functionIFbPKS1_EES8_IFvOSt10shared_ptrIKNS4_24DataParallelNodeSequenceEEEEEUlSA_E_E9_M_invokeERKSt9_Any_dataOS2_
F20220301 16:50:27.025559 38102 op_graph.cpp:36] 
  File "/home/chengpeng/data/oneflow/oneflow/core/graph/op_graph.cpp", line 36, in SbpParallel4BnInOp
    op().SbpParallel4BnInOp(bn_in_op)
  File "/home/chengpeng/data/oneflow/oneflow/core/operator/operator.cpp", line 938, in SbpParallel4BnInOp
    Check failed: sbp_signature_ sbp signature not infered
*** Check failure stack trace: ***
    @     0x7fdca394f64d  google::LogMessage::Fail()
    @     0x7f3f248533b4  oneflow::Graph<>::ForEachNode()
    @     0x7fdca395184c  google::LogMessage::SendToLog()
    @     0x7f3f24a98c1f  oneflow::(anonymous namespace)::ForEachParallelSortedNodeSequence()
    @     0x7fdca394f0ea  google::LogMessage::Flush()
    @     0x7f3f24a9ba7c  oneflow::(anonymous namespace)::OptimizerPlacementOptimizationPass::Apply()
    @     0x7fdca3952229  google::LogMessageFatal::~LogMessageFatal()
    @     0x7f3f24937d63  _ZZN7oneflow23LazyJobBuildAndInferCtx8CompleteEvENKUlRKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEiE1_clES8_i
    @     0x7f3f24939680  oneflow::LazyJobBuildAndInferCtx::Complete()
    @     0x7fdca8b25e67  oneflow::OpNode::SbpParallel4BnInOp()
    @     0x7fdca8d76be6  _ZNSt17_Function_handlerIFvPN7oneflow6OpNodeEEZNS0_12_GLOBAL__N_131ForEachDataParallelNodeSequenceERKNS0_7OpGraphERKSt8functionIFbPKS1_EES8_IFvOSt10shared_ptrIKNS4_24DataParallelNodeSequenceEEEEEUlSA_E_E9_M_invokeERKSt9_Any_dataOS2_
    @     0x7f3f456848f3  CurJobBuildAndInferCtx_Complete()
    @     0x7f3f455610a3  _ZZN8pybind1112cpp_function10initializeIRPFvvEvJEJNS_4nameENS_5scopeENS_7siblingENS_10call_guardIJNS_18gil_scoped_releaseEEEEEEEvOT_PFT0_DpT1_EDpRKT2_ENUlRNS_6detail13function_callEE1_4_FUNESO_
    @     0x7fdca8b323b4  oneflow::Graph<>::ForEachNode()
    @     0x7f3f44f0af97  pybind11::cpp_function::dispatcher()
    @     0x55c35639f8b4  _PyMethodDef_RawFastCallKeywords
    @     0x55c35639f9d1  _PyCFunction_FastCallKeywords
    @     0x55c35640be5a  _PyEval_EvalFrameDefault
    @     0x55c35634ed09  _PyEval_EvalCodeWithName
    @     0x55c35635001f  _PyFunction_FastCallDict
    @     0x55c35636e7a3  _PyObject_Call_Prepend
    @     0x7fdca8d77c1f  oneflow::(anonymous namespace)::ForEachParallelSortedNodeSequence()
    @     0x55c3563613ae  PyObject_Call
    @     0x55c356408e97  _PyEval_EvalFrameDefault
    @     0x55c35634ed09  _PyEval_EvalCodeWithName
    @     0x55c35635001f  _PyFunction_FastCallDict
    @     0x55c35636e7a3  _PyObject_Call_Prepend
    @     0x55c3563613ae  PyObject_Call
    @     0x55c356408e97  _PyEval_EvalFrameDefault
    @     0x55c35634ed09  _PyEval_EvalCodeWithName
    @     0x7fdca8d7aa7c  oneflow::(anonymous namespace)::OptimizerPlacementOptimizationPass::Apply()
    @     0x55c35635001f  _PyFunction_FastCallDict
    @     0x55c35636e7a3  _PyObject_Call_Prepend
    @     0x55c3563a5dea  slot_tp_call
    @     0x55c3563613ae  PyObject_Call
    @     0x7fdca8c16d63  _ZZN7oneflow23LazyJobBuildAndInferCtx8CompleteEvENKUlRKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEiE1_clES8_i
    @     0x7fdca8c18680  oneflow::LazyJobBuildAndInferCtx::Complete()
    @     0x7fdcc99638f3  CurJobBuildAndInferCtx_Complete()
    @     0x7fdcc98400a3  _ZZN8pybind1112cpp_function10initializeIRPFvvEvJEJNS_4nameENS_5scopeENS_7siblingENS_10call_guardIJNS_18gil_scoped_releaseEEEEEEEvOT_PFT0_DpT1_EDpRKT2_ENUlRNS_6detail13function_callEE1_4_FUNESO_
    @     0x7fdcc91e9f97  pybind11::cpp_function::dispatcher()
    @     0x55d153a648b4  _PyMethodDef_RawFastCallKeywords
    @     0x55d153a649d1  _PyCFunction_FastCallKeywords
    @     0x55d153ad0e5a  _PyEval_EvalFrameDefault
    @     0x55d153a13d09  _PyEval_EvalCodeWithName
    @     0x55d153a1501f  _PyFunction_FastCallDict
    @     0x55d153a337a3  _PyObject_Call_Prepend
    @     0x55d153a263ae  PyObject_Call
    @     0x55d153acde97  _PyEval_EvalFrameDefault
    @     0x55d153a13d09  _PyEval_EvalCodeWithName
    @     0x55d153a1501f  _PyFunction_FastCallDict
    @     0x55d153a337a3  _PyObject_Call_Prepend
    @     0x55d153a263ae  PyObject_Call
    @     0x55d153acde97  _PyEval_EvalFrameDefault
    @     0x55d153a13d09  _PyEval_EvalCodeWithName
    @     0x55d153a1501f  _PyFunction_FastCallDict
    @     0x55d153a337a3  _PyObject_Call_Prepend
    @     0x55d153a6adea  slot_tp_call
    @     0x55d153a263ae  PyObject_Call
F20220301 16:50:27.295626 38103 op_graph.cpp:36] 
  File "/home/chengpeng/data/oneflow/oneflow/core/graph/op_graph.cpp", line 36, in SbpParallel4BnInOp
    op().SbpParallel4BnInOp(bn_in_op)
  File "/home/chengpeng/data/oneflow/oneflow/core/operator/operator.cpp", line 938, in SbpParallel4BnInOp
    Check failed: sbp_signature_ sbp signature not infered
*** Check failure stack trace: ***
    @     0x7f562fe9564d  google::LogMessage::Fail()
    @     0x7f562fe9784c  google::LogMessage::SendToLog()
    @     0x7f562fe950ea  google::LogMessage::Flush()
    @     0x7f562fe98229  google::LogMessageFatal::~LogMessageFatal()
    @     0x7f563506be67  oneflow::OpNode::SbpParallel4BnInOp()
    @     0x7f56352bcbe6  _ZNSt17_Function_handlerIFvPN7oneflow6OpNodeEEZNS0_12_GLOBAL__N_131ForEachDataParallelNodeSequenceERKNS0_7OpGraphERKSt8functionIFbPKS1_EES8_IFvOSt10shared_ptrIKNS4_24DataParallelNodeSequenceEEEEEUlSA_E_E9_M_invokeERKSt9_Any_dataOS2_
    @     0x7f56350783b4  oneflow::Graph<>::ForEachNode()
    @     0x7f56352bdc1f  oneflow::(anonymous namespace)::ForEachParallelSortedNodeSequence()
    @     0x7f56352c0a7c  oneflow::(anonymous namespace)::OptimizerPlacementOptimizationPass::Apply()
    @     0x7f563515cd63  _ZZN7oneflow23LazyJobBuildAndInferCtx8CompleteEvENKUlRKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEiE1_clES8_i
    @     0x7f563515e680  oneflow::LazyJobBuildAndInferCtx::Complete()
    @     0x7f5655ea98f3  CurJobBuildAndInferCtx_Complete()
    @     0x7f5655d860a3  _ZZN8pybind1112cpp_function10initializeIRPFvvEvJEJNS_4nameENS_5scopeENS_7siblingENS_10call_guardIJNS_18gil_scoped_releaseEEEEEEEvOT_PFT0_DpT1_EDpRKT2_ENUlRNS_6detail13function_callEE1_4_FUNESO_
    @     0x7f565572ff97  pybind11::cpp_function::dispatcher()
    @     0x561046d088b4  _PyMethodDef_RawFastCallKeywords
    @     0x561046d089d1  _PyCFunction_FastCallKeywords
    @     0x561046d74e5a  _PyEval_EvalFrameDefault
    @     0x561046cb7d09  _PyEval_EvalCodeWithName
    @     0x561046cb901f  _PyFunction_FastCallDict
    @     0x561046cd77a3  _PyObject_Call_Prepend
    @     0x561046cca3ae  PyObject_Call
    @     0x561046d71e97  _PyEval_EvalFrameDefault
    @     0x561046cb7d09  _PyEval_EvalCodeWithName
    @     0x561046cb901f  _PyFunction_FastCallDict
    @     0x561046cd77a3  _PyObject_Call_Prepend
    @     0x561046cca3ae  PyObject_Call
    @     0x561046d71e97  _PyEval_EvalFrameDefault
    @     0x561046cb7d09  _PyEval_EvalCodeWithName
    @     0x561046cb901f  _PyFunction_FastCallDict
    @     0x561046cd77a3  _PyObject_Call_Prepend
    @     0x561046d0edea  slot_tp_call
    @     0x561046cca3ae  PyObject_Call
Killing subprocess 38101
Killing subprocess 38102
Killing subprocess 38103
Killing subprocess 38104
Traceback (most recent call last):
  File "/home/chengpeng/data/anaconda3/envs/oneflow-dev-gcc7-v2/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/chengpeng/data/anaconda3/envs/oneflow-dev-gcc7-v2/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/chengpeng/data/oneflow/python/oneflow/distributed/launch.py", line 223, in <module>
    main()
  File "/home/chengpeng/data/oneflow/python/oneflow/distributed/launch.py", line 211, in main
    sigkill_handler(signal.SIGTERM, None)
  File "/home/chengpeng/data/oneflow/python/oneflow/distributed/launch.py", line 180, in sigkill_handler
    returncode=last_return_code, cmd=cmd
subprocess.CalledProcessError: Command '['/home/chengpeng/data/anaconda3/envs/oneflow-dev-gcc7-v2/bin/python3', '-u', 'tools/train_net.py', '--config-file', 'configs/t5_pp_pretrain.py', 'train.dist.tensor_parallel_size=2', 'train.dist.pipeline_parallel_size=1', 'train.dist.data_parallel_size=2', 'train.zero_optimization.enabled=True', 'train.zero_optimization.stage=3', 'train.log_period=1', 'train.train_micro_batch_size=32']' died with <Signals.SIGABRT: 6>.

CPFLAME • Mar 01 '22 08:03

t_return_code, cmd=cmd
subprocess.CalledProcessError: Command '['/home/chengpeng/data/anaconda3/envs/oneflow-dev-gcc7-v2/bin/python3', '-u', 'tools/train_net.py', '--config-file', 'configs/t5_pp_pretrain.py', 'train.dist.tensor_parallel_size=8', 'train.dist.pipeline_parallel_size=1', 'train.dist.data_parallel_size=1', 'train.zero_optimization.enabled=True', 'train.zero_optimization.stage=3', 'train.log_period=1']' died with <Signals.SIGABRT: 6>.
F20220301 16:13:39.021886 52842 ctrl_client.cpp:54] Check failed: rpc_client_.GetStubAt(i)->CallMethodCtrlMethod::kLoadServer( &client_ctx, request, &response).error_code() == grpc::StatusCode::OK (14 vs. 0) Machine 0 lost
*** Check failure stack trace: ***
F20220301 16:13:39.023948 52841 ctrl_client.cpp:54] Check failed: rpc_client_.GetStubAt(i)->CallMethodCtrlMethod::kLoadServer( &client_ctx, request, &response).error_code() == grpc::StatusCode::OK (14 vs. 0) Machine 0 lost
*** Check failure stack trace: ***
    @     0x7f31bb78664d  google::LogMessage::Fail()
    @     0x7fba16dcd64d  google::LogMessage::Fail()
    @     0x7f31bb78884c  google::LogMessage::SendToLog()
    @     0x7fba16dcf84c  google::LogMessage::SendToLog()
    @     0x7f31bb7860ea  google::LogMessage::Flush()
    @     0x7fba16dcd0ea  google::LogMessage::Flush()
    @     0x7f31bb789229  google::LogMessageFatal::~LogMessageFatal()
    @     0x7fba16dd0229  google::LogMessageFatal::~LogMessageFatal()
    @     0x7f31bf8fb462  _ZZN7oneflow14GrpcCtrlClientC4ERKNS_10ProcessCtxEENKUlvE_clEv
    @     0x7fba1af42462  _ZZN7oneflow14GrpcCtrlClientC4ERKNS_10ProcessCtxEENKUlvE_clEv
    @     0x7f31bb6ed447  execute_native_thread_routine
    @     0x7fba16d34447  execute_native_thread_routine
    @     0x7f31e91ddea5  start_thread
    @     0x7fba44824ea5  start_thread
    @     0x7f31e8f068dd  __clone
    @     0x7fba4454d8dd  clone
F20220301 16:13:39.281342 51725 rpc_client.cpp:40] Check failed: stub->CallMethod<ctrl_method>(&client_ctx, request, &response).error_code() == grpc::StatusCode::OK (14 vs. 0)

This looks like the previous connection has not been released yet. What if you wait a while and run it again?

ouyangyu • Mar 01 '22 08:03

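This looks like the previous connection has not been released yet. What if you wait a while and run it again?

Same result, the error is still there. Could the multi-GPU failure be what produces this output?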

CPFLAME • Mar 01 '22 08:03

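This looks like the previous connection has not been released yet. What if you wait a while and run it again?

Same result, the error is still there. Could the multi-GPU failure be what produces this output?

I misread it. The actual error should be this one: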

F20220301 08:31:15.972954 77402 exec_graph.cpp:117]
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/graph/exec_graph.cpp", line 117, in InferBlobDescs
    op_->InferBlobDescsIf(GetBlobDesc4BnInOp, parallel_ctx, &GlobalJobDesc())
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/operator/operator.cpp", line 324, in InferBlobDescsIf
    InferOutBlobDescsIf(GetBlobDesc4BnInOp, parallel_ctx)
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/operator/user_op.cpp", line 599, in InferOutBlobDescs
    val_->physical_tensor_desc_infer_fn(&infer_ctx)
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/user/ops/reshape_op.cpp", line 101, in InferPhysicalTensorDesc
    Check failed: (out_shape->elem_cnt()) == (in_shape.elem_cnt()) (1572864 vs 2359296)  Reshape infer ERROR! in op_name: model.t5_model.encoder.layers.0.self_attention-reshape-25 input shape is : (32,512,144) , output shape is : (32,512,1,96) , output logical shape is (32,512,12,96) , And reshape shape conf is : (32,512,12,96) op_loc: Python stack[-2]: <frame at 0x56054967e450, file '/home/ylkj/miniconda3/lib/python3.7/site-packages/oneflow/framework/tensor.py', line 917, code _view>; Python stack[-1]: <frame at 0x56054966bc20, file '/home/ylkj/miniconda3/lib/python3.7/site-packages/oneflow/nn/modules/reshape.py', line 68, code view_op>; C API: <func reshape>
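
The numbers in this check can be reproduced by hand. The sketch below only redoes the arithmetic from the message. The reading that 12 attention heads are being split across 8 ranks is an assumption: it is suggested by 1152 / 144 = 8 and by `train.dist.tensor_parallel_size=8` in the quoted command, and it is one possible explanation, not a confirmed diagnosis.

```python
from math import prod

# Shapes copied from the error message.
in_shape = (32, 512, 144)               # physical input shape
out_shape = (32, 512, 1, 96)            # physical output shape
logical_out_shape = (32, 512, 12, 96)   # logical output shape

# The two element counts compared by the failing check.
in_count = prod(in_shape)
out_count = prod(out_shape)
print("input elements:", in_count, "output elements:", out_count)

# Logical hidden size is heads * head_dim; dividing it by the physical
# last dimension of the input gives the apparent number of shards.
heads, head_dim = logical_out_shape[2], logical_out_shape[3]
hidden = heads * head_dim
ways = hidden // in_shape[-1]
print("hidden size:", hidden, "apparent split:", ways)

# Assumption: the head axis is split across the same number of ranks.
# If the head count is not a multiple of that number, a rank holding a
# single head (96 values) cannot match an evenly split input (144 values).
quotient, remainder = divmod(heads, ways)
print("heads per rank:", quotient, "remainder:", remainder)
```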

ouyangyu • Mar 01 '22 08:03

Shouldn't ZeRO be paired with data parallelism? Perhaps it is not meant to be combined with model parallelism in the first place.

L1aoXingyu • Mar 01 '22 08:03

Hold on, I will update the experiment configurations and run a more thorough set of experiments.

CPFLAME • Mar 01 '22 08:03

I edited the comment directly. See the latest information here: https://github.com/Oneflow-Inc/libai/issues/150#issuecomment-1055148111

CPFLAME • Mar 01 '22 08:03

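I tested how this branch behaves on T5 under various configurations. Enabling checkpointing + pure data parallelism should be the best overall choice (experiments 3 and 6).

Experiment data on T5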

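| # | Checkpointing on, batch_size=32 | Throughput | GPU 0 memory | GPU 1 memory | GPU 2 memory | GPU 3 memory | GPU 4 memory | GPU 5 memory | GPU 6 memory | GPU 7 memory |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | data parallel 2 + model parallel 2 + pipeline parallel 2 | 36.38 | 4509MiB | 4607MiB | 4547MiB | 4615MiB | 4477MiB | 4495MiB | 4489MiB | 4503MiB |
| 2 | data parallel 4 + model parallel 2 | 74.03 | 6093MiB | 6191MiB | 6141MiB | 6207MiB | 6141MiB | 6205MiB | 6141MiB | 6207MiB |
| 3 | data parallel 8 | 383.10 | 4103MiB | 4135MiB | 4087MiB | 4149MiB | 4135MiB | 4185MiB | 4165MiB | 4127MiB |
| 4 | data parallel 8 + zero_stage=2 | 342.61 | 3717MiB | 3799MiB | 3995MiB | 3909MiB | 3763MiB | 3771MiB | 3731MiB | 3747MiB |
| 5 | data parallel 8 + zero_stage=2 + batch_size=16 | 289.02 | 2765MiB | 2875MiB | 2793MiB | 2779MiB | 2793MiB | 2765MiB | 2775MiB | 2801MiB |
| | Checkpointing off, batch_size=16 | | | | | | | | | |
| 6 | data parallel 8 | 447.14 | 8217MiB | 8383MiB | 8217MiB | 8249MiB | 8383MiB | 8383MiB | 8249MiB | 8249MiB |
| 7 | data parallel 4 + model parallel 2 | 86.57 | 6761MiB | 6825MiB | 6809MiB | 6841MiB | 6809MiB | 6829MiB | 6809MiB | 6841MiB |
| 8 | data parallel 2 + model parallel 2 + pipeline parallel 2 | 46.33 | 4537MiB | 4601MiB | 4615MiB | 4709MiB | 3559MiB | 3579MiB | 3579MiB | 3579MiB |
| 9 | data parallel 8 + zero_stage=2 | 344.63 | 7873MiB | 7905MiB | 7905MiB | 7905MiB | 7905MiB | 8041MiB | 7905MiB | 7905MiB |
| 10 | data parallel 8 + zero_stage=3 | 332.75 | 7999MiB | 8031MiB | 8167MiB | 8031MiB | 8167MiB | 8031MiB | 8031MiB | 8031MiB |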
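
To put the ZeRO trade-off in the table into relative terms, the sketch below compares experiment 4 (data parallel 8 + zero_stage=2) against experiment 3 (data parallel 8), both with checkpointing on and batch_size=32. The values are copied from the table; `percent_change` is a hypothetical helper, and averaging memory over the 8 GPUs is a choice made here for illustration.

```python
from statistics import mean

# Experiments 3 and 4 from the table (checkpointing on, batch_size=32).
baseline = {
    "throughput": 383.10,
    "memory_mib": [4103, 4135, 4087, 4149, 4135, 4185, 4165, 4127],
}
zero_stage_2 = {
    "throughput": 342.61,
    "memory_mib": [3717, 3799, 3995, 3909, 3763, 3771, 3731, 3747],
}


def percent_change(new: float, base: float) -> float:
    """Signed change of `new` relative to `base`, in percent."""
    return (new - base) / base * 100.0


throughput_change = percent_change(
    zero_stage_2["throughput"], baseline["throughput"]
)
memory_change = percent_change(
    mean(zero_stage_2["memory_mib"]), mean(baseline["memory_mib"])
)

print(f"throughput: {throughput_change:+.1f}%")
print(f"mean GPU memory: {memory_change:+.1f}%")
```

On these numbers, zero_stage=2 gives up roughly 10.6% of throughput for roughly 8% less memory per GPU on average.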

I also tested 2D SBP on BERT and compared it with the latest nightly. Throughput and memory show no noticeable change.

Nightly version info:

version: 0.7.0.dev20220301+cu102
git_commit: 6946b48
cmake_build_type: Release
rdma: True
mlir: True

| data parallel 4 + model parallel 2 + batch_size=8 | Throughput | Memory |
| --- | --- | --- |
| nightly | 17.44 | 9621MiB |
| fix_sbp_error | 17.41 | 9587MiB |
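
The "no noticeable change" reading can be checked with a few lines of arithmetic. This is a minimal sketch using the two rows above; the 1% threshold is an arbitrary assumption chosen for illustration, not a criterion stated in the thread.

```python
# Values copied from the BERT comparison above.
nightly = {"throughput": 17.44, "memory_mib": 9621}
fix_sbp_error = {"throughput": 17.41, "memory_mib": 9587}

# Assumption: differences under 1% are treated as measurement noise.
TOLERANCE_PERCENT = 1.0

# Signed difference of the fix branch relative to nightly, in percent.
deltas = {
    key: (fix_sbp_error[key] - nightly[key]) / nightly[key] * 100.0
    for key in nightly
}
within_noise = all(abs(d) <= TOLERANCE_PERCENT for d in deltas.values())

for key, delta in deltas.items():
    print(f"{key}: {delta:+.2f}%")
print("within tolerance:", within_noise)
```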

CPFLAME • Mar 02 '22 04:03