Regression test of the OneFlow ZeRO fix in libai
PR: https://github.com/Oneflow-Inc/oneflow/pull/7557
The solution for this issue has been settled; please try it out on this branch (it has already passed verification on my side).
Verification points (a config sketch follows right after this list):
1. ZeRO can actually run, which is the subject of this issue; @CPFLAME
2. libai's hybrid parallelism still performs normally (the change touches a basic restriction in SBP inference, and we want to confirm it has no negative impact on performance); @L1aoXingyu
Once both points pass, this PR will be merged.
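Below is a minimal sketch of the two groups of knobs these verification points exercise, expressed with the same dotted config keys that appear in the `train.sh` commands later in this issue. The nested layout is an assumption for illustration only; libai's actual config files may organize the `train` node differently.

```python
from omegaconf import OmegaConf

# Assumed layout of the relevant part of libai's `train` config node, built from
# the dotted override keys used in the reproduce commands below (illustrative only).
train = OmegaConf.create({
    # Point 2: hybrid parallelism (data x tensor/model x pipeline).
    "dist": {
        "data_parallel_size": 2,
        "tensor_parallel_size": 2,
        "pipeline_parallel_size": 2,
    },
    # Point 1: ZeRO on/off and its stage.
    "zero_optimization": {
        "enabled": True,
        "stage": 2,  # stage 3 is exercised in the failing command further down
    },
})

print(OmegaConf.to_yaml(train))
```

These are the same keys that can be overridden on the command line, e.g. `train.zero_optimization.enabled=True train.zero_optimization.stage=3`, as in the reproduce command below.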
This branch has been tested against libai master.
Configurations that currently run:
- data parallel = 2 + model parallel = 2
- data parallel = 2 + model parallel = 2 + pipeline parallel = 2
- data parallel = 2 + pipeline parallel = 2 + ZeRO
- data parallel = 4 + ZeRO
- model parallel = 2 + ZeRO
2-D SBP + ZeRO fails, while 1-D SBP + ZeRO seems to run in every case.
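For clarity, here is a rough sketch of what 1-D vs. 2-D SBP means in this context, using OneFlow's global-tensor API (the shapes and rank layout are made up for illustration, and exact API spellings may differ slightly between OneFlow versions):

```python
import oneflow as flow

# Needs to run on 4 ranks, e.g. launched via `python3 -m oneflow.distributed.launch`.

# 1-D SBP: a flat placement over 4 GPUs; every tensor carries a single SBP
# signature (here: split along the batch dimension, i.e. pure data parallelism).
p_1d = flow.placement("cuda", ranks=[0, 1, 2, 3])
x_1d = flow.randn(8, 16).to_global(placement=p_1d, sbp=flow.sbp.split(0))

# 2-D SBP: a 2x2 device mesh (data-parallel groups x model-parallel groups);
# every tensor carries a pair of SBP signatures. This is the combination that,
# together with ZeRO, currently trips the sbp-signature check shown below.
p_2d = flow.placement("cuda", ranks=[[0, 1], [2, 3]])
x_2d = flow.randn(8, 16).to_global(
    placement=p_2d, sbp=[flow.sbp.split(0), flow.sbp.broadcast]
)

print(x_1d.sbp, x_2d.sbp)
```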
Configuration that currently fails:
- data parallel = 2 + model parallel = 2 + ZeRO

Command to reproduce the error:
sh tools/train.sh configs/t5_pp_pretrain.py 4 train.dist.tensor_parallel_size=2 train.dist.pipeline_parallel_size=1 train.dist.data_parallel_size=2 train.zero_optimization.enabled=True train.zero_optimization.stage=3 train.log_period=1 train.train_micro_batch_size=32
Error message:
F20220301 16:50:26.966737 38104 op_graph.cpp:36]
File "/home/chengpeng/data/oneflow/oneflow/core/graph/op_graph.cpp", line 36, in SbpParallel4BnInOp
op().SbpParallel4BnInOp(bn_in_op)
File "/home/chengpeng/data/oneflow/oneflow/core/operator/operator.cpp", line 938, in SbpParallel4BnInOp
Check failed: sbp_signature_ sbp signature not infered
*** Check failure stack trace: ***
@ 0x7f10f5ab364d google::LogMessage::Fail()
@ 0x7f10f5ab584c google::LogMessage::SendToLog()
@ 0x7f10f5ab30ea google::LogMessage::Flush()
@ 0x7f10f5ab6229 google::LogMessageFatal::~LogMessageFatal()
@ 0x7f10fac89e67 oneflow::OpNode::SbpParallel4BnInOp()
@ 0x7f10faedabe6 _ZNSt17_Function_handlerIFvPN7oneflow6OpNodeEEZNS0_12_GLOBAL__N_131ForEachDataParallelNodeSequenceERKNS0_7OpGraphERKSt8functionIFbPKS1_EES8_IFvOSt10shared_ptrIKNS4_24DataParallelNodeSequenceEEEEEUlSA_E_E9_M_invokeERKSt9_Any_dataOS2_
@ 0x7f10fac963b4 oneflow::Graph<>::ForEachNode()
@ 0x7f10faedbc1f oneflow::(anonymous namespace)::ForEachParallelSortedNodeSequence()
@ 0x7f10faedea7c oneflow::(anonymous namespace)::OptimizerPlacementOptimizationPass::Apply()
@ 0x7f10fad7ad63 _ZZN7oneflow23LazyJobBuildAndInferCtx8CompleteEvENKUlRKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEiE1_clES8_i
@ 0x7f10fad7c680 oneflow::LazyJobBuildAndInferCtx::Complete()
@ 0x7f111bac78f3 CurJobBuildAndInferCtx_Complete()
The other three ranks (38101, 38102, 38103) abort with the same `Check failed: sbp_signature_ sbp signature not infered` stack trace from OptimizerPlacementOptimizationPass, after which the launcher kills all the workers:
Killing subprocess 38101
Killing subprocess 38102
Killing subprocess 38103
Killing subprocess 38104
Traceback (most recent call last):
File "/home/chengpeng/data/anaconda3/envs/oneflow-dev-gcc7-v2/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/home/chengpeng/data/anaconda3/envs/oneflow-dev-gcc7-v2/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/chengpeng/data/oneflow/python/oneflow/distributed/launch.py", line 223, in <module>
main()
File "/home/chengpeng/data/oneflow/python/oneflow/distributed/launch.py", line 211, in main
sigkill_handler(signal.SIGTERM, None)
File "/home/chengpeng/data/oneflow/python/oneflow/distributed/launch.py", line 180, in sigkill_handler
returncode=last_return_code, cmd=cmd
subprocess.CalledProcessError: Command '['/home/chengpeng/data/anaconda3/envs/oneflow-dev-gcc7-v2/bin/python3', '-u', 'tools/train_net.py', '--config-file', 'configs/t5_pp_pretrain.py', 'train.dist.tensor_parallel_size=2', 'train.dist.pipeline_parallel_size=1', 'train.dist.data_parallel_size=2', 'train.zero_optimization.enabled=True', 'train.zero_optimization.stage=3', 'train.log_period=1', 'train.train_micro_batch_size=32']' died with <Signals.SIGABRT: 6>.
Output from another run (pure model parallelism, tensor_parallel_size=8, data_parallel_size=1, ZeRO stage 3), which also died with SIGABRT and additionally reported gRPC "Machine 0 lost" check failures:
subprocess.CalledProcessError: Command '['/home/chengpeng/data/anaconda3/envs/oneflow-dev-gcc7-v2/bin/python3', '-u', 'tools/train_net.py', '--config-file', 'configs/t5_pp_pretrain.py', 'train.dist.tensor_parallel_size=8', 'train.dist.pipeline_parallel_size=1', 'train.dist.data_parallel_size=1', 'train.zero_optimization.enabled=True', 'train.zero_optimization.stage=3', 'train.log_period=1']' died with <Signals.SIGABRT: 6>.
F20220301 16:13:39.021886 52842 ctrl_client.cpp:54] Check failed: rpc_client_.GetStubAt(i)->CallMethod<CtrlMethod::kLoadServer>( &client_ctx, request, &response).error_code() == grpc::StatusCode::OK (14 vs. 0) Machine 0 lost
*** Check failure stack trace: ***
F20220301 16:13:39.023948 52841 ctrl_client.cpp:54] Check failed: rpc_client_.GetStubAt(i)->CallMethod<CtrlMethod::kLoadServer>( &client_ctx, request, &response).error_code() == grpc::StatusCode::OK (14 vs. 0) Machine 0 lost
*** Check failure stack trace: ***
(google::LogMessage / GrpcCtrlClient / execute_native_thread_routine / start_thread / clone frames from both processes, interleaved)
F20220301 16:13:39.281342 51725 rpc_client.cpp:40] Check failed: stub->CallMethod<ctrl_method>(&client_ctx, request, &response).error_code() == grpc::StatusCode::OK (14 vs. 0)
Isn't this just the previous connection not having been released yet? Could you wait a while and run it again?
It's the same; the error is still there. My feeling is that the failures on the multiple GPUs are what produced this output.
I misread it; the actual error should be this one:
F20220301 08:31:15.972954 77402 exec_graph.cpp:117]
File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/graph/exec_graph.cpp", line 117, in InferBlobDescs
op_->InferBlobDescsIf(GetBlobDesc4BnInOp, parallel_ctx, &GlobalJobDesc())
File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/operator/operator.cpp", line 324, in InferBlobDescsIf
InferOutBlobDescsIf(GetBlobDesc4BnInOp, parallel_ctx)
File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/operator/user_op.cpp", line 599, in InferOutBlobDescs
val_->physical_tensor_desc_infer_fn(&infer_ctx)
File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/user/ops/reshape_op.cpp", line 101, in InferPhysicalTensorDesc
Check failed: (out_shape->elem_cnt()) == (in_shape.elem_cnt()) (1572864 vs 2359296) Reshape infer ERROR! in op_name: model.t5_model.encoder.layers.0.self_attention-reshape-25 input shape is : (32,512,144) , output shape is : (32,512,1,96) , output logical shape is (32,512,12,96) , And reshape shape conf is : (32,512,12,96) op_loc: Python stack[-2]: <frame at 0x56054967e450, file '/home/ylkj/miniconda3/lib/python3.7/site-packages/oneflow/framework/tensor.py', line 917, code _view>; Python stack[-1]: <frame at 0x56054966bc20, file '/home/ylkj/miniconda3/lib/python3.7/site-packages/oneflow/nn/modules/reshape.py', line 68, code view_op>; C API: <func reshape>
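As a quick sanity check on the numbers in that message (illustrative arithmetic only, not part of the original log): the physical input blob on this rank holds exactly one eighth of the logical tensor, while the inferred physical output shape corresponds to a one-twelfth slice, so the element counts cannot match and the reshape check fires.

```python
# Element counts taken from the shapes in the error message above.
def elem_cnt(shape):
    n = 1
    for d in shape:
        n *= d
    return n

in_physical = elem_cnt((32, 512, 144))     # 2359296  - input blob on this rank
out_physical = elem_cnt((32, 512, 1, 96))  # 1572864  - inferred output blob
out_logical = elem_cnt((32, 512, 12, 96))  # 18874368 - logical (global) output

print(out_logical // in_physical)   # 8  -> the input is a 1/8 slice of the logical tensor
print(out_logical // out_physical)  # 12 -> the inferred output is a 1/12 slice
print(in_physical == out_physical)  # False -> "Reshape infer ERROR" check fails
```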
Shouldn't ZeRO be used together with data parallelism? Maybe it was never meant to be combined with model parallelism in the first place.
Hold on a moment, let me update the experiment configuration and run a more thorough set of experiments.
I edited the comment directly; please see the latest information here: https://github.com/Oneflow-Inc/libai/issues/150#issuecomment-1055148111
Tested how this branch behaves on T5 under various configurations. Activation checkpointing plus pure data parallelism should be the best overall choice (experiments 3 and 6).
Experiment results on T5

| # | Configuration (checkpointing on, batch_size=32) | Throughput | GPU 0 memory | GPU 1 memory | GPU 2 memory | GPU 3 memory | GPU 4 memory | GPU 5 memory | GPU 6 memory | GPU 7 memory |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | data parallel 2 + model parallel 2 + pipeline parallel 2 | 36.38 | 4509MiB | 4607MiB | 4547MiB | 4615MiB | 4477MiB | 4495MiB | 4489MiB | 4503MiB |
| 2 | data parallel 4 + model parallel 2 | 74.03 | 6093MiB | 6191MiB | 6141MiB | 6207MiB | 6141MiB | 6205MiB | 6141MiB | 6207MiB |
| 3 | data parallel 8 | 383.10 | 4103MiB | 4135MiB | 4087MiB | 4149MiB | 4135MiB | 4185MiB | 4165MiB | 4127MiB |
| 4 | data parallel 8 + zero_stage=2 | 342.61 | 3717MiB | 3799MiB | 3995MiB | 3909MiB | 3763MiB | 3771MiB | 3731MiB | 3747MiB |
| 5 | data parallel 8 + zero_stage=2 + batch_size=16 | 289.02 | 2765MiB | 2875MiB | 2793MiB | 2779MiB | 2793MiB | 2765MiB | 2775MiB | 2801MiB |
| | Checkpointing off, batch_size=16 | | | | | | | | | |
| 6 | data parallel 8 | 447.14 | 8217MiB | 8383MiB | 8217MiB | 8249MiB | 8383MiB | 8383MiB | 8249MiB | 8249MiB |
| 7 | data parallel 4 + model parallel 2 | 86.57 | 6761MiB | 6825MiB | 6809MiB | 6841MiB | 6809MiB | 6829MiB | 6809MiB | 6841MiB |
| 8 | data parallel 2 + model parallel 2 + pipeline parallel 2 | 46.33 | 4537MiB | 4601MiB | 4615MiB | 4709MiB | 3559MiB | 3579MiB | 3579MiB | 3579MiB |
| 9 | data parallel 8 + zero_stage=2 | 344.63 | 7873MiB | 7905MiB | 7905MiB | 7905MiB | 7905MiB | 8041MiB | 7905MiB | 7905MiB |
| 10 | data parallel 8 + zero_stage=3 | 332.75 | 7999MiB | 8031MiB | 8167MiB | 8031MiB | 8167MiB | 8031MiB | 8031MiB | 8031MiB |
Tested the 2-D SBP case on BERT and compared this branch against the latest nightly: throughput and memory usage show no noticeable difference. Nightly version info (how it can be printed is sketched after the listing):
version: 0.7.0.dev20220301+cu102
git_commit: 6946b48
cmake_build_type: Release
rdma: True
mlir: True
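For reference, a minimal way to print this kind of environment info from Python; the `--doctor` module flag mentioned in the comment is what I believe produces the fields above, but treat that as an assumption if your build differs:

```python
# Version string of the installed OneFlow wheel, e.g. "0.7.0.dev20220301+cu102".
import oneflow

print(oneflow.__version__)

# The fuller report (version, git_commit, cmake_build_type, rdma, mlir) should be
# available from the command line as well -- assumed invocation:
#   python3 -m oneflow --doctor
```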
| data parallel 4 + model parallel 2, batch_size=8 | Throughput | Memory |
|---|---|---|
| nightly | 17.44 | 9621MiB |
| fix_sbp_error | 17.41 | 9587MiB |