libai
Dev export onnx
What this PR does:
- [x] Add a script that converts MT5 to ONNX (for now it has to be run on the branch https://github.com/Oneflow-Inc/oneflow_convert/tree/fix_t5_export_onnx_bug)
- [x] Add a script for MT5 ONNX inference. Just run
python libai/onnx_export/onnx_inference/t5_onnx_infer.py
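That is all it takes. However, because ONNX inputs and outputs are all NumPy arrays, the generate functionality cannot be ported over from libai for now. Inference currently works only by converting the inputs of model.forward() in model.py to NumPy arrays.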
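As a rough illustration of what "converting the forward() inputs to NumPy" can look like, here is a minimal sketch using ONNX Runtime. The input names, mask shapes, and dtypes below are assumptions made for illustration; the real ones should be read from the exported graph with `session.get_inputs()`, and this is not the code in `t5_onnx_infer.py`.

```python
import numpy as np


def build_feed(batch_size=1, enc_len=5, dec_len=3, vocab_size=100, seed=0):
    """Build NumPy inputs shaped like the arguments of model.forward().

    The names, shapes, and dtypes are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    return {
        "encoder_input_ids": rng.integers(0, vocab_size, (batch_size, enc_len), dtype=np.int64),
        "decoder_input_ids": rng.integers(0, vocab_size, (batch_size, dec_len), dtype=np.int64),
        "encoder_attn_mask": np.ones((batch_size, enc_len, enc_len), dtype=bool),
        # Causal (lower-triangular) mask for the decoder.
        "decoder_attn_mask": np.tril(np.ones((batch_size, dec_len, dec_len), dtype=bool)),
        "encoder_decoder_attn_mask": np.ones((batch_size, dec_len, enc_len), dtype=bool),
    }


def run_onnx(model_path, feed):
    """Run one forward pass with ONNX Runtime; inputs and outputs are NumPy arrays."""
    import onnxruntime as ort  # imported lazily so build_feed works without it

    session = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])
    # Keep only the inputs that the exported graph actually declares.
    wanted = {i.name for i in session.get_inputs()}
    return session.run(None, {k: v for k, v in feed.items() if k in wanted})
```

Since each call is a single forward pass over NumPy arrays, a generate loop would have to be rebuilt on top of `run_onnx` by hand, which is why it cannot simply be reused from libai.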
Running:
python libai/onnx_export/t5_to_onnx.py
fails with:
loaded library: /lib/libibverbs.so.1
Distributed env is not set up, configure it by default (single node, single gpu).
F20220913 03:06:14.356911 2954732 math_binary_broadcast_ops.cpp:187] UNIMPLEMENTED
*** Check failure stack trace: ***
@ 0x7fa5f1aa9fda google::LogMessage::Fail()
@ 0x7fa5f1aaa2c2 google::LogMessage::SendToLog()
@ 0x7fa5f1aa9b47 google::LogMessage::Flush()
@ 0x7fa5f1aac6b9 google::LogMessageFatal::~LogMessageFatal()
@ 0x7fa5eca0d1a2 oneflow::(anonymous namespace)::GetBinaryBroadcastSbpSignature<>()
@ 0x7fa5eca0d4d9 oneflow::BroadcastAddOp::GetSbp()
@ 0x7fa6bf83001c std::_Function_handler<>::_M_invoke()
@ 0x7fa5eaed4288 oneflow::UserOp::GetSbpSignatures()
@ 0x7fa5eae97a43 oneflow::Operator::GetSbpSignaturesIf()
@ 0x7fa5eae9a6b1 oneflow::Operator::InferSbpSignature()
@ 0x7fa5eaed5c60 oneflow::UserOp::InferSbpSignature()
@ 0x7fa5eae80a3d oneflow::Operator::InferSbpSignature()
@ 0x7fa5eae99381 oneflow::Operator::InferNdSbpSignature()
@ 0x7fa5eaed5fc7 oneflow::UserOp::InferNdSbpSignature()
@ 0x7fa5eaea6aaf oneflow::Operator::InferNdSbpSignatureIf()
@ 0x7fa5e9fbc2ff oneflow::JobBuildAndInferCtx::InferOpOutNdSbp()
@ 0x7fa5e9fbf406 oneflow::JobBuildAndInferCtx::AddAndInferOp()
@ 0x7fa5e9fc3e32 oneflow::JobBuildAndInferCtx::AddAndInferGlobalOp()
@ 0x7fa5e98595ad oneflow::one::LazyInterpreter::ApplyImpl()
@ 0x7fa5e985ed17 oneflow::one::LazyInterpreter::Apply()
@ 0x7fa5e985f2eb oneflow::one::AutogradInterpreter::Apply()
@ 0x7fa5e986213c oneflow::one::OpInterpUtil::Dispatch()
@ 0x7fa5e9864856 oneflow::one::OpInterpUtil::Dispatch<>()
@ 0x7fa5e986500e oneflow::one::OpInterpUtil::Dispatch<>()
@ 0x7fa6bfa837a7 oneflow::one::OpInterpUtil::Dispatch<>()
@ 0x7fa5e9a9ae82 oneflow::one::functional::impl::AddFunctor::operator()()
@ 0x7fa5e9a9b990 _ZNSt17_Function_handlerIFN7oneflow5MaybeINS0_3one6TensorEvEERKSt10shared_ptrIS3_ES8_RKNS0_6ScalarERKbEZNS2_10functional18PackedFunctorMakerIFS4_S8_S8_SB_bEE4makeINSF_4impl10AddFunctorELi0EEENSF_13PackedFunctorISE_EERKSsRKT_EUlS8_S8_SB_SD_E_E9_M_invokeERKSt9_Any_dataS8_S8_SB_SD_
@ 0x7fa5ecc7fc44 oneflow::one::functional::Add()
@ 0x7fa6bf893567 oneflow::one::functional::add()
@ 0x7fa6bfa2aa6b (unknown)
@ 0x55e4969067ed PyNumber_Add
@ 0x55e49699274c _PyEval_EvalFrameDefault
[1] 2954732 abort (core dumped) python libai/onnx_export/t5_to_onnx.py
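My guess is that some addition or multiplication statements in the transformer automatically expand (broadcast) the tensor dimensions, and that this is what triggers the error `math_binary_broadcast_ops.cpp:187] UNIMPLEMENTED`?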
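To make the guess concrete, the kind of implicit broadcasting meant here can be sketched in NumPy (used only to show the semantics; the shapes are made up, and whether this pattern is really what hits the unimplemented SBP path in OneFlow is not confirmed):

```python
import numpy as np

batch, heads, seq = 2, 4, 3
scores = np.zeros((batch, heads, seq, seq), dtype=np.float32)
# Additive attention mask with singleton head and query dimensions.
mask = np.full((batch, 1, 1, seq), -1e4, dtype=np.float32)

# Implicit broadcasting: the size-1 dimensions of `mask` are expanded automatically.
implicit = scores + mask

# Explicit alternative: expand the mask to the full shape first, then add
# two tensors of identical shape.
explicit = scores + np.broadcast_to(mask, scores.shape)
```

Both forms give the same result. If the guess is right, one possible workaround would be to write the explicit form in the model code so that the add never has to broadcast.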
After fixing the code, the current error is:
loaded library: /lib/libibverbs.so.1
Distributed env is not set up, configure it by default (single node, single gpu).
Traceback (most recent call last):
File "libai/onnx_export/t5_to_onnx.py", line 64, in <module>
export_onnx_model(t5_graph,
File "/home/chengpeng/miniconda3/envs/libai/lib/python3.8/site-packages/oneflow_onnx-0.5.5-py3.8.egg/oneflow_onnx/oneflow2onnx/util.py", line 75, in export_onnx_model
File "/home/chengpeng/miniconda3/envs/libai/lib/python3.8/site-packages/oneflow/framework/check_point_v2.py", line 425, in save
pickled_bytes = pickle.dumps(obj)
File "/home/chengpeng/miniconda3/envs/libai/lib/python3.8/site-packages/oneflow/framework/check_point_v2.py", line 165, in tensor_getstate
assert self.is_local
AssertionError
It looks like operations on global tensors are not supported here at the moment, only local tensors (the save step asserts `self.is_local`).
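For reference, the local/global distinction that the assertion checks can be sketched as follows. The helper name `to_local_state_dict` is hypothetical; it relies only on the `is_global` attribute and `to_local()` method of OneFlow tensors. It does not change what `export_onnx_model` does internally, so treat it as an illustration of the idea rather than a fix.

```python
def to_local_state_dict(state_dict):
    """Return a copy of state_dict with global tensors replaced by local ones.

    Hypothetical helper: any value exposing `is_global` and `to_local()`
    (as OneFlow tensors do) is converted; everything else is passed through.
    """
    local = {}
    for name, tensor in state_dict.items():
        if getattr(tensor, "is_global", False):
            # On a single device the local tensor holds the full data; with
            # several devices it is only this rank's shard.
            tensor = tensor.to_local()
        local[name] = tensor
    return local
```

A state dict converted this way contains only local tensors, which is what `tensor_getstate` requires before pickling.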
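This is related to https://github.com/Oneflow-Inc/one-yolov5/issues/23, and I am working on a solution (I also mentioned in my weekly report that I want to solve the ONNX export problem for models that contain free eager tensors).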