
Dev export onnx

Open CPFLAME opened this issue 2 years ago • 3 comments

This PR aims to:

  • [x] A script to convert MT5 to ONNX (currently this must be run on the branch https://github.com/Oneflow-Inc/oneflow_convert/tree/fix_t5_export_onnx_bug)
  • [x] A script for MT5 ONNX inference. Simply run python libai/onnx_export/onnx_inference/t5_onnx_infer.py. However, since ONNX inputs and outputs are both numpy arrays, the generate functionality cannot be ported over from libai for now; currently inference only works by converting the inputs of model.forward() (as defined in model.py) into numpy format.
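The numpy-only interface described above can be sketched as follows (the input names and helper here are hypothetical illustrations, not the actual code in t5_onnx_infer.py): the tensors passed to model.forward() must first be materialized as numpy arrays, typically int64 token ids, before they can be fed to an ONNX session.

```python
import numpy as np

# Hypothetical sketch: ONNX Runtime consumes plain numpy arrays, so the
# tensor inputs of model.forward() must be converted first. Token ids are
# commonly int64 in exported T5-style graphs.
def make_onnx_inputs(encoder_ids, decoder_ids):
    return {
        "encoder_input_ids": np.asarray(encoder_ids, dtype=np.int64),
        "decoder_input_ids": np.asarray(decoder_ids, dtype=np.int64),
    }

feeds = make_onnx_inputs([[101, 7, 2]], [[0]])
# A dict like `feeds` would then be passed to an ONNX Runtime session,
# e.g. session.run(None, feeds).
```

Because the inputs and outputs are raw numpy arrays, stateful generation loops (beam search, sampling) from libai cannot be reused directly; only single forward passes are covered.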

CPFLAME avatar Sep 13 '22 03:09 CPFLAME

Running:

python libai/onnx_export/t5_to_onnx.py

fails with the following error:

loaded library: /lib/libibverbs.so.1
Distributed env is not set up, configure it by default (single node, single gpu).
F20220913 03:06:14.356911 2954732 math_binary_broadcast_ops.cpp:187] UNIMPLEMENTED
*** Check failure stack trace: ***
    @     0x7fa5f1aa9fda  google::LogMessage::Fail()
    @     0x7fa5f1aaa2c2  google::LogMessage::SendToLog()
    @     0x7fa5f1aa9b47  google::LogMessage::Flush()
    @     0x7fa5f1aac6b9  google::LogMessageFatal::~LogMessageFatal()
    @     0x7fa5eca0d1a2  oneflow::(anonymous namespace)::GetBinaryBroadcastSbpSignature<>()
    @     0x7fa5eca0d4d9  oneflow::BroadcastAddOp::GetSbp()
    @     0x7fa6bf83001c  std::_Function_handler<>::_M_invoke()
    @     0x7fa5eaed4288  oneflow::UserOp::GetSbpSignatures()
    @     0x7fa5eae97a43  oneflow::Operator::GetSbpSignaturesIf()
    @     0x7fa5eae9a6b1  oneflow::Operator::InferSbpSignature()
    @     0x7fa5eaed5c60  oneflow::UserOp::InferSbpSignature()
    @     0x7fa5eae80a3d  oneflow::Operator::InferSbpSignature()
    @     0x7fa5eae99381  oneflow::Operator::InferNdSbpSignature()
    @     0x7fa5eaed5fc7  oneflow::UserOp::InferNdSbpSignature()
    @     0x7fa5eaea6aaf  oneflow::Operator::InferNdSbpSignatureIf()
    @     0x7fa5e9fbc2ff  oneflow::JobBuildAndInferCtx::InferOpOutNdSbp()
    @     0x7fa5e9fbf406  oneflow::JobBuildAndInferCtx::AddAndInferOp()
    @     0x7fa5e9fc3e32  oneflow::JobBuildAndInferCtx::AddAndInferGlobalOp()
    @     0x7fa5e98595ad  oneflow::one::LazyInterpreter::ApplyImpl()
    @     0x7fa5e985ed17  oneflow::one::LazyInterpreter::Apply()
    @     0x7fa5e985f2eb  oneflow::one::AutogradInterpreter::Apply()
    @     0x7fa5e986213c  oneflow::one::OpInterpUtil::Dispatch()
    @     0x7fa5e9864856  oneflow::one::OpInterpUtil::Dispatch<>()
    @     0x7fa5e986500e  oneflow::one::OpInterpUtil::Dispatch<>()
    @     0x7fa6bfa837a7  oneflow::one::OpInterpUtil::Dispatch<>()
    @     0x7fa5e9a9ae82  oneflow::one::functional::impl::AddFunctor::operator()()
    @     0x7fa5e9a9b990  _ZNSt17_Function_handlerIFN7oneflow5MaybeINS0_3one6TensorEvEERKSt10shared_ptrIS3_ES8_RKNS0_6ScalarERKbEZNS2_10functional18PackedFunctorMakerIFS4_S8_S8_SB_bEE4makeINSF_4impl10AddFunctorELi0EEENSF_13PackedFunctorISE_EERKSsRKT_EUlS8_S8_SB_SD_E_E9_M_invokeERKSt9_Any_dataS8_S8_SB_SD_
    @     0x7fa5ecc7fc44  oneflow::one::functional::Add()
    @     0x7fa6bf893567  oneflow::one::functional::add()
    @     0x7fa6bfa2aa6b  (unknown)
    @     0x55e4969067ed  PyNumber_Add
    @     0x55e49699274c  _PyEval_EvalFrameDefault
[1]    2954732 abort (core dumped)  python libai/onnx_export/t5_to_onnx.py

My guess: some add or multiply statements in the transformer implicitly expand tensor dimensions via broadcasting, which triggers the error math_binary_broadcast_ops.cpp:187] UNIMPLEMENTED?
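The suspected pattern can be reproduced in plain numpy (a sketch of the broadcasting semantics only, not of OneFlow's actual SBP inference): an elementwise add between mismatched shapes silently expands one operand, and one possible workaround is to make the expansion explicit before the add so the binary op itself never has to broadcast.

```python
import numpy as np

a = np.ones((4, 1, 8))   # e.g. an attention bias with a size-1 axis
b = np.ones((4, 6, 8))   # e.g. hidden states

# Implicit broadcast: 'a' is expanded along axis 1 inside the add.
implicit = a + b

# Workaround sketch: expand explicitly first, then add same-shape arrays.
explicit = np.broadcast_to(a, b.shape) + b

assert implicit.shape == explicit.shape == (4, 6, 8)
```

If this guess is right, rewriting such sites to expand tensors explicitly before the binary op might avoid hitting the unimplemented SBP signature.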

CPFLAME avatar Sep 13 '22 03:09 CPFLAME

After fixing the code a bit, the current error is:

loaded library: /lib/libibverbs.so.1
Distributed env is not set up, configure it by default (single node, single gpu).
Traceback (most recent call last):
  File "libai/onnx_export/t5_to_onnx.py", line 64, in <module>
    export_onnx_model(t5_graph,
  File "/home/chengpeng/miniconda3/envs/libai/lib/python3.8/site-packages/oneflow_onnx-0.5.5-py3.8.egg/oneflow_onnx/oneflow2onnx/util.py", line 75, in export_onnx_model
  File "/home/chengpeng/miniconda3/envs/libai/lib/python3.8/site-packages/oneflow/framework/check_point_v2.py", line 425, in save
    pickled_bytes = pickle.dumps(obj)
  File "/home/chengpeng/miniconda3/envs/libai/lib/python3.8/site-packages/oneflow/framework/check_point_v2.py", line 165, in tensor_getstate
    assert self.is_local
AssertionError

It looks like operations on global tensors are not supported here yet, only local tensors.
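The failing assert can be mimicked with a toy stand-in for the tensor class (purely illustrative, not OneFlow's real implementation): pickling a "global" object trips the same kind of is_local assertion, while converting to a local form first succeeds.

```python
import pickle

class ToyTensor:
    """Toy stand-in mimicking the is_local check in tensor_getstate."""
    def __init__(self, data, is_local=True):
        self.data = data
        self.is_local = is_local

    def __getstate__(self):
        # Mirrors the `assert self.is_local` in check_point_v2.py line 165.
        assert self.is_local
        return {"data": self.data, "is_local": True}

    def to_local(self):
        # Sketch of the fix direction: materialize a local copy before saving.
        return ToyTensor(self.data, is_local=True)

g = ToyTensor([1, 2, 3], is_local=False)
try:
    pickle.dumps(g)  # raises AssertionError, like the export path above
except AssertionError:
    ok_after_to_local = pickle.dumps(g.to_local())  # succeeds
```

This suggests one direction for the export path: convert global tensors to local ones (or gather them to rank 0) before the checkpoint-saving step pickles them.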

CPFLAME avatar Sep 13 '22 05:09 CPFLAME

This is related to https://github.com/Oneflow-Inc/one-yolov5/issues/23; I am working on a solution (I also mentioned in my weekly report that I need to solve the ONNX export problem for models with free eager tensors).

BBuf avatar Sep 13 '22 06:09 BBuf