
t5_bf16 notebook fails with [ONNXRuntimeError] : 10 : INVALID_GRAPH

michaelroyzen opened this issue • 4 comments

I'm running the t5_bf16 notebook with the T0_3B model. Everything works great until

enc_fp16_onnx = create_model_for_provider(encoder_model_path, "CUDAExecutionProvider", log_severity=3)
enc_fp16_onnx_binding: IOBinding = enc_fp16_onnx.io_binding()
dec_onnx = create_model_for_provider(dec_if_model_path, "CUDAExecutionProvider", log_severity=3)
dec_onnx_binding: IOBinding = dec_onnx.io_binding()

causes

InvalidGraph: [ONNXRuntimeError] : 10 : INVALID_GRAPH : Load model from ./test-enc/model.onnx failed:This is an invalid model. Type Error: Type 'tensor(bfloat16)' of input parameter (onnx::Pow_398) of operator (Pow) in node (Pow_138) is invalid.
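
A quick way to confirm what the runtime is complaining about is to inspect the exported graph's opset and re-run the ONNX checker (a minimal sketch; the path matches the error above):

import onnx

# Pow only accepts a bfloat16 exponent from opset 15 onwards
enc = onnx.load("./test-enc/model.onnx")
print([(op.domain, op.version) for op in enc.opset_import])
# checking by path avoids the 2 GB in-memory protobuf limit and should
# surface the same type error for the bf16 Pow when the opset is too old
onnx.checker.check_model("./test-enc/model.onnx")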

EDIT 8/1: This is odd, as onnx claims to support Pow in bf16 as of https://github.com/onnx/onnx/pull/3412. The linked PR suggests that only opset 15+ supports a bfloat16 exponent in Pow. I upgraded the opset version to 15 in convert_to_onnx(), and now I get a RuntimeError when calling create_model_for_provider:

RuntimeException: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Exception during initialization: /onnxruntime_src/onnxruntime/core/optimizer/optimizer_execution_frame.cc:75 onnxruntime::OptimizerExecutionFrame::Info::Info(const std::vector<const onnxruntime::Node*>&, const InitializedTensorSet&, const onnxruntime::Path&, const onnxruntime::IExecutionProvider&, const std::function<bool(const std::basic_string&)>&) [ONNXRuntimeError] : 2 : INVALID_ARGUMENT : UnpackTensor: the pre-allocate size does not match the size in proto
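
For anyone hitting the same wall: an alternative to re-exporting is bumping the opset of the already-exported graph with onnx's version converter (a sketch; the converter doesn't handle every op, and very large models may need to be re-exported anyway):

import onnx
from onnx import version_converter

# rewrite the graph against opset 15 so Pow accepts a bf16 exponent
m = onnx.load("./test-enc/model.onnx")
m15 = version_converter.convert_version(m, 15)
onnx.save(m15, "./test-enc/model_opset15.onnx")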

I'm running PyTorch 1.11.0 and onnx 1.12.0 with onnxruntime 1.12.0. Your help would be greatly appreciated, @pommedeterresautee.

Hardware: NVIDIA A10 (24 GB) with hardware bf16 support

— michaelroyzen, Jul 30 '22

@pommedeterresautee The t5_bf16 notebook doesn't work with t5-3b either, for that matter. It errors on the same line as T0_3B, but for a different reason:

InvalidArgument: [ONNXRuntimeError] : 2 : INVALID_ARGUMENT : Deserialize tensor onnx::MatMul_2878 failed.UnpackTensor: the pre-allocate size does not match the size in proto
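
Since t5-3b's weights are well past the 2 GB protobuf limit, one thing worth checking is whether the exported weights live in external data files next to the graph and are being picked up (a sketch, assuming the export wrote external data into ./test-enc):

import onnx
from onnx.external_data_helper import load_external_data_for_model

# load the graph without weights, then attach the external tensor files
# explicitly; a size mismatch here points at missing or stale weight files
m = onnx.load("./test-enc/model.onnx", load_external_data=False)
load_external_data_for_model(m, base_dir="./test-enc")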

This is quite important for the project I'm working on, and it would be great if you could help ASAP. Thank you in advance.

— michaelroyzen, Aug 01 '22

I will check in the coming days, but TBH I'm not sure you will like BF16 accuracy: it's quite low compared to FP16, which implies adding casts everywhere (our hope was to not have to do that anymore). The catch is that models trained in BF16 accumulate in FP32, so in the end you need good precision to reproduce the results. Range kills FP16 and precision kills BF16 on deep nets; in the end, casting is the only way. One thing that broke many things is PyTorch 1.12.0 (it changed the way some values are stored in ONNX); we are pushing patches here and there but have not retried those notebooks.
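
To make the range/precision trade-off concrete, a quick torch.finfo comparison (just an illustration):

import torch

# FP16: more mantissa bits (precision) but max ~65504, so activations overflow
# BF16: FP32's exponent range but only ~3 significant decimal digits
print(torch.finfo(torch.float16))   # eps ~9.8e-04, max ~6.5e+04
print(torch.finfo(torch.bfloat16))  # eps ~7.8e-03, max ~3.4e+38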

One thing you may want to try is exporting ONNX from PyTorch with amp enabled (fp16 and bf16 are both supported). In this video at 6'30 they say it should work in the latest PyTorch; I haven't had time to try it myself: https://www.youtube.com/watch?v=R2mUT_s0PbE
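
If you try it, the shape of the thing would be roughly this (an untested sketch; model and input_ids stand in for the notebook's encoder and a sample batch):

import torch

# trace the export under autocast so matmul-heavy ops are captured in bf16;
# per the video above this needs a recent PyTorch to export cleanly
with torch.no_grad(), torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    torch.onnx.export(
        model,
        (input_ids,),
        "model_bf16.onnx",
        opset_version=15,  # a bf16 Pow exponent needs opset >= 15
        input_names=["input_ids"],
        output_names=["last_hidden_state"],
        dynamic_axes={"input_ids": {0: "batch", 1: "seq"}},
    )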

If you do, I'd be very interested to know whether it worked for you.

I also found this issue about that possibility, with a related bug and fixes: https://github.com/pytorch/pytorch/issues/72494

It seems to work... hope it helps with your project.

— pommedeterresautee, Aug 05 '22

Thanks @pommedeterresautee. I couldn't find any more information on how I could use amp in the export process.

I'm actually using PyTorch 1.11 for the export (with onnx 1.12.0 and onnxruntime-gpu 1.12.0). The odd thing is that t5-small works in the t5_bf16 notebook but t5-3b does not. I'd appreciate your help here.

— michaelroyzen, Aug 20 '22

@pommedeterresautee Part of the issue seems to be that the notebooks are generally broken with the latest version of the library and its dependencies. I've created a separate issue about that, #130.

— michaelroyzen, Aug 20 '22