
Trying to Load Model Quantized for TensorRT Fails

Open sam-h-bean opened this issue 2 years ago • 9 comments

System Info

Building from source
Running inside the nvcr.io/nvidia/tensorrt:21.07-py3 docker container

(Note: there is a bug, so you will have to build from #286.)

Who can help?

@philschmid @JingyaHuang

Information

  • [ ] The official example scripts
  • [X] My own modified scripts

Tasks

  • [X] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [ ] My own task or dataset (give details below)

Reproduction

from optimum.onnxruntime import ORTQuantizer
from optimum.onnxruntime.modeling_ort import ORTModelForCausalLM
from optimum.onnxruntime.configuration import AutoQuantizationConfig
from transformers import AutoTokenizer, AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained('distilgpt2')
tokenizer = AutoTokenizer.from_pretrained('distilgpt2', padding=True, truncation=True)

qconfig = AutoQuantizationConfig.tensorrt(is_static=False, per_channel=False)
quantizer = ORTQuantizer(model=model, preprocessor=tokenizer, feature="causal-lm")
quantizer.export(
    onnx_model_path="model.onnx",
    onnx_quantized_model_output_path="model-quantized.onnx",
    quantization_config=qconfig,
)

tensorrt_model = ORTModelForCausalLM.load_model("./model-quantized.onnx", provider="TensorrtExecutionProvider")

This results in an error like:

RuntimeError: /onnxruntime_src/onnxruntime/core/session/provider_bridge_ort.cc:1025 onnxruntime::Provider& onnxruntime::ProviderLibrary::Get() [ONNXRuntimeError] : 1 : FAIL : Failed to load library libonnxruntime_providers_tensorrt.so with error: /usr/local/lib/python3.8/dist-packages/onnxruntime/capi/libonnxruntime_providers_tensorrt.so: undefined symbol: getBuilderPluginRegistry

Using onnxruntime-gpu==1.11.1, I get this instead:

2022-07-13 04:36:35.784713840 [E:onnxruntime:, inference_session.cc:1587 operator()] Exception during initialization: /onnxruntime_src/onnxruntime/core/providers/tensorrt/tensorrt_execution_provider.cc:798 SubGraphCollection_t onnxruntime::TensorrtExecutionProvider::GetSupportedList(SubGraphCollection_t, int, int, const onnxruntime::GraphViewer&, bool*) const [ONNXRuntimeError] : 1 : FAIL : TensorRT input: input has no shape specified. Please run shape inference on the onnx model first. Details can be found in https://onnxruntime.ai/docs/execution-providers/TensorRT-ExecutionProvider.html#shape-inference-for-tensorrt-subgraphs

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.8/dist-packages/optimum/onnxruntime/modeling_ort.py", line 143, in load_model
    return ort.InferenceSession(path, providers=[provider])
  File "/usr/local/lib/python3.8/dist-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 335, in __init__
    self._create_inference_session(providers, provider_options, disabled_optimizers)
  File "/usr/local/lib/python3.8/dist-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 381, in _create_inference_session
    sess.initialize_session(providers, provider_options, disabled_optimizers)
onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Exception during initialization: /onnxruntime_src/onnxruntime/core/providers/tensorrt/tensorrt_execution_provider.cc:798 SubGraphCollection_t onnxruntime::TensorrtExecutionProvider::GetSupportedList(SubGraphCollection_t, int, int, const onnxruntime::GraphViewer&, bool*) const [ONNXRuntimeError] : 1 : FAIL : TensorRT input: input has no shape specified. Please run shape inference on the onnx model first. Details can be found in https://onnxruntime.ai/docs/execution-providers/TensorRT-ExecutionProvider.html#shape-inference-for-tensorrt-subgraphs

I'm exec'd into a running NVIDIA TensorRT Docker container.

Expected behavior

I get a running inference session using the TensorRT execution provider.

sam-h-bean avatar Jul 13 '22 00:07 sam-h-bean
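For reference, the second error above points at missing shape information. A minimal sketch of the shape-inference step it suggests, using the symbolic shape inference tool that ships with onnxruntime (file names taken from the reproduction above; this is an assumption-laden sketch, not a confirmed fix for the TensorRT failure):

import onnx
from onnxruntime.tools.symbolic_shape_infer import SymbolicShapeInference

# Run symbolic shape inference on the quantized export so every tensor in the
# graph carries a shape, then save the result for the TensorRT EP to load.
model = onnx.load("model-quantized.onnx")
inferred = SymbolicShapeInference.infer_shapes(model, auto_merge=True)
onnx.save(inferred, "model-quantized-shaped.onnx")

The same tool also has a CLI entry point (python -m onnxruntime.tools.symbolic_shape_infer), as described in the TensorRT EP documentation linked from the error message.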

@sam-h-bean is the error reproducible using only onnxruntime and InferenceSession? If so, it might make more sense to open an issue in the onnxruntime repo and link to this one.

philschmid avatar Jul 13 '22 05:07 philschmid
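A minimal sketch of that isolation check, loading the quantized export with onnxruntime alone (same file name as the reproduction script):

import onnxruntime as ort

# Build a session directly with the TensorRT EP (falling back to CUDA/CPU) to
# see whether the failure reproduces without optimum in the loop.
sess = ort.InferenceSession(
    "model-quantized.onnx",
    providers=["TensorrtExecutionProvider", "CUDAExecutionProvider", "CPUExecutionProvider"],
)
print(sess.get_providers())  # shows which providers were actually registered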

@philschmid I have raised the concerns in the issue above

sam-h-bean avatar Jul 13 '22 16:07 sam-h-bean

@philschmid I am also tracking this in https://github.com/microsoft/onnxruntime/issues/12133 and https://github.com/microsoft/onnxruntime/issues/12173, but it is becoming unclear whether the issue truly lies there or whether Optimum is creating a quantized model that cannot be used by TensorRT when the TensorRT QuantizationConfig is used.

sam-h-bean avatar Jul 14 '22 02:07 sam-h-bean
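One way to narrow that down is to inspect the quantized graph directly: TensorRT's quantization support is built around QuantizeLinear/DequantizeLinear (QDQ) nodes, whereas dynamic quantization emits operators such as MatMulInteger that TensorRT generally cannot run. A rough inspection sketch (assumes the onnx package is installed; file name from the reproduction above):

import onnx
from collections import Counter

# Count operator types in the quantized graph. QuantizeLinear/DequantizeLinear
# suggest QDQ-format quantization; MatMulInteger/DynamicQuantizeLinear suggest
# dynamic quantization that the TensorRT EP will not place on TensorRT.
model = onnx.load("model-quantized.onnx")
ops = Counter(node.op_type for node in model.graph.node)
print({op: n for op, n in ops.items() if "Quant" in op or "Integer" in op})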

@philschmid I am also seeing this weird behavior: https://github.com/NVIDIA/TensorRT/issues/2146. I thought it was an oddity of TensorRT, but the same thing seems to happen when I use your suggested way of putting the pipeline on the GPU. If I already have a model on the GPU, the second model is happy with <16 GB and runs fine. But if the GPU is empty and I put the model on it, it consumes all available memory.

sam-h-bean avatar Jul 14 '22 17:07 sam-h-bean
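On the memory side, the TensorRT EP accepts provider options when the session is created, so its engine-build workspace can be capped instead of letting it claim whatever GPU memory is free. A sketch (option name per the onnxruntime TensorRT EP documentation; not verified to change the behaviour described above):

import onnxruntime as ort

# Create the session with an explicit cap on the TensorRT engine workspace.
sess = ort.InferenceSession(
    "model-quantized.onnx",
    providers=["TensorrtExecutionProvider"],
    provider_options=[{"trt_max_workspace_size": 2 * 1024 * 1024 * 1024}],  # 2 GB
)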

After the seq2seq model is converted to ONNX, there are three files. When I load it onto the GPU, I get the error 'CUDA_ERROR_OUT_OF_MEMORY: out of memory'. How can I solve this problem?

Amy234543 avatar Aug 03 '22 09:08 Amy234543

What model are you trying to load @Amy234543? And can you provide the sizes of the generated onnx files?

NouamaneTazi avatar Aug 03 '22 13:08 NouamaneTazi

What model are you trying to load @Amy234543? And can you provide the sizes of the generated onnx files?

m2m100_418M. After pruning, the model is 1.4 GB; the generated encoder_model.onnx is 888 MB, decoder_model.onnx is 1.48 GB, and decoder_with_past_model.onnx is 1.41 GB. @NouamaneTazi

Amy234543 avatar Aug 04 '22 09:08 Amy234543

Seems like reasonable sizes. Can you provide a script to reproduce the issue? @Amy234543

NouamaneTazi avatar Aug 04 '22 11:08 NouamaneTazi

Seems like reasonable sizes. Can you provide a script to reproduce the issue? @Amy234543

@NouamaneTazi

import torch
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer
from optimum.onnxruntime import ORTModelForSeq2SeqLM

cuda_idx = 0
model_path = './checkpoint/'
device = torch.device(f'cuda:{cuda_idx}')
tokenizer = M2M100Tokenizer.from_pretrained('./m2m100_418M')
tokenizer.src_lang = 'zh'
model = ORTModelForSeq2SeqLM.from_pretrained(model_path)
model.to(device)

The line model.to(device) raises this error: "onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Exception during initialization: /onnxruntime_src/onnxruntime/core/framework/bfc_arena.cc:342 void* onnxruntime::BFCArena::AllocateRawInternal(size_t, bool) Failed to allocate memory for requested buffer of size 524746752"

Sometimes it raises this error instead: 'onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Exception during initialization: /onnxruntime_src/onnxruntime/core/providers/cuda/cuda_call.cc:121 bool onnxruntime::CudaCall(ERRTYPE, const char*, const char*, ERRTYPE, const char*) [with ERRTYPE = cudnnStatus_t; bool THRW = true] /onnxruntime_src/onnxruntime/core/providers/cuda/cuda_call.cc:115 bool onnxruntime::CudaCall(ERRTYPE, const char*, const char*, ERRTYPE, const char*) [with ERRTYPE = cudnnStatus_t; bool THRW = true] CUDNN failure 1: CUDNN_STATUS_NOT_INITIALIZED ; GPU=0 ; hostname=container-729d1191ae-744a96c1 ; expr=cudnnCreate(&cudnn_handle_);'

I have onnxruntime-gpu installed.

Amy234543 avatar Aug 05 '22 01:08 Amy234543
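As a diagnostic for the out-of-memory errors above, each exported file can be loaded on its own with onnxruntime while capping the CUDA EP's memory arena; gpu_mem_limit and arena_extend_strategy are documented CUDAExecutionProvider options. This is only a sketch of that check, not a confirmed fix:

import onnxruntime as ort

# Load the largest export by itself with a bounded CUDA memory arena to see
# how much GPU memory a single decoder session actually needs.
sess = ort.InferenceSession(
    "decoder_model.onnx",
    providers=["CUDAExecutionProvider"],
    provider_options=[{
        "device_id": 0,
        "gpu_mem_limit": 4 * 1024 * 1024 * 1024,      # 4 GB cap on the arena
        "arena_extend_strategy": "kSameAsRequested",  # grow only as requested
    }],
)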