Trying to Load Model Quantized for TensorRT Fails
System Info
- Building from source (note: there is a bug, so you will have to build from #286)
- Running inside the `nvcr.io/nvidia/tensorrt:21.07-py3` Docker container
Who can help?
@philschmid @JingyaHuang
Information
- [ ] The official example scripts
- [X] My own modified scripts
Tasks
- [X] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
```python
from optimum.onnxruntime import ORTQuantizer
from optimum.onnxruntime.modeling_ort import ORTModelForCausalLM
from optimum.onnxruntime.configuration import AutoQuantizationConfig
from transformers import AutoTokenizer, AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained('distilgpt2')
tokenizer = AutoTokenizer.from_pretrained('distilgpt2', padding=True, truncation=True)

qconfig = AutoQuantizationConfig.tensorrt(is_static=False, per_channel=False)
quantizer = ORTQuantizer(model=model, preprocessor=tokenizer, feature="causal-lm")
quantizer.export(
    onnx_model_path="model.onnx",
    onnx_quantized_model_output_path="model-quantized.onnx",
    quantization_config=qconfig,
)

tensorrt_model = ORTModelForCausalLM.load_model("./model-quantized.onnx", provider="TensorrtExecutionProvider")
```
This results in an error like:

```
RuntimeError: /onnxruntime_src/onnxruntime/core/session/provider_bridge_ort.cc:1025 onnxruntime::Provider& onnxruntime::ProviderLibrary::Get() [ONNXRuntimeError] : 1 : FAIL : Failed to load library libonnxruntime_providers_tensorrt.so with error: /usr/local/lib/python3.8/dist-packages/onnxruntime/capi/libonnxruntime_providers_tensorrt.so: undefined symbol: getBuilderPluginRegistry
```
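A plausible first check (my suggestion, not something confirmed in this thread): an `undefined symbol` failure when loading the TensorRT provider library usually means the `onnxruntime-gpu` wheel was built against a different TensorRT version than the one shipped in the container, so comparing the two versions is worth doing before anything else:

```python
# Compare the container's TensorRT version against the installed onnxruntime;
# a mismatch is a common cause of "undefined symbol" provider-load failures.
import tensorrt
import onnxruntime

print("TensorRT:", tensorrt.__version__)
print("onnxruntime:", onnxruntime.__version__)
print("Available providers:", onnxruntime.get_available_providers())
```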
Using onnxruntime-gpu==1.11.1 I get this:

```
2022-07-13 04:36:35.784713840 [E:onnxruntime:, inference_session.cc:1587 operator()] Exception during initialization: /onnxruntime_src/onnxruntime/core/providers/tensorrt/tensorrt_execution_provider.cc:798 SubGraphCollection_t onnxruntime::TensorrtExecutionProvider::GetSupportedList(SubGraphCollection_t, int, int, const onnxruntime::GraphViewer&, bool*) const [ONNXRuntimeError] : 1 : FAIL : TensorRT input: input has no shape specified. Please run shape inference on the onnx model first. Details can be found in https://onnxruntime.ai/docs/execution-providers/TensorRT-ExecutionProvider.html#shape-inference-for-tensorrt-subgraphs
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.8/dist-packages/optimum/onnxruntime/modeling_ort.py", line 143, in load_model
    return ort.InferenceSession(path, providers=[provider])
  File "/usr/local/lib/python3.8/dist-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 335, in __init__
    self._create_inference_session(providers, provider_options, disabled_optimizers)
  File "/usr/local/lib/python3.8/dist-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 381, in _create_inference_session
    sess.initialize_session(providers, provider_options, disabled_optimizers)
onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Exception during initialization: /onnxruntime_src/onnxruntime/core/providers/tensorrt/tensorrt_execution_provider.cc:798 SubGraphCollection_t onnxruntime::TensorrtExecutionProvider::GetSupportedList(SubGraphCollection_t, int, int, const onnxruntime::GraphViewer&, bool*) const [ONNXRuntimeError] : 1 : FAIL : TensorRT input: input has no shape specified. Please run shape inference on the onnx model first. Details can be found in https://onnxruntime.ai/docs/execution-providers/TensorRT-ExecutionProvider.html#shape-inference-for-tensorrt-subgraphs
```
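This second failure is the one the message explains itself: the exported model is missing input shape information. Per the ONNX Runtime docs linked in the error, running symbolic shape inference over the model before handing it to the TensorRT provider may help; a minimal sketch (the output filename here is my own):

```python
# Run ONNX Runtime's symbolic shape inference, as the error message suggests,
# and save the result so the TensorRT provider sees fully-shaped inputs.
import onnx
from onnxruntime.tools.symbolic_shape_infer import SymbolicShapeInference

model = onnx.load("model-quantized.onnx")
inferred = SymbolicShapeInference.infer_shapes(model, auto_merge=True)
onnx.save(inferred, "model-quantized-shaped.onnx")
```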
I'm exec'd into a running NVIDIA TensorRT Docker container.
Expected behavior
I get a running inference session using the TensorRT execution provider.
@sam-h-bean is the error reproducible using only onnxruntime and `InferenceSession`? If so it might make more sense to open an issue in the onnxruntime repo and link to this one.
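For reference, a minimal reproduction using only onnxruntime (no optimum) might look like this, reusing the quantized model produced by the script above:

```python
# Load the quantized model with bare onnxruntime to isolate the failure from
# optimum; if this raises the same error, the bug lives in ORT/TensorRT.
import onnxruntime as ort

session = ort.InferenceSession(
    "model-quantized.onnx",
    providers=["TensorrtExecutionProvider"],
)
print(session.get_providers())
```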
@philschmid I have raised the concerns in the issue above
@philschmid I am also tracking this in https://github.com/microsoft/onnxruntime/issues/12133 and https://github.com/microsoft/onnxruntime/issues/12173, but it is becoming unclear whether the issue is truly there, or whether Optimum is creating a quantized model that cannot be used by TensorRT when using the TensorRT QuantizationConfig.
@philschmid I am also seeing this weird behavior: https://github.com/NVIDIA/TensorRT/issues/2146. I thought this was an oddity of TensorRT, but the same thing seems to happen when I use your suggested way of putting the pipeline on GPU. If a model is already on the GPU, a second model is happy with <16 GB and runs fine; but if I start from an empty GPU and put a model on it, it consumes all available memory.
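One way to test whether this is ONNX Runtime's memory arena over-allocating up front (my guess, not something confirmed in the thread) is to cap the CUDA provider's pool via provider options; a sketch, with the 8 GiB limit chosen arbitrarily:

```python
# Cap ORT's CUDA memory arena and make it grow only as requested, to see
# whether the "empty GPU consumes all memory" behaviour is arena growth.
import onnxruntime as ort

provider_options = {
    "device_id": 0,
    "gpu_mem_limit": 8 * 1024 * 1024 * 1024,      # 8 GiB, arbitrary cap
    "arena_extend_strategy": "kSameAsRequested",  # allocate only what is asked for
}
session = ort.InferenceSession(
    "model.onnx",
    providers=[("CUDAExecutionProvider", provider_options)],
)
```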
After the seq2seq model is converted to ONNX, there are three files. When I load it onto the GPU, I get the error `CUDA_ERROR_OUT_OF_MEMORY: out of memory`. How can I solve this problem?
What model are you trying to load @Amy234543? And can you provide the sizes of the generated onnx files?
> What model are you trying to load @Amy234543? And can you provide the sizes of the generated onnx files?

m2m100_418M. After pruning, the size of the model is 1.4 GB, and the generated encoder_model.onnx is 888 MB, decoder_model.onnx is 1.48 GB, and decoder_with_past_model.onnx is 1.41 GB. @NouamaneTazi
Seems like reasonable sizes. Can you provide a script to reproduce the issue? @Amy234543
> Seems like reasonable sizes. Can you provide a script to reproduce the issue? @Amy234543

@NouamaneTazi

```python
import torch
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer
from optimum.onnxruntime import ORTModelForSeq2SeqLM

cuda_idx = 0
model_path = './checkpoint/'
device = torch.device(f'cuda:{cuda_idx}')

tokenizer = M2M100Tokenizer.from_pretrained('./m2m100_418M')
tokenizer.src_lang = 'zh'

model = ORTModelForSeq2SeqLM.from_pretrained(model_path)
model.to(device)
```
The line `model.to(device)` raises this error:

```
onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Exception during initialization: /onnxruntime_src/onnxruntime/core/framework/bfc_arena.cc:342 void* onnxruntime::BFCArena::AllocateRawInternal(size_t, bool) Failed to allocate memory for requested buffer of size 524746752
```
Sometimes it raises this error instead:

```
onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Exception during initialization: /onnxruntime_src/onnxruntime/core/providers/cuda/cuda_call.cc:121 bool onnxruntime::CudaCall(ERRTYPE, const char*, const char*, ERRTYPE, const char*) [with ERRTYPE = cudnnStatus_t; bool THRW = true] /onnxruntime_src/onnxruntime/core/providers/cuda/cuda_call.cc:115 bool onnxruntime::CudaCall(ERRTYPE, const char*, const char*, ERRTYPE, const char*) [with ERRTYPE = cudnnStatus_t; bool THRW = true] CUDNN failure 1: CUDNN_STATUS_NOT_INITIALIZED ; GPU=0 ; hostname=container-729d1191ae-744a96c1 ; expr=cudnnCreate(&cudnn_handle_);
```
I have onnxruntime-gpu installed.
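Since `CUDNN_STATUS_NOT_INITIALIZED` often comes from a CUDA/cuDNN/onnxruntime-gpu version mismatch, or from the GPU already being out of memory when cuDNN initializes, rather than from the model itself, a quick sanity check may help (my suggestion, not from the thread):

```python
# Sanity-check that the GPU build of onnxruntime is active and that the
# CUDA execution provider was compiled in and is loadable.
import onnxruntime as ort

print(ort.__version__)
print(ort.get_device())               # should print "GPU"
print(ort.get_available_providers())  # should include "CUDAExecutionProvider"
```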