ONNX with TensorRT Optimization (ORT-TRT) Warmup
I have an ONNX model that I converted using the symbolic_shape_infer.py script described here, following the TensorRT documentation here.
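For context, the conversion was a command along these lines (file names here are placeholders, and --auto_merge is optional):

python symbolic_shape_infer.py --input model.onnx --output model_with_shapes.onnx --auto_merge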
I then added the snippet below to the model configuration to enable the ONNX Runtime TensorRT optimization (ORT-TRT).
optimization {
  execution_accelerators {
    gpu_execution_accelerator : [
      {
        name : "tensorrt"
        parameters { key: "precision_mode" value: "FP16" }
        parameters { key: "max_workspace_size_bytes" value: "1073741824" }
      }
    ]
  }
}
The model is hosted in Triton and I can send inference requests. The first request takes around 414.126 s, while subsequent requests take 0.2675 s. To mitigate this I added warmup requests to the config file.
model_warmup [
  {
    name: "warmup 1"
    batch_size: 8
    inputs: {
      key: "input"
      value: {
        data_type: TYPE_FP32
        dims: [ 3, 512, 512 ]
        random_data: true
      }
    }
  }
]
The model input is as follows:
input [
  {
    name: "input"
    data_type: TYPE_FP32
    dims: [ 3, -1, -1 ]
  }
]
After I add the warmup, the model fails to load into Triton with the following error:
[E:onnxruntime:log, tensorrt_execution_provider.h:51 log] [2022-04-26 21:11:31 ERROR] 1: [myelinRemoveEmptyTensors.cpp::removeEmptyProducersFromSubgraph::195] Error Code 1: Internal Error (not implemented)
I have three questions regarding this setup:
- Any suggestions on configuring warmup for this model to mitigate the first-request wait time would help!
- The TensorRT optimization converts the ONNX model to FP16 precision. Why does the model input remain FP32? I was able to send a request and get a response with FP32 input.
- I would also like to try INT8 quantization in the optimization step above, but I get the following error:
[E:onnxruntime:log, tensorrt_execution_provider.h:51 log] [2022-04-26 19:48:19 ERROR] 4: [standardEngineBuilder.cpp::initCalibrationParams::1402] Error Code 4: Internal Error (Calibration failure occurred with no scaling factors detected. This could be due to no int8 calibrator or insufficient custom scales for network layers. Please see int8 sample to setup calibration correctly.)
Any updates on this? Any guidance is greatly appreciated!!!
Without any additional information, one possibility is that your model is data sensitive and thus wouldn't work well with random_data. Please try providing realistic data via input_data_file and see if that resolves the issue.
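As a rough sketch (assuming the tensor data is written as a raw binary file placed in the model's warmup directory; check the model configuration docs for the exact expectations), the warmup entry would swap random_data for input_data_file:

model_warmup [
  {
    name: "warmup 1"
    batch_size: 8
    inputs: {
      key: "input"
      value: {
        data_type: TYPE_FP32
        dims: [ 3, 512, 512 ]
        # "input_sample" is a placeholder name for a file containing raw FP32 tensor bytes
        input_data_file: "input_sample"
      }
    }
  }
]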
> The TensorRT optimization converts the ONNX model to FP16 precision. Why does the model input remain FP32? I was able to send a request and get a response with FP32 input.
The model is converted to FP16 precision, not its I/O. The inputs and outputs retain their original datatypes.
> I would also like to try INT8 quantization in the optimization step above, but I get the following error.
Have you read this? You might have to provide a calibration table for INT8 precision to work. You can read more about it here.
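As a minimal sketch only, assuming the ONNX Runtime backend exposes the TensorRT execution provider's INT8 options as accelerator parameters (the int8_* keys and the calibration file name below are assumptions, not confirmed in this thread; please verify the supported parameter names in the onnxruntime_backend documentation), the optimization block could look like:

optimization {
  execution_accelerators {
    gpu_execution_accelerator : [
      {
        name : "tensorrt"
        parameters { key: "precision_mode" value: "INT8" }
        parameters { key: "max_workspace_size_bytes" value: "1073741824" }
        # Assumed parameter keys for pointing ORT-TRT at a calibration table -- verify before use
        parameters { key: "int8_calibration_table_name" value: "calibration.flatbuffers" }
        parameters { key: "int8_use_native_calibration_table" value: "false" }
      }
    ]
  }
}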
Closing the issue due to lack of activity. Please re-open it if you would like to follow up on this.