TensorRT
TensorRT 8.6.2 MatrixMultiply Operator Quantization
I am performing QAT (quantization-aware training) on the HRNet OCR model and using TensorRT 8.6.2 to convert the resulting ONNX model, which contains QDQ (QuantizeLinear/DequantizeLinear) nodes. After conversion, I found that the MatrixMultiply operator was not quantized to INT8, as shown in the figure below.
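For reference, a minimal sketch of how a QDQ ONNX model can be built into an INT8 engine with the TensorRT Python API; the file names are placeholders and this is illustrative rather than the exact conversion script used here:

```python
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.INFO)

def build_int8_engine(onnx_path, engine_path):
    builder = trt.Builder(TRT_LOGGER)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    parser = trt.OnnxParser(network, TRT_LOGGER)

    # Parse the QAT ONNX model that already contains Q/DQ nodes.
    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            for i in range(parser.num_errors):
                print(parser.get_error(i))
            raise RuntimeError("ONNX parsing failed")

    config = builder.create_builder_config()
    # Explicit quantization: INT8 precision follows the Q/DQ nodes in the
    # graph, while layers without Q/DQ can fall back to FP16.
    config.set_flag(trt.BuilderFlag.INT8)
    config.set_flag(trt.BuilderFlag.FP16)

    serialized = builder.build_serialized_network(network, config)
    if serialized is None:
        raise RuntimeError("Engine build failed")
    with open(engine_path, "wb") as f:
        f.write(serialized)

# Placeholder file names.
build_int8_engine("hrnet_ocr_qat.onnx", "hrnet_ocr_qat_int8.engine")
```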
I then manually inserted QDQ operators on the two inputs of the MatrixMultiply, and after conversion the operator was successfully quantized to INT8.
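The insertion was done along the lines of the sketch below using onnx-graphsurgeon; the scale/zero-point values and file names are placeholders, not the calibrated values from QAT:

```python
import numpy as np
import onnx
import onnx_graphsurgeon as gs

graph = gs.import_onnx(onnx.load("hrnet_ocr_qat.onnx"))

for idx, node in enumerate(list(graph.nodes)):
    if node.op != "MatMul":
        continue
    for i, inp in enumerate(node.inputs):
        # In a real QAT graph one would skip inputs that already come from a
        # DequantizeLinear node and reuse the learned scales; the values here
        # are placeholders for illustration only.
        scale = gs.Constant(f"mm{idx}_in{i}_scale", np.array(0.05, dtype=np.float32))
        zero_point = gs.Constant(f"mm{idx}_in{i}_zp", np.array(0, dtype=np.int8))

        q_out = gs.Variable(f"mm{idx}_in{i}_q", dtype=np.int8)
        dq_out = gs.Variable(f"mm{idx}_in{i}_dq", dtype=np.float32)

        q = gs.Node("QuantizeLinear", inputs=[inp, scale, zero_point], outputs=[q_out])
        dq = gs.Node("DequantizeLinear", inputs=[q_out, scale, zero_point], outputs=[dq_out])
        graph.nodes.extend([q, dq])

        # Rewire the MatMul input to read from the new Q/DQ pair.
        node.inputs[i] = dq_out

graph.cleanup().toposort()
onnx.save(gs.export_onnx(graph), "hrnet_ocr_qat_matmul_qdq.onnx")
```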
However, this created a new problem: the INT8 version of MatrixMultiply takes more time than the original FP16 version. In the figure below, the first bar shows the FP16 execution time and the second bar shows the INT8 execution time.
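For reference, the per-layer times can be confirmed with the TensorRT Python profiler. The sketch below assumes a serialized engine with static input shapes and uses pycuda for device allocations; the engine file name and run count are placeholders:

```python
import numpy as np
import pycuda.autoinit  # noqa: F401  (creates a CUDA context)
import pycuda.driver as cuda
import tensorrt as trt

class LayerTimer(trt.IProfiler):
    """Accumulates per-layer execution time (ms) reported by TensorRT."""
    def __init__(self):
        super().__init__()
        self.times = {}

    def report_layer_time(self, layer_name, ms):
        self.times[layer_name] = self.times.get(layer_name, 0.0) + ms

logger = trt.Logger(trt.Logger.WARNING)
runtime = trt.Runtime(logger)
with open("hrnet_ocr_qat_int8.engine", "rb") as f:  # placeholder name
    engine = runtime.deserialize_cuda_engine(f.read())

context = engine.create_execution_context()
timer = LayerTimer()
context.profiler = timer

# Allocate one device buffer per I/O tensor (assumes static shapes).
buffers = []
for i in range(engine.num_io_tensors):
    name = engine.get_tensor_name(i)
    shape = tuple(engine.get_tensor_shape(name))
    dtype = trt.nptype(engine.get_tensor_dtype(name))
    buffers.append(cuda.mem_alloc(int(np.prod(shape)) * np.dtype(dtype).itemsize))
    context.set_tensor_address(name, int(buffers[-1]))

stream = cuda.Stream()
for _ in range(100):  # reported times are summed over all runs
    context.execute_async_v3(stream.handle)
stream.synchronize()

# Print the slowest layers first; the MatrixMultiply layers show up here.
for layer, ms in sorted(timer.times.items(), key=lambda kv: -kv[1]):
    print(f"{ms:9.3f} ms  {layer}")
```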
"Why is this the case?"
"Moreover, I found on the official website that MatrixMultiply does not support INT8. Why is it that after I manually inserted the QDQ nodes, it can be quantized to INT8?"