TensorRT
Accuracy loss of TensorRT 8.6 when running INT8 Quantized Resnet18 on GPU A4000
Description
When performing ResNet18 PTQ with TensorRT Model Optimizer (modelopt), I ran into the following issue when compiling the quantized model with TRT.
First off, I started with a pretrained resnet18 from torchvision and replaced the last fully connected layer to fit my dataset (for example, CIFAR-10). I also replaced all the skip connections (the residual "+") with an ElementwiseAdd module and defined its quantization layer myself (code attached at the end). The reason for this is to facilitate Q/DQ fusion so that every layer can run in INT8.
Then, when compiling the exported ONNX model with TRT, I found that the TRT output differs significantly both from the fake-Q/DQ model in Python and from the fake-Q/DQ ONNX model run with ONNX Runtime (np.allclose with a 1e-3 tolerance fails). Comparing the TRT output against the native model, the classification results disagree on ~2.3% of the test samples.
I discussed this with the TRT ModelOpt team in this issue, and they suggested filing a bug report here.
Environment
TensorRT Version: 8.6.1
NVIDIA GPU: A4000
NVIDIA Driver Version: 535.183.01
CUDA Version: 12.2
Python Version (if applicable): 3.10.1
PyTorch Version (if applicable): 2.4.0+cu124
Relevant Files
Model link: You can download the onnx model and the TRT engine here: https://file.io/GnuiEMNeebQ1
Steps To Reproduce
Run the TRT engine via the Python API and the ONNX model via ONNX Runtime on the CIFAR-10 test set using the following data loader, then compare the results (a sketch of the comparison loop follows the loader).
import torch
from torchvision import datasets, transforms

transform = transforms.ToTensor()  # placeholder; the exact preprocessing was not included above
testset = datasets.CIFAR10(root='./data', download=True, train=False, transform=transform)
testloader = torch.utils.data.DataLoader(testset, batch_size=1, shuffle=False)
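
Roughly, the comparison looks like this. Note that run_trt_engine is a hypothetical wrapper around the TensorRT Python inference API, and the ONNX input name "input" and the filename resnet18_qdq.onnx are placeholders for the linked model:

import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("resnet18_qdq.onnx", providers=["CPUExecutionProvider"])

close_failures = 0
label_disagreements = 0
for images, _ in testloader:
    x = images.numpy()
    ort_out = sess.run(None, {"input": x})[0]  # input name assumed
    trt_out = run_trt_engine(x)                # hypothetical wrapper around the TRT Python API
    if not np.allclose(ort_out, trt_out, atol=1e-3):
        close_failures += 1
    if ort_out.argmax() != trt_out.argmax():
        label_disagreements += 1
print(f"allclose failures: {close_failures}, label disagreements: {label_disagreements}")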
Have you tried the latest release?: I haven't tried TRT 10, and we don't plan to upgrade in the short term. I was under the impression that 8.6 should be fine.
Can this model run on other frameworks? For example run ONNX model with ONNXRuntime (polygraphy run <model.onnx> --onnxrt):
Yes. Running the Q/DQ ONNX model with ONNX Runtime gives ~1% classification disagreement against the native (fake-quantized PyTorch) model.
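
For a direct TRT-vs-ONNX-Runtime comparison on random inputs, a polygraphy invocation along these lines should reproduce the mismatch (the filename resnet18_qdq.onnx is a placeholder for the linked model):

polygraphy run resnet18_qdq.onnx --trt --onnxrt --atol 1e-3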
Appendix
Visualizing the TRT engine, it is completely in line with my expectations, with everything fused into INT8 kernels.
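
The quantized ElementwiseAdd mentioned in the description follows this pattern. This is a minimal sketch, assuming modelopt's TensorQuantizer with its default 8-bit config; the exact import path can vary across modelopt versions:

import torch.nn as nn
from modelopt.torch.quantization.nn import TensorQuantizer  # import path assumed; varies by version

class QuantElementwiseAdd(nn.Module):
    """Residual add with both operands quantized, so TRT can fuse the Q/DQ pairs into an INT8 add."""
    def __init__(self):
        super().__init__()
        # One fake-quantizer per operand (default config assumed to be 8-bit).
        self.input_quantizer_0 = TensorQuantizer()
        self.input_quantizer_1 = TensorQuantizer()

    def forward(self, x, residual):
        return self.input_quantizer_0(x) + self.input_quantizer_1(residual)

Quantizing both inputs of the add is what lets TRT keep the residual branch in INT8 instead of falling back to higher precision for the addition.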