
PTQ support for ViT models

Open ruro opened this issue 1 year ago • 4 comments

Description

I am trying to figure out whether TensorRT and the pytorch_quantization module support post-training quantization for vision transformers.

The following piece of code follows the pytorch_quantization docs almost verbatim (with small changes for compatibility):

import torch
import timm
import torchvision
import pytorch_quantization.quant_modules

pytorch_quantization.quant_modules.initialize()

model = timm.create_model("vit_tiny_patch16_224", pretrained=False, num_classes=0)
# or
# model = torchvision.models.vit_b_16()
# model.heads = torch.nn.Identity()
model = model.eval()
data = torch.randn(1, 3, 224, 224)

# enable calibration and disable quantization while collecting statistics
for name, module in model.named_modules():
    if name.endswith("_quantizer"):
        module.enable_calib()
        module.disable_quant()

# run a calibration forward pass (random data here, purely for the repro)
model(data)

# load the collected amax values and switch the quantizers to quantized mode
for name, module in model.named_modules():
    if name.endswith("_quantizer"):
        module.load_calib_amax()
        module.disable_calib()
        module.enable_quant()

with pytorch_quantization.enable_onnx_export():
    torch.onnx.export(
        model,
        data,
        "timm_vit.onnx",
        opset_version=14,
        # opset_version=10, # scaled_dot_product_attention is not supported in opset 10
        # enable_onnx_checker=False, # unexpected keyword argument
    ) 
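
(As a quick sanity check, not part of the original script, one could count the Q/DQ nodes that ended up in the exported ONNX graph, e.g. with the onnx package; a minimal sketch:)

# sanity check (not in the original repro): count the exported Q/DQ nodes
import onnx
from collections import Counter

ops = Counter(node.op_type for node in onnx.load("timm_vit.onnx").graph.node)
print("QuantizeLinear:", ops["QuantizeLinear"], "DequantizeLinear:", ops["DequantizeLinear"])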

After that, I visualize the resulting engine graph with trex:

trex process timm_vit.onnx results

The conversion succeeds; however, the graph barely uses any INT8 operations. I would have expected almost the whole graph to consist of INT8 operators, but instead most edges in the graph are labeled as Float, with only a few Int8 edges.

Is this expected? My understanding was that most operators in transformers were supposed to be quantizable (with the notable exception of LayerNorm and Softmax, which would require special custom layers for quantization).

Relevant Files

vit_tiny_patch16_224 (timm)

timm_vit onnx engine graph json

vit_b_16 (torchvision)

vision_vit onnx engine graph json

Environment

TensorRT Version: 10.0.0.6

NVIDIA GPU: NVIDIA RTX A6000

NVIDIA Driver Version: 535.171.04

CUDA Version: 12.2

CUDNN Version: 8

Operating System: Ubuntu 22.04

Python Version (if applicable): 3.10.12

PyTorch Version (if applicable): 2.3.0

Baremetal or Container (if so, version): nvidia/cuda:12.1.1-cudnn8-devel-ubuntu22.04 docker container

ruro avatar Jul 11 '24 16:07 ruro

If you build the network with the native TRT API, you can run trtexec --best --onnx=fp32.onnx --dumpLayerInfo --exportLayerInfo=layer.log; the resulting layer.log gives some useful per-layer info.
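
A rough sketch of how one could tally per-layer output precisions from that file (the "Layers", "Outputs", and "Format/Datatype" keys are assumptions about the exportLayerInfo JSON layout and may differ between TensorRT versions):

# rough tally of per-layer output precisions from the trtexec layer info file;
# key names are assumptions and may differ between TensorRT versions
import json
from collections import Counter

with open("layer.log") as f:
    info = json.load(f)

precisions = Counter(
    out.get("Format/Datatype", "unknown")
    for layer in info.get("Layers", [])
    if isinstance(layer, dict)
    for out in layer.get("Outputs", [])
)
print(precisions)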

lix19937 avatar Jul 14 '24 00:07 lix19937

Also, you can follow this sample: https://github.com/NVIDIA/TensorRT-Model-Optimizer/tree/main/onnx_ptq

lix19937 avatar Jul 14 '24 05:07 lix19937

Hi, sorry for the delayed answer. Here are the layer logs for the two models (with --best and with --int8), although the contents of those files don't look particularly useful to me.

vision_vit_best.log vision_vit_int8.log

timm_vit_best.log timm_vit_int8.log

Also, I am using the simplest stock ViT models (see the reproduction script in the original post), so you should theoretically be able to reproduce my results and get any extra debugging information you need.


Regarding TensorRT-Model-Optimizer, I'll try it, but the current situation is honestly quite annoying. There are too many supposedly "official" (or at least endorsed) ways to do the same thing, and most of them either don't work at all or produce suboptimal results (and they often don't give easily interpretable outputs that could be used to verify that they are doing the right thing).

Here's a non-exhaustive list of supposedly "official" (endorsed by either PyTorch or TensorRT) quantization methods that support post-training quantization of PyTorch models for TensorRT inference:

  • https://pytorch.org/docs/stable/quantization.html
  • https://pytorch.org/TensorRT/tutorials/ptq.html
  • https://docs.nvidia.com/deeplearning/tensorrt/pytorch-quantization-toolkit/docs/index.html#post-training-quantization
  • https://nvidia.github.io/TensorRT-Model-Optimizer/guides/_pytorch_quantization.html#apply-post-training-quantization-ptq

I think that TensorRT and PyTorch could benefit from concentrating their efforts on a single project instead of duplicating the development efforts.

ruro avatar Jul 22 '24 16:07 ruro

I think you can refer to https://github.com/NVIDIA/TensorRT/tree/release/10.2/demo/BERT

lix19937 avatar Aug 07 '24 15:08 lix19937

I think that TensorRT and PyTorch could benefit from concentrating their efforts on a single project instead of duplicating the development efforts.

PyTorch-Quantization is a toolkit for training and evaluating PyTorch models with simulated quantization. Quantization can be added to the model automatically or manually, allowing the model to be tuned for accuracy and performance. Quantization is compatible with NVIDIA's high-performance integer kernels, which leverage integer Tensor Cores. The quantized model can be exported to ONNX and imported by TensorRT 8.0 and later.
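
For instance, a minimal sketch of the "manual" route using the toolkit's quant_nn modules (the layer sizes here are arbitrary, chosen only for illustration):

import torch.nn as nn
from pytorch_quantization import nn as quant_nn

# QuantLinear is a drop-in replacement for nn.Linear that attaches
# input and weight TensorQuantizers (hence the "_quantizer" module names
# matched in the reproduction script above)
qlinear = quant_nn.QuantLinear(192, 192)
print(type(qlinear._input_quantizer), type(qlinear._weight_quantizer))
# calibration (as in the script above) is still needed before running
# the layer with quantization enabled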

PyTorch AO (Eager Mode Quantization, FX Graph Mode Quantization): at a lower level, PyTorch provides a way to represent quantized tensors and perform operations with them. They can be used to directly construct models that perform all or part of the computation in lower precision. Higher-level APIs are provided that incorporate the typical workflow of converting an FP32 model to lower precision with minimal accuracy loss.
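
A rough sketch of the FX Graph Mode path, using the standard torch.ao.quantization API (note this targets PyTorch's own INT8 kernels rather than TensorRT):

import torch
from torch.ao.quantization import get_default_qconfig_mapping
from torch.ao.quantization.quantize_fx import prepare_fx, convert_fx

# toy model, stands in for a real network
float_model = torch.nn.Sequential(torch.nn.Linear(16, 16), torch.nn.ReLU()).eval()
example_inputs = (torch.randn(1, 16),)

qconfig_mapping = get_default_qconfig_mapping("x86")
prepared = prepare_fx(float_model, qconfig_mapping, example_inputs)
prepared(*example_inputs)          # calibration pass with sample data
quantized = convert_fx(prepared)   # INT8 model using PyTorch's quantized kernels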

Also, if you use PyTorch QAT to produce a quant.onnx, it is not supported by trtexec (the build fails).

lix19937 avatar Dec 17 '24 08:12 lix19937

@poweiw sorry, but why did you close this?

Unless I am misremembering, all of the suggested links use alternative quantization frameworks instead of the TensorRT pytorch_quantization module.

So the original issue is still relevant. Does the TensorRT pytorch_quantization module support PTQ of ViT models?

If it does, please reopen this issue. If it doesn't, please consider documenting this fact somewhere in its documentation.

Thanks.

ruro avatar Feb 11 '25 20:02 ruro

Might have misread. Thanks for the response. Let me check and get back to you.

poweiw avatar Feb 11 '25 21:02 poweiw

Also, pytorch_quantization will not receive further development as stated here. TensorRT-Model-Optimizer is now the encouraged path.

poweiw avatar Feb 11 '25 22:02 poweiw

@ruro you should use Model Optimizer (MO) as @poweiw suggested and then examine the generated ONNX file. You should see Q/DQ operations in "strategic locations", i.e. places where MO thinks they will help performance while keeping accuracy the same. These decisions are based on heuristics, so you may need to make changes (due to accuracy loss or performance loss). If you still have issues, then share the original ONNX file and the final ONNX file.
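
A minimal sketch of that flow, assuming the modelopt.torch.quantization API and the INT8_DEFAULT_CFG config described in the ModelOpt docs (names and the export path may vary between releases):

import torch
import timm
import modelopt.torch.quantization as mtq

model = timm.create_model("vit_tiny_patch16_224", pretrained=False, num_classes=0).eval()
calib_data = [torch.randn(1, 3, 224, 224) for _ in range(8)]  # real data in practice

def forward_loop(m):
    # feed calibration data through the model so ModelOpt can collect ranges
    for batch in calib_data:
        m(batch)

# insert quantizers, calibrate, and return the fake-quantized model
model = mtq.quantize(model, mtq.INT8_DEFAULT_CFG, forward_loop)

# the quantized model can then be exported to ONNX (see the ModelOpt docs for
# the recommended export path) and its Q/DQ placement inspected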

nzmora-nvidia avatar Feb 13 '25 12:02 nzmora-nvidia