
INT8 inference in TensorRT 8.0 gets the wrong answer

Open Ricardosuzaku opened this issue 3 years ago • 33 comments

I want to convert a PyTorch model to TensorRT for INT8 inference, so I go pytorch model -> onnx model -> trt engine. In TensorRT 7.2.2.3 I succeeded: with the old API I set fp16 and int8 mode as builder.fp16_mode=True and builder.int8_mode=True, in int8 mode I feed test data to calibrate, and in the end I build an fp32 engine, an fp16 engine and an int8 engine and get the right accuracy in all three modes.

Now I want to apply a QAT model to TensorRT, so I updated PyTorch to 1.8.0, TensorRT to 8.0, CUDA to 10.2.89 and cuDNN to 8.2.0. First I do INT8 inference in TensorRT as above, but the old way of setting fp16 and int8 mode can no longer be used, so I use config.set_flag(trt.BuilderFlag.FP16) and config.set_flag(trt.BuilderFlag.INT8). I could not find a PyTorch INT8 inference sample, so I followed the int8_caffe_mnist sample (https://github.com/NVIDIA/TensorRT/blob/master/samples/python/int8_caffe_mnist) to calibrate my test data with the IInt8EntropyCalibrator2 method and set config.int8_calibrator = calib. I build the trt engine successfully and get the right accuracy in fp32 and fp16 mode, but the wrong accuracy in int8 mode: 0.01 acc for 100-way classification. I checked the engine output and found the matrix values are almost all around 0.002.
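
For comparison, the flag setup in the two API generations looks roughly like this (a minimal sketch, not my full script; builder, config and calib are assumed to already exist):

    # TensorRT 7.x (deprecated builder attributes)
    builder.fp16_mode = True
    builder.int8_mode = True
    builder.int8_calibrator = calib

    # TensorRT 8.x (IBuilderConfig flags)
    config.set_flag(trt.BuilderFlag.FP16)
    config.set_flag(trt.BuilderFlag.INT8)
    config.int8_calibrator = calib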

Then I use the pytorch_quantization toolkit and do PTQ: collect_stats(model, data_loader, num_batches=2) and compute_amax(model, method="percentile", percentile=99.99). I export the model to ONNX, parse it, and build an int8 engine for inference, and I also get 0.01 accuracy. The only difference between int8 mode and fp16 mode is config.set_flag and config.int8_calibrator in int8 mode. Why do I get the right accuracy in fp16 mode but 0.01 acc in int8 mode?
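
For reference, the PTQ-then-export flow is roughly the following (a condensed sketch based on the pytorch_quantization examples, not my exact script; model, data_loader and the paths are placeholders):

    import torch
    from pytorch_quantization import nn as quant_nn

    # Calibrate the quantizers, then freeze the computed amax values
    with torch.no_grad():
        collect_stats(model, data_loader, num_batches=2)
        compute_amax(model, method="percentile", percentile=99.99)

    # Export fake-quantized (Q/DQ) ONNX; opset >= 13 is needed for per-channel Q/DQ
    quant_nn.TensorQuantizer.use_fb_fake_quant = True
    dummy = torch.randn(32, 3, 224, 224, device="cuda")
    torch.onnx.export(model, dummy, "model_qdq.onnx", opset_version=13)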

I don't know what the problem is. The calibration method? The calibration batch? @ttyio I'd appreciate your reply, thanks a lot!

Environment: TensorRT Version: 8.0.0.3 NVIDIA GPU:Tesla P40 NVIDIA Driver Version: CUDA Version: 10.2.89 CUDNN Version: 8.2.0 Operating System: Ubuntu 19.10

Ricardosuzaku avatar Jun 03 '21 09:06 Ricardosuzaku

Hello @Ricardosuzaku , the steps are correct. What's the accuracy when you run the pytorch_quantization toolkit? Thanks

ttyio avatar Jun 07 '21 12:06 ttyio

Hi @ttyio, thanks for your reply. I did PTQ and QAT following https://github.com/NVIDIA/TensorRT/tree/master/tools/pytorch-quantization/examples. My model is a DenseNet: I create the torch model, load my weights, do PTQ and then QAT with pytorch_quantization. Running the PyTorch model after QAT I get the expected acc (85.28%, old model 85.58%). But when I convert it to an ONNX model, build an int8 trt engine for inference (it seems that because of QuantConv2d I can't build an fp32 engine) and do calibration, I get the wrong 0.01 acc. I printed the output matrix for a test image and the mean is about 0.002.

Ricardosuzaku avatar Jun 08 '21 01:06 Ricardosuzaku

@Ricardosuzaku , it seems like there is an implementation issue. Could you share the ONNX file? Thanks

ttyio avatar Jun 08 '21 02:06 ttyio

@ttyio The main code is here, and I have sent the ONNX file to your email. Thanks a lot!

import os

import numpy as np
import pycuda.autoinit  # initializes the CUDA context for pycuda
import pycuda.driver as cuda
import tensorrt as trt
import torch
from torch.autograd import Variable
from tqdm import tqdm

from pytorch_quantization import nn as quant_nn
from pytorch_quantization import calib
from pytorch_quantization.tensor_quant import QuantDescriptor
from pytorch_quantization import quant_modules

# densenet, MiniImageNet, train_loader and the TRT helpers (allocate_buffers, load_engine,
# postprocess_the_outputs, _topk_hit, test_model_gpu) come from my own project code.

quant_desc_input = QuantDescriptor(calib_method='histogram')
quant_nn.QuantConv2d.set_default_quant_desc_input(quant_desc_input)
quant_nn.QuantLinear.set_default_quant_desc_input(quant_desc_input)
quant_modules.initialize()

def collect_stats(model, data_loader, num_batches):
    """Feed data to the network and collect statistic"""
    # Enable calibrators
    for name, module in model.named_modules():
        if isinstance(module, quant_nn.TensorQuantizer):
            # print(name, module._calibrator)
            if module._calibrator is not None:
                module.disable_quant()
                module.enable_calib()
            else:
                module.disable()
    for i, (image, _) in tqdm(enumerate(data_loader), total=num_batches):
        model(image.cuda())
        if i >= num_batches:
            break
    # Disable calibrators
    for name, module in model.named_modules():
        if isinstance(module, quant_nn.TensorQuantizer):
            if module._calibrator is not None:
                module.enable_quant()
                module.disable_calib()
            else:
                module.enable()

def compute_amax(model, **kwargs):
    # Load calib result
    for name, module in model.named_modules():
        if isinstance(module, quant_nn.TensorQuantizer):
            if module._calibrator is not None:
                if isinstance(module._calibrator, calib.MaxCalibrator):
                    module.load_calib_amax()
                else:
                    module.load_calib_amax(**kwargs)
    #             print(F"{name:40}: {module}")
    model.cuda()

def test_model_accuracy(fp16_mode=False, int8_mode=False, engine_path=None):
    if engine_path:
        engine = load_engine(engine_path)
    else:
        if int8_mode:
            test_data = MiniImageNet('test')
            dataset = torch.utils.data.DataLoader(test_data, batch_size=_val_batch_size, shuffle=True,
                                                  num_workers=_num_workers, pin_memory=True)
            max_batch_for_calibration = 32
            transform = None
            img_size = (3, 224, 224)
            calibration_stream = ImageBatchStreamDemo(dataset, transform, max_batch_for_calibration, img_size)
            # Renamed from `calib` to avoid shadowing the pytorch_quantization.calib module imported above
            calibrator = EntropyCalibrator(dataset, max_batches=2)
            engine = build_engine(onnx_file_path=onnx_file_path, engine_file_path=trt_engine_path, fp16_mode=fp16_mode, int8_mode=int8_mode, calib=calibrator, save_engine=True)
        else:
            engine = build_engine(onnx_file_path=onnx_file_path, engine_file_path=trt_engine_path, fp16_mode=fp16_mode, int8_mode=int8_mode, save_engine=True)
    context = engine.create_execution_context()
    inputs, outputs, bindings, stream = allocate_buffers(engine)  # input, output: host # bindings
    acc_info = test_trt_model_trt(context, bindings, inputs, outputs, stream)
    print('trt accuracy', acc_info)

def test_trt_model_trt(context, bindings, inputs, outputs, stream):
    test_data = MiniImageNet('test')
    test_loader = torch.utils.data.DataLoader(
        test_data,
        batch_size=_val_batch_size, shuffle=True,
        num_workers=_num_workers, pin_memory=True)
    device = 'cuda'
    return validate_trt(test_loader, context, bindings, inputs, outputs, stream)

def validate_trt(val_loader, context, bindings, inputs, outputs, stream):
    top1_hits = []
    top5_hits = []
    for images, target in tqdm(val_loader, leave=True, desc='val progress'):
        target = target.to(device)
        inputs[0].host = images.numpy().reshape(-1)
        trt_outputs = do_inference(context, bindings=bindings, inputs=inputs, outputs=outputs, stream=stream)
        output = postprocess_the_outputs(trt_outputs[0][:images.shape[0] * 100], (images.shape[0], 100))
        output = torch.from_numpy(output).to(device)
        for ins_output, ins_target in zip(output, target):
            top1hit, top5hit = _topk_hit(ins_output, ins_target, topk=(1, 5))
            top1_hits.append(top1hit)
            top5_hits.append(top5hit)
    top1_prec = len([hit for hit in top1_hits if hit]) / len(top1_hits)
    top5_prec = len([hit for hit in top5_hits if hit]) / len(top5_hits)
    return {
        'top1_prec': top1_prec,
        'top5_prec': top5_prec
    }

def build_engine(onnx_file_path="", engine_file_path="", fp16_mode=False, int8_mode=False,
                 save_engine=False, calib=None, TRT_LOGGER=trt.Logger()):
    """Takes an ONNX file and creates a TensorRT engine to run inference with"""
    # 1 == 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
    with trt.Builder(TRT_LOGGER) as builder, builder.create_network(1) as network, \
            builder.create_builder_config() as config, \
            trt.OnnxParser(network, TRT_LOGGER) as parser:

        config.max_workspace_size = 1 << 32  # 4 GiB
        builder.max_batch_size = 1  # ignored for explicit-batch networks
        if fp16_mode:
            config.set_flag(trt.BuilderFlag.FP16)
        elif int8_mode:
            config.set_flag(trt.BuilderFlag.INT8)
            config.int8_calibrator = calib
        else:
            pass
            # config.set_flag(trt.BuilderFlag.SPARSE_WEIGHTS)
        # Parse model file
        with open(onnx_file_path, 'rb') as model:
            print('Beginning ONNX file parsing')
            if not parser.parse(model.read()):
                print('ERROR: Failed to parse the ONNX file.')
                for error in range(parser.num_errors):
                    print(parser.get_error(error))
                return None
        # The ONNX model was exported with batch size 32; pin the input shape to match
        network.get_input(0).shape = [32, 3, 224, 224]
        print('Completed parsing of ONNX file')
        print('Building an engine from file {}; this may take a while...'.format(onnx_file_path))
        engine = builder.build_engine(network, config)
        print("Completed creating Engine")
        if save_engine:
            with open(engine_file_path, "wb") as f:
                f.write(engine.serialize())
        return engine

def do_inference(context, bindings, inputs, outputs, stream, batch_size=1):
    [cuda.memcpy_htod_async(inp.device, inp.host, stream) for inp in inputs]
    context.execute_async(batch_size=batch_size, bindings=bindings, stream_handle=stream.handle)
    [cuda.memcpy_dtoh_async(out.host, out.device, stream) for out in outputs]
    stream.synchronize()
    return [out.host for out in outputs]

class EntropyCalibrator(trt.IInt8EntropyCalibrator2):
    def __init__(self, dataset, cache_file=" ", batch_size=32, max_batches=2):
        # Whenever you specify a custom constructor for a TensorRT class,
        # you MUST call the constructor of the parent explicitly.
        trt.IInt8EntropyCalibrator2.__init__(self)
        self.cache_file = cache_file
        self.max_batches = max_batches
        # Every time get_batch is called, the next batch of size batch_size will be copied to the device and returned.
        self.dataset = dataset
        self.batch_size = batch_size
        self.current_index = 0
        self.batch_count = 0
        # Allocate enough host memory for all calibration batches.
        # TensorRT expects float32 input, so don't let numpy default to float64 here.
        self.data = np.zeros((max_batches, batch_size, 3, 224, 224), dtype=np.float32)
        for k, (images, targets) in enumerate(self.dataset):
            if k >= self.max_batches: break
            self.data[k] = images.numpy()
        self.device_input = cuda.mem_alloc(self.data[0].nbytes)

    def get_batch_size(self):
        return self.batch_size

    def get_batch(self, names):
        if self.batch_count < self.max_batches:
            batch = self.data[self.batch_count].ravel()
            cuda.memcpy_htod(self.device_input, batch)
            self.batch_count += 1
            return [self.device_input]
        else:
            return None

    def read_calibration_cache(self):
        if os.path.exists(self.cache_file):
            with open(self.cache_file, "rb") as f:
                return f.read()

    def write_calibration_cache(self, cache):
        return None

if __name__ == '__main__':

    device = 'cuda'
    model = densenet.make_model(121).cuda()
    model.load_state_dict(torch.load('../pretrained/Densnet121_base_pre_Tea.pkl', map_location='cuda:0'))
    model.eval()
    with torch.no_grad():
        collect_stats(model, train_loader, num_batches=10)
        compute_amax(model, method="percentile", percentile=99.99)
    densenet.finetune_model(model)
    with torch.no_grad():
        print(test_model_gpu(model)) # the accuracy is 0.854

    fp16_mode = False
    int8_mode = True
    # trt_engine_path = './engine/trt8.0_model_fp16_False_int8_True_.trt'
    trt_engine_path = './engine/trt8.0_model_fp16_{}_int8_{}.trt'.format(fp16_mode, int8_mode)
    batch_size = 32
    onnx_file_path = './onnx_model/densenet121_batch' + str(batch_size) + '.onnx'
    d_input = Variable(torch.randn(batch_size, 3, 224, 224)).cuda()
    torch.onnx.export(model, d_input, onnx_file_path, input_names=['input'], output_names=['output'], verbose=False, opset_version=13)
    test_model_accuracy(fp16_mode=fp16_mode, int8_mode=int8_mode)

Ricardosuzaku avatar Jun 08 '21 08:06 Ricardosuzaku

Thanks @Ricardosuzaku , confirmed this is a TRT implementation issue. I have submitted a fix internally, and it will be available in 8.0 GA.

ttyio avatar Jun 09 '21 14:06 ttyio

@ttyio Thanks a lot! But I have the same problem even when I don't apply PTQ or QAT: I just export a PyTorch model to ONNX and build a trt engine as above, and I also get 0.01 acc. Is there any problem in my code, especially in the steps for building the engine?

Ricardosuzaku avatar Jun 09 '21 15:06 Ricardosuzaku

@Ricardosuzaku , when the ONNX model contains Q/DQ, you have to enable INT8 when importing the ONNX, and there is no need to set up a calibrator since the scales are already carried by the Q/DQ nodes; when the ONNX model contains no Q/DQ, then to run INT8 you need to both enable INT8 and set up a calibrator.

Could you give more detail on your 0.01 acc? Does the ONNX contain Q/DQ? Did you enable INT8? Did you set up a calibrator? Thanks!
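
A minimal sketch of the two configurations (assuming the network has already been parsed from ONNX into builder/config; my_calibrator is a hypothetical IInt8EntropyCalibrator2 instance):

    # Case 1: ONNX already contains Q/DQ nodes (QAT / pytorch_quantization export).
    # Enable INT8 only; the scales come from the Q/DQ nodes, no calibrator is needed.
    config.set_flag(trt.BuilderFlag.INT8)

    # Case 2: plain ONNX without Q/DQ nodes (PTQ done inside TensorRT).
    # Enable INT8 *and* attach a calibrator so TensorRT can compute the scales itself.
    config.set_flag(trt.BuilderFlag.INT8)
    config.int8_calibrator = my_calibrator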

ttyio avatar Jun 10 '21 00:06 ttyio

@ttyio Yes, I know that! At the beginning my model did not contain Q/DQ; I ran it in INT8 with a calibrator set up, as in the build_engine function and the EntropyCalibrator class in my code above, and got 0.01 acc, while the fp32 and fp16 engines give the right acc. Then I used pytorch_quantization and added Q/DQ; whether or not I set up a calibrator, I still get 0.01 acc.

Ricardosuzaku avatar Jun 10 '21 01:06 Ricardosuzaku

Hello @Ricardosuzaku ,

with the line

  quant_modules.initialize()

at the beginning of the script, the DenseNet model created later in the code has its conv/gemm layers replaced with quantized versions that carry Q/DQ in front of them. Could you check the ONNX, or did you use a different script for the PTQ testing? Thanks
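
A quick way to check the exported ONNX for Q/DQ nodes (a hypothetical snippet using the onnx Python package; the path is the one from the script above):

    import onnx

    m = onnx.load("./onnx_model/densenet121_batch32.onnx")
    qdq = [n.op_type for n in m.graph.node if n.op_type in ("QuantizeLinear", "DequantizeLinear")]
    print(f"{len(qdq)} Q/DQ nodes found")  # 0 means no fake quantization was exported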

ttyio avatar Jun 10 '21 01:06 ttyio

@ttyio, thanks. Sorry for not expressing myself clearly. First, I'm sure I did not use pytorch_quantization and did not add quant_modules.initialize(). I simply ran the model in TensorRT 7.2; before converting it to ONNX (with opset 10) I printed the model and made sure there is no Q/DQ. I got the right answer in fp32 mode and fp16 mode, but 0.01 acc in int8 mode.

Ricardosuzaku avatar Jun 10 '21 02:06 Ricardosuzaku

@Ricardosuzaku , got it. Could you provide the verbose build logs for the passing 7.2 run and the failing 8.0 EA run? Thanks!

    TRT_LOGGER = trt.Logger(trt.Logger.VERBOSE)

ttyio avatar Jun 10 '21 02:06 ttyio

How can I write the TRT_LOGGER output to a txt file? I can only see it in the console.

Ricardosuzaku avatar Jun 10 '21 03:06 Ricardosuzaku

@Ricardosuzaku , could you redirect to a file like:

     python builder.py > build.log

Thanks
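
If a shell redirect is inconvenient, another option is a custom logger that writes every message to a file (a minimal sketch, not from the samples; the class name and path are placeholders):

    import tensorrt as trt

    class FileLogger(trt.ILogger):
        def __init__(self, path="build.log"):
            trt.ILogger.__init__(self)  # the parent constructor must be called explicitly
            self.file = open(path, "w")

        def log(self, severity, msg):
            # write everything, including VERBOSE messages, to the file
            self.file.write(f"[{severity}] {msg}\n")
            self.file.flush()

    TRT_LOGGER = FileLogger("trt8.0_build.log")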

ttyio avatar Jun 10 '21 03:06 ttyio

Hi, @ttyio I have sent the logs to your email, thanks a lot!

Ricardosuzaku avatar Jun 10 '21 07:06 Ricardosuzaku

Hi @Ricardosuzaku , thanks for the logs, but I do not see a regression between TRT7.2_opset9_int8_True_calib.log and TRT8.0_opset13_int8_True_calib.log. Both engines show

     trt accuracy {'top1_prec': 0.01, 'top5_prec': 0.05}

I thought you had the correct accuracy with 7.2 PTQ, right?

ttyio avatar Jun 10 '21 07:06 ttyio

Hi @ttyio, thanks for your reply. I only implemented PTQ in TRT 8.0. I always get the wrong answer when I set config.set_flag(trt.BuilderFlag.INT8), no matter whether in 7.2 or 8.0, but I get the right acc when I use builder.int8_mode=True in 7.2. When I do PTQ and QAT and run the model in PyTorch, I also get the right acc, so I think the problem is the config.set_flag(trt.BuilderFlag.INT8) code, or my steps are wrong when I set the INT8 flag on the config.

Ricardosuzaku avatar Jun 10 '21 08:06 Ricardosuzaku

@Ricardosuzaku , the code is correct; the new API behaves the same internally. Do you have the 7.2 log that uses builder.int8_mode=True?

ttyio avatar Jun 10 '21 08:06 ttyio

Sorry for the late reply; here is the 7.2 log that uses builder.int8_mode=True. I get the right acc with it.

Ricardosuzaku avatar Jun 10 '21 09:06 Ricardosuzaku

Sorry @Ricardosuzaku , I still cannot see the log.

ttyio avatar Jun 11 '21 01:06 ttyio

Could I say that 7.2 has fixed the INT8 problem, compared to 7.1.3.4, on the YOLOv5 model?

PowerDi avatar Jun 11 '21 03:06 PowerDi

@PowerDi we did not see any functional bug for YOLOv5 INT8. Could you elaborate more? Thanks

ttyio avatar Jun 11 '21 07:06 ttyio

@ttyio I sent the 7.2 logs with builder.int8_mode=True to your hotmail a few hours ago; did you receive them?

Ricardosuzaku avatar Jun 11 '21 07:06 Ricardosuzaku

@PowerDi we did not see any functional bug for YOLOv5 INT8. Could you elaborate more? Thanks

Yes. In fact we implement Hardswish this way: https://github.com/enazoe/yolo-tensorrt/blob/be3859be606b4e2cfc86f835e424d0df6018e18c/modules/hardswish.cu#L45-L61 And we add the model layer by layer with

https://github.com/enazoe/yolo-tensorrt/blob/be3859be606b4e2cfc86f835e424d0df6018e18c/modules/trt_utils.cpp#L820-L825

And INT8 works in that setup.

But in the ONNX path, we export the ONNX model this way:

class Hardswish(nn.Module):  # export-friendly version of nn.Hardswish()
    @staticmethod
    def forward(x):
        # return x * F.hardsigmoid(x)  # for torchscript and CoreML
        return x * F.hardtanh(x + 3, 0., 6.) / 6.  # for torchscript, CoreML and ONNX

However, exported this way the op is sensitive to INT8 and we cannot get the right result. We are confused because the two calculations seem to be equivalent.
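
The two formulations are indeed mathematically identical in fp32, which a quick check confirms (my own snippet, not from the repos above); the INT8 difference presumably comes from where the intermediate tensors (x + 3 versus hardsigmoid's output) get quantized:

    import torch
    import torch.nn.functional as F

    x = torch.linspace(-6, 6, steps=1001)
    a = x * F.hardsigmoid(x)                  # nn.Hardswish definition
    b = x * F.hardtanh(x + 3, 0., 6.) / 6.    # export-friendly version above
    print(torch.allclose(a, b, atol=1e-6))    # True in fp32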

PowerDi avatar Jun 15 '21 01:06 PowerDi

@ttyio Hi ttyio, I used TRT 8.0 EA's QAT to train a model similar to U-Net, but when I try to convert the ONNX to TRT it always shows a segmentation fault, which suggests the quantized model needs more memory when building the engine (the model without quantization builds successfully). I have tried increasing max_workspace_size, but it doesn't help. I would like to ask whether the model needs more memory when building the engine after adding quantization, maybe double? Besides, while increasing max_workspace_size another problem sometimes arises: a "No Type can't be serialize" error.

sheehan-dsh avatar Jun 24 '21 10:06 sheehan-dsh

I think you need to set the dynamic range for every layer: https://github.com/NVIDIA/TensorRT/tree/master/samples/opensource/sampleINT8API I want to know how to generate per_tensor_dynamic_ranges.txt.
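
A minimal sketch of applying per-tensor dynamic ranges through the Python API (a hypothetical snippet; ranges is assumed to be a dict mapping tensor names to max absolute activation values, which is essentially what per_tensor_dynamic_ranges.txt in that sample contains):

    def set_dynamic_ranges(network, ranges):
        # Apply a symmetric [-amax, amax] range to every tensor we have a value for
        for i in range(network.num_inputs):
            t = network.get_input(i)
            if t.name in ranges:
                t.set_dynamic_range(-ranges[t.name], ranges[t.name])
        for i in range(network.num_layers):
            layer = network.get_layer(i)
            for j in range(layer.num_outputs):
                t = layer.get_output(j)
                if t.name in ranges:
                    t.set_dynamic_range(-ranges[t.name], ranges[t.name])

    # config.set_flag(trt.BuilderFlag.INT8) still has to be set before building.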

facerless avatar Jul 22 '21 09:07 facerless

@Ricardosuzaku hello, I have the same problem. Has your problem been solved?

aojue1109 avatar Dec 30 '21 06:12 aojue1109

@Ricardosuzaku @aojue1109 Does this issue still exist with the latest TRT release? If it does, we will debug it. Thanks

nvpohanh avatar Jun 15 '22 10:06 nvpohanh

@ttyio @nvpohanh I have the same problem, but I get correct results from the ONNX (QAT) model and wrong results when trying to run inference in TensorRT 8.2. More info: QAT with torch 1.11, ONNX 1.9.0 (opset 12 & 13), TensorRT 8.2.2.1 (INT8).

The accuracy dropped significantly in INT8 mode! Accuracy is around 9%, but with ONNX it is 96.5%.

Mahsa1994 avatar Jun 19 '22 08:06 Mahsa1994

@Mahsa1994 Could you share your ONNX file? Thanks

nvpohanh avatar Jun 24 '22 06:06 nvpohanh

@nvpohanh Here is a link to the QAT model in PyTorch (ResNet50, pretrained, 7 classes): ResNet50.pth. It was trained with the versions mentioned above.

Mahsa1994 avatar Jun 27 '22 07:06 Mahsa1994