INT8 inference in TensorRT 8.0 gets the wrong answer
I want to convert a PyTorch model to TensorRT for INT8 inference, so I go PyTorch model -> ONNX model -> TRT engine, and with TensorRT 7.2.2.3 I succeeded.
With the old API I set FP16 and INT8 mode as
builder.fp16_mode = True
builder.int8_mode = True
In INT8 mode I feed test data to calibrate. I build FP32, FP16, and INT8 engines and get the right accuracy in all three modes.
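For context, the TensorRT 7.x build path with those deprecated attributes looks roughly like this (a sketch, not my exact code; it assumes a parsed network and a calibrator instance calib):

builder = trt.Builder(TRT_LOGGER)
builder.max_workspace_size = 1 << 32
builder.fp16_mode = True          # build an FP16 engine
builder.int8_mode = True          # build an INT8 engine
builder.int8_calibrator = calib   # calibrator fed with test data
engine = builder.build_cuda_engine(network)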
Now I want to apply a QAT model in TensorRT, so I updated PyTorch to 1.8.0, TensorRT to 8.0, CUDA to 10.2.89, and cuDNN to 8.2.0.
First I repeat the INT8 inference above, but the old attributes for setting FP16 and INT8 mode can no longer be used, so I use
config.set_flag(trt.BuilderFlag.FP16)
config.set_flag(trt.BuilderFlag.INT8)
I could not find a PyTorch INT8 inference sample, so I followed the int8_caffe_mnist sample (https://github.com/NVIDIA/TensorRT/blob/master/samples/python/int8_caffe_mnist) to calibrate on my test data with the Int8EntropyCalibrator2 method:
config.int8_calibrator = calib
I build the TRT engine successfully and get the right accuracy in FP32 and FP16 mode, but the wrong accuracy in INT8 mode: 0.01 acc for 100-class classification. When I check the engine output, the matrix values are almost all around 0.002.
Then I use the pytorch_quantization toolkit and do PTQ:
collect_stats(model, data_loader, num_batches=2)
compute_amax(model, method="percentile", percentile=99.99)
I export the model to ONNX, parse it, and build an INT8 engine for inference; I again get 0.01 accuracy.
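For context, the export step after PTQ looks roughly like this (a sketch, not my exact script; the output file name is a placeholder):

quant_nn.TensorQuantizer.use_fb_fake_quant = True   # emit QuantizeLinear/DequantizeLinear (Q/DQ) nodes
dummy_input = torch.randn(32, 3, 224, 224).cuda()
torch.onnx.export(model, dummy_input, 'densenet121_qat.onnx',
                  input_names=['input'], output_names=['output'], opset_version=13)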
The only difference between INT8 mode and FP16 mode is the config.set_flag and config.int8_calibrator calls in INT8 mode.
Why do I get the right accuracy in FP16 mode but 0.01 acc in INT8 mode?
I don't know where the problem is: the calibration method? the calibration batches? @ttyio I'd appreciate your reply, thanks a lot!
Environment: TensorRT Version: 8.0.0.3 NVIDIA GPU: Tesla P40 NVIDIA Driver Version: CUDA Version: 10.2.89 CUDNN Version: 8.2.0 Operating System: Ubuntu 19.10
Hello @Ricardosuzaku , the steps are correct. what's the accuracy when you run the pytorch_quantization toolkit? thanks
Hi @ttyio, thanks for your reply. I do PTQ and QAT as in https://github.com/NVIDIA/TensorRT/tree/master/tools/pytorch-quantization/examples. My model is a DenseNet: I build the torch model, load my weights, do PTQ, and then do QAT with pytorch_quantization. I run the PyTorch model after QAT and get the expected acc (85.28%, old model 85.58%). But when I convert it to an ONNX model and build an INT8 TRT engine for inference (it seems that because of QuantConv2d I can't build an FP32 engine) and do calibration, I get the wrong 0.01 acc. I print the output matrix for a test image and find the mean is about 0.002.
@Ricardosuzaku , seems like there is an implementation issue, could you share the onnx file? thanks
@ttyio The main code is below, and I have sent the ONNX file to your email. Thanks a lot!
import os
import numpy as np
import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit
import torch
from torch.autograd import Variable
from tqdm import tqdm
from pytorch_quantization import nn as quant_nn
from pytorch_quantization import calib
from pytorch_quantization.tensor_quant import QuantDescriptor
from pytorch_quantization import quant_modules
# Project-specific helpers (densenet, MiniImageNet, ImageBatchStreamDemo, allocate_buffers,
# postprocess_the_outputs, _topk_hit, test_model_gpu, load_engine, train_loader,
# _val_batch_size, _num_workers) are defined elsewhere and not shown here.

# Use histogram calibration for the inputs of all quantized conv/linear layers and
# monkey-patch the torch modules with their quantized counterparts.
quant_desc_input = QuantDescriptor(calib_method='histogram')
quant_nn.QuantConv2d.set_default_quant_desc_input(quant_desc_input)
quant_nn.QuantLinear.set_default_quant_desc_input(quant_desc_input)
quant_modules.initialize()
def collect_stats(model, data_loader, num_batches):
    """Feed data to the network and collect statistics"""
    # Enable calibrators
    for name, module in model.named_modules():
        if isinstance(module, quant_nn.TensorQuantizer):
            # print(name, module._calibrator)
            if module._calibrator is not None:
                module.disable_quant()
                module.enable_calib()
            else:
                module.disable()
    for i, (image, _) in tqdm(enumerate(data_loader), total=num_batches):
        model(image.cuda())
        if i >= num_batches:
            break
    # Disable calibrators
    for name, module in model.named_modules():
        if isinstance(module, quant_nn.TensorQuantizer):
            if module._calibrator is not None:
                module.enable_quant()
                module.disable_calib()
            else:
                module.enable()
def compute_amax(model, **kwargs):
    # Load calib result
    for name, module in model.named_modules():
        if isinstance(module, quant_nn.TensorQuantizer):
            if module._calibrator is not None:
                if isinstance(module._calibrator, calib.MaxCalibrator):
                    module.load_calib_amax()
                else:
                    module.load_calib_amax(**kwargs)
            # print(F"{name:40}: {module}")
    model.cuda()
def test_trt_model_accuracy(fp16_mode=False, int8_mode=False, engine_path=None):
    if engine_path:
        engine = load_engine(engine_path)
    else:
        if int8_mode:
            test_data = MiniImageNet('test')
            dataset = torch.utils.data.DataLoader(test_data, batch_size=_val_batch_size, shuffle=True,
                                                  num_workers=_num_workers, pin_memory=True)
            max_batch_for_calibration = 32
            transform = None
            img_size = (3, 224, 224)
            calibration_stream = ImageBatchStreamDemo(dataset, transform, max_batch_for_calibration, img_size)  # unused
            calibrator = EntropyCalibrator(dataset, max_batches=2)
            engine = build_engine(onnx_file_path=onnx_file_path, engine_file_path=trt_engine_path,
                                  fp16_mode=fp16_mode, int8_mode=int8_mode, calib=calibrator, save_engine=True)
        else:
            engine = build_engine(onnx_file_path=onnx_file_path, engine_file_path=trt_engine_path,
                                  fp16_mode=fp16_mode, int8_mode=int8_mode, save_engine=True)
    context = engine.create_execution_context()
    inputs, outputs, bindings, stream = allocate_buffers(engine)  # host/device buffers and bindings
    acc_info = test_model_trt(context, bindings, inputs, outputs, stream)
    print('trt accuracy', acc_info)
def test_model_trt(context, bindings, inputs, outputs, stream):
    test_data = MiniImageNet('test')
    test_loader = torch.utils.data.DataLoader(
        test_data,
        batch_size=_val_batch_size, shuffle=True,
        num_workers=_num_workers, pin_memory=True)
    return validate_trt(test_loader, context, bindings, inputs, outputs, stream)


def validate_trt(val_loader, context, bindings, inputs, outputs, stream):
    top1_hits = []
    top5_hits = []
    for images, target in tqdm(val_loader, leave=True, desc='val progress'):
        target = target.to(device)
        inputs[0].host = images.numpy().reshape(-1)
        trt_outputs = do_inference(context, bindings=bindings, inputs=inputs, outputs=outputs, stream=stream)
        # The network has 100 output classes; keep only the rows belonging to this batch.
        output = postprocess_the_outputs(trt_outputs[0][:images.shape[0] * 100], (images.shape[0], 100))
        output = torch.from_numpy(output).to(device)
        for ins_output, ins_target in zip(output, target):
            top1hit, top5hit = _topk_hit(ins_output, ins_target, topk=(1, 5))
            top1_hits.append(top1hit)
            top5_hits.append(top5hit)
    top1_prec = len([hit for hit in top1_hits if hit]) / len(top1_hits)
    top5_prec = len([hit for hit in top5_hits if hit]) / len(top5_hits)
    return {
        'top1_prec': top1_prec,
        'top5_prec': top5_prec
    }
def build_engine(onnx_file_path="", engine_file_path="", fp16_mode=False, int8_mode=False,
                 save_engine=False, calib=None, TRT_LOGGER=trt.Logger()):
    """Takes an ONNX file and creates a TensorRT engine to run inference with"""
    explicit_batch = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
    with trt.Builder(TRT_LOGGER) as builder, builder.create_network(explicit_batch) as network, \
            builder.create_builder_config() as config, \
            trt.OnnxParser(network, TRT_LOGGER) as parser:
        config.max_workspace_size = 1 << 32  # 4 GiB
        builder.max_batch_size = 1
        if fp16_mode:
            config.set_flag(trt.BuilderFlag.FP16)
        elif int8_mode:
            config.set_flag(trt.BuilderFlag.INT8)
            config.int8_calibrator = calib
        # config.set_flag(trt.BuilderFlag.SPARSE_WEIGHTS)
        # Parse model file
        with open(onnx_file_path, 'rb') as model:
            print('Beginning ONNX file parsing')
            if not parser.parse(model.read()):
                print('ERROR: Failed to parse the ONNX file.')
                for error in range(parser.num_errors):
                    print(parser.get_error(error))
                return None
        # The ONNX model was exported with batch size 32; make the input shape explicit.
        network.get_input(0).shape = [32, 3, 224, 224]
        print('Completed parsing of ONNX file')
        print('Building an engine from file {}; this may take a while...'.format(onnx_file_path))
        engine = builder.build_engine(network, config)
        print("Completed creating Engine")
        if save_engine:
            with open(engine_file_path, "wb") as f:
                f.write(engine.serialize())
        return engine
def do_inference(context, bindings, inputs, outputs, stream, batch_size=1):
    # Copy inputs host -> device, run inference, copy outputs device -> host.
    [cuda.memcpy_htod_async(inp.device, inp.host, stream) for inp in inputs]
    context.execute_async(batch_size=batch_size, bindings=bindings, stream_handle=stream.handle)
    [cuda.memcpy_dtoh_async(out.host, out.device, stream) for out in outputs]
    stream.synchronize()
    return [out.host for out in outputs]
class EntropyCalibrator(trt.IInt8EntropyCalibrator2):
    def __init__(self, dataset, cache_file=" ", batch_size=32, max_batches=2):
        # Whenever you specify a custom constructor for a TensorRT class,
        # you MUST call the constructor of the parent explicitly.
        trt.IInt8EntropyCalibrator2.__init__(self)
        self.cache_file = cache_file
        self.max_batches = max_batches
        # Every time get_batch is called, the next batch of size batch_size will be copied to the device and returned.
        self.dataset = dataset
        self.batch_size = batch_size
        self.current_index = 0
        self.batch_count = 0
        # Allocate enough memory for all calibration batches.
        # The host buffer must be float32 to match the network's input dtype.
        self.data = np.zeros((max_batches, batch_size, 3, 224, 224), dtype=np.float32)
        for k, (images, targets) in enumerate(self.dataset):
            if k >= self.max_batches:
                break
            self.data[k] = images.numpy()
        self.device_input = cuda.mem_alloc(self.data[0].nbytes)

    def get_batch_size(self):
        return self.batch_size

    def get_batch(self, names):
        if self.batch_count < self.max_batches:
            batch = self.data[self.batch_count].ravel()
            cuda.memcpy_htod(self.device_input, batch)
            self.batch_count += 1
            return [self.device_input]
        else:
            return None

    def read_calibration_cache(self):
        if os.path.exists(self.cache_file):
            with open(self.cache_file, "rb") as f:
                return f.read()

    def write_calibration_cache(self, cache):
        return None
if __name__ == '__main__':
    device = 'cuda'
    model = densenet.make_model(121).cuda()
    model.load_state_dict(torch.load('../pretrained/Densnet121_base_pre_Tea.pkl', map_location='cuda:0'))
    model.eval()
    # PTQ: collect histograms on the training data, then compute amax values.
    with torch.no_grad():
        collect_stats(model, train_loader, num_batches=10)
        compute_amax(model, method="percentile", percentile=99.99)
    densenet.finetune_model(model)  # QAT fine-tuning (project-specific)
    with torch.no_grad():
        print(test_model_gpu(model))  # the accuracy is 0.854
    fp16_mode = False
    int8_mode = True
    # trt_engine_path = './engine/trt8.0_model_fp16_False_int8_True_.trt'
    trt_engine_path = './engine/trt8.0_model_fp16_{}_int8_{}.trt'.format(fp16_mode, int8_mode)
    batch_size = 32
    onnx_file_path = './onnx_model/densenet121_batch' + str(batch_size) + '.onnx'
    d_input = Variable(torch.randn(batch_size, 3, 224, 224)).cuda()
    torch.onnx.export(model, d_input, onnx_file_path, input_names=['input'], output_names=['output'],
                      verbose=False, opset_version=13)
    test_trt_model_accuracy(fp16_mode=fp16_mode, int8_mode=int8_mode)
Thanks @Ricardosuzaku, confirmed this is a TRT implementation issue; I have submitted a fix internally, and it will be available in 8.0 GA.
@ttyio Thanks a lot! But I have the same problem even when I don't apply PTQ or QAT: I just convert a PyTorch model to ONNX and build a TRT engine from it as above, and I also get 0.01 acc. Is there any problem in my code, especially in the steps for building the engine?
@Ricardosuzaku, when the ONNX model contains Q/DQ, you have to enable INT8 when importing the ONNX, and there is no need to set up a calibrator since the scales are already in the Q/DQ nodes; when the ONNX model contains no Q/DQ, then to run INT8 you need to both enable INT8 and set up a calibrator.
Could you give more detail on your 0.01 acc? Does the ONNX contain Q/DQ? Did you enable INT8? Did you set up a calibrator? Thanks!
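(For illustration, a minimal sketch of the two paths described above, assuming config, calib, and the parsed network already exist:)

# Path 1: the ONNX already contains Q/DQ nodes - scales come from Q/DQ, no calibrator needed.
config.set_flag(trt.BuilderFlag.INT8)

# Path 2: plain ONNX without Q/DQ - enable INT8 and attach a calibrator so TensorRT can compute the scales.
config.set_flag(trt.BuilderFlag.INT8)
config.int8_calibrator = calib  # e.g. an IInt8EntropyCalibrator2 instance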
@ttyio Yes, I know that! At the beginning my model doesn't contain Q/DQ; I run it in INT8 with a calibrator set up and get 0.01 acc, as in the build_engine function and EntropyCalibrator class in my code above. But when I run it with an FP32 or FP16 engine, I get the right acc. Then I use pytorch_quantization to add Q/DQ, and no matter whether I set up a calibrator or not, I get 0.01 acc.
Hello @Ricardosuzaku ,
with the line
quant_modules.initialize()
at the beginning of the script, the DenseNet model created later in the code will have its conv/gemm layers replaced with quantized modules that insert Q/DQ. Could you take a look at the ONNX, or did you use another script for the PTQ testing? Thanks
@ttyio, thanks.
Sorry for not expressing myself clearly.
First, I'm sure I don't apply pytorch_quantization, and I don't add the line quant_modules.initialize().
I just ran it in TensorRT 7.2; before converting to ONNX (with opset 10) I printed my model and made sure there is no Q/DQ. I got the right answer in fp32 and fp16 mode, but 0.01 acc in int8 mode.
@Ricardosuzaku , got it, Could you provide the verbose build log for the passed 7.2 and failed 8.0ea? thanks!
TRT_LOGGER = trt.Logger(trt.Logger.VERBOSE)
Then how can I write the TRT_LOGGER output to a txt file? I can only see it in the console.
@Ricardosuzaku , could you redirect to a file like:
python builder.py > build.log
Thanks
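(As an alternative sketch, not from this thread: the Python API also accepts a custom logger derived from trt.ILogger, so the messages can be written straight to a file; the class and file names below are placeholders.)

import tensorrt as trt

class FileLogger(trt.ILogger):
    def __init__(self, path='trt_build.log'):
        trt.ILogger.__init__(self)  # the parent constructor must be called explicitly
        self.file = open(path, 'w')

    def log(self, severity, msg):
        # Write every message the builder emits, prefixed with its severity.
        self.file.write('[{}] {}\n'.format(severity, msg))
        self.file.flush()

TRT_LOGGER = FileLogger('trt8.0_int8_build.log')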
Hi, @ttyio I have sent the logs to your email, thanks a lot!
Hi @Ricardosuzaku , thanks for the log
but I did not see a regression between TRT7.2_opset9_int8_True_calib.log and TRT8.0_opset13_int8_True_calib.log.
Both engines have
trt accuracy {'top1_prec': 0.01, 'top5_prec': 0.05}
I thought you had the correct accuracy with 7.2 PTQ, right?
Hi @ttyio, thanks for your reply. I have only run PTQ with TRT 8.0. I always get the wrong answer when I set config.set_flag(trt.BuilderFlag.INT8), no matter whether in 7.2 or 8.0. I get the right acc when I use builder.int8_mode=True in 7.2. But when I do PTQ and QAT and then run the PyTorch model, I get the right acc, so I think the problem is the line config.set_flag(trt.BuilderFlag.INT8)? Or maybe my steps are wrong when I set the config INT8 flag.
@Ricardosuzaku, the code is correct; the new API behaves the same internally. Do you have the 7.2 log that uses builder.int8_mode=True?
Sorry for the late reply, here is the 7.2 log that uses builder.int8_mode=True. I get the right acc with it.
Sorry @Ricardosuzaku, I still cannot see the log.
Could I say that 7.2 has fixed the INT8 problem, compared to 7.1.3.4, on the YOLOv5 model?
@PowerDi we did not see any functional bug for YOLOv5 INT8. Could you elaborate more? Thanks
@ttyio I sent the 7.2 logs with builder.int8_mode=True to your hotmail a few hours ago, did you receive them?
@PowerDi we did not see any functional bug for YOLOv5 INT8. Could you elaborate more? Thanks
Yes. In fact we implement Hardswish this way: https://github.com/enazoe/yolo-tensorrt/blob/be3859be606b4e2cfc86f835e424d0df6018e18c/modules/hardswish.cu#L45-L61 and we add the model layer by layer with
https://github.com/enazoe/yolo-tensorrt/blob/be3859be606b4e2cfc86f835e424d0df6018e18c/modules/trt_utils.cpp#L820-L825
and INT8 works for us in that setup.
But on the ONNX path, we export the ONNX model in this way:
import torch.nn as nn
import torch.nn.functional as F

class Hardswish(nn.Module):  # export-friendly version of nn.Hardswish()
    @staticmethod
    def forward(x):
        # return x * F.hardsigmoid(x)  # for torchscript and CoreML
        return x * F.hardtanh(x + 3, 0., 6.) / 6.  # for torchscript, CoreML and ONNX
However, in this form the function is sensitive to INT8 and we cannot get the right result. We are confused because the calculation seems to be equivalent.
@ttyio Hi ttyio, I use TRT 8.0 EA's QAT to train a model similar to U-Net, but when I try to convert the ONNX to TRT it always shows a segmentation fault, which suggests the quantized model needs more memory when building the engine (the model without quantization builds successfully). I have tried increasing max_workspace_size, but it doesn't help. I would like to ask whether the model needs more memory when building the engine after adding quantization, maybe double? Besides, while increasing max_workspace_size another problem arose: "No Type can't be serialize" (it happens sometimes).
I think you need to set the dynamic range for every layer: https://github.com/NVIDIA/TensorRT/tree/master/samples/opensource/sampleINT8API. I want to know how to generate per_tensor_dynamic_ranges.txt.
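(For illustration, a minimal sketch of setting per-tensor dynamic ranges through the Python API; amax is a hypothetical dict mapping tensor names to calibrated absolute-max values, not something the sample generates for you:)

def set_dynamic_ranges(network, amax):
    # amax: hypothetical {tensor_name: |max|} table, e.g. computed offline from activation statistics.
    for i in range(network.num_inputs):
        tensor = network.get_input(i)
        if tensor.name in amax:
            r = float(amax[tensor.name])
            tensor.dynamic_range = (-r, r)  # symmetric INT8 range
    for i in range(network.num_layers):
        layer = network.get_layer(i)
        for j in range(layer.num_outputs):
            tensor = layer.get_output(j)
            if tensor.name in amax:
                r = float(amax[tensor.name])
                tensor.dynamic_range = (-r, r)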
@Ricardosuzaku hello, I have the same problem. Has your problem been solved?
@Ricardosuzaku @aojue1109 Does this issue still exist with the latest TRT release? If it does, we will debug it. Thanks
@ttyio @nvpohanh I have the same problem: I get correct results from the ONNX (QAT) model but wrong results when trying to infer with TensorRT 8.2. More info: QAT with torch 1.11, ONNX 1.9.0 (opset 12 & 13), TensorRT 8.2.2.1 (INT8).
The accuracy dropped significantly in INT8 mode! It is around 9% in TensorRT but 96.5% with ONNX.
@Mahsa1994 Could you share your ONNX file? Thanks
@nvpohanh Here is a link to the QAT model in PyTorch (ResNet50, pretrained, 7 classes): ResNet50.pth. It was trained with the versions mentioned above.