
Engine built from QDQ ONNX with --fp16 and --int8 is no faster than an fp16 engine built from ONNX without QDQ

Open DDDaar opened this issue 7 months ago • 8 comments

Hello! I used mtq.quantize to quantize the RFDETR model and exported the QDQ ONNX. However, after converting it to an engine, the models exported with --int8 and --fp16, although smaller in size than the original fp16 engine file, have almost the same inference speed. Could you please suggest any solutions? Thank you.

DDDaar avatar Aug 13 '25 07:08 DDDaar

Can you provide the verbose logs to see if INT8 tactics are being chosen by TensorRT?
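(As a side note for readers: one way to answer this question is to scan the trtexec --verbose build log for the per-layer precisions TensorRT selected. The sketch below is illustrative only; the sample log lines are mocked, and real trtexec layer-information output differs across TensorRT versions.)

```python
import re
from collections import Counter

# Mock log lines for illustration -- NOT actual trtexec output format.
sample_log = """\
[V] Engine Layer Information:
Layer(CaskConvolution): conv1, Tactic: 0x1, Int8 -> Int8
Layer(CaskConvolution): conv2, Tactic: 0x2, Half -> Half
Layer(Reformat): reformat1, Tactic: 0x0, Int8 -> Half
"""

def precision_histogram(log_text):
    # Count how many layer lines mention each precision keyword (once per line).
    counts = Counter()
    for line in log_text.splitlines():
        if "Layer(" not in line:
            continue
        for prec in ("Int8", "Half", "Float"):
            if prec in line:
                counts[prec] += 1
    return counts

print(precision_histogram(sample_log))
```

If most compute-heavy layers show fp16 precision despite the Q/DQ nodes, the int8 tactics were not actually chosen, which would explain identical speeds.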

kevinch-nv avatar Aug 14 '25 22:08 kevinch-nv

Can you provide the verbose logs to see if INT8 tactics are being chosen by TensorRT?

Thank you for your reply. Here are the verbose logs for two kinds of engines: 1. the QAT model using --int8 and --fp16; 2. the original model using --fp16.

ori_fp16.log

qat_fp16_int8.log

DDDaar avatar Aug 15 '25 03:08 DDDaar

Supplement: when the input image size becomes larger, the QAT (--fp16) model becomes even slower than the original fp16 model, and only a little faster than the fp32 engine. All experiments were run on an RTX 4090 GPU.

DDDaar avatar Aug 15 '25 05:08 DDDaar

Can you provide the verbose logs to see if INT8 tactics are being chosen by TensorRT?

Sorry to bother you, but I was wondering if you're still available. The issue that the inference speed of the int8-quantized engine does not improve (and even decreases) compared to converting the original model directly to fp16 has really been bothering me. Could you please explain the reason behind this? If any additional information is needed, I can provide it at any time. Thank you.

DDDaar avatar Aug 17 '25 14:08 DDDaar

I have the same question.

cslvjt avatar Aug 19 '25 10:08 cslvjt

I think the extra time is due to the introduced quantization and dequantization layers.


@DDDaar
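(To make this concrete: a Q/DQ pair just snaps values to an int8 grid and back. A minimal NumPy sketch of symmetric int8 fake quantization; the function name and scale value are illustrative, not from the thread.)

```python
import numpy as np

def fake_quant_int8(x, scale):
    # Quantize: round onto the int8 grid and clamp to [-128, 127];
    # Dequantize: scale back to float. This is what a Q/DQ pair computes.
    q = np.clip(np.round(x / scale), -128, 127)
    return (q * scale).astype(np.float32)

x = np.array([0.1, -0.5, 2.0], dtype=np.float32)
scale = np.float32(2.0 / 127)   # symmetric scale for an assumed [-2, 2] range
print(fake_quant_int8(x, scale))
```

When TensorRT can fuse a Q/DQ pair into the adjacent convolution or GEMM, this round trip is essentially free; when it cannot (or has to insert Reformat layers around it), each pair runs as its own kernel, and those extra launches can eat the savings from int8 math.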

cslvjt avatar Aug 19 '25 10:08 cslvjt

I think the extra time is due to the introduced quantization and dequantization layers.

[@DDDaar](https://github.com/DDDaar)

So what methods are available to achieve acceleration? If the conversion between int8 and fp32 (or fp16) takes excessively long, the goal of acceleration will not be achieved. Thanks.
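(A diagnostic sketch, not from the thread: profiling per-layer times for both engines with trtexec makes it visible which layers run in int8 and whether Reformat/Q-DQ layers eat the savings. Paths are placeholders.)

```shell
# Build the QAT engine from the QDQ ONNX and record per-layer timings.
trtexec --onnx=rfdetr_qdq.onnx --int8 --fp16 \
        --dumpProfile --separateProfileRun --exportProfile=qat_profile.json

# Build the baseline fp16 engine from the plain ONNX and profile it the same way.
trtexec --onnx=rfdetr.onnx --fp16 \
        --dumpProfile --separateProfileRun --exportProfile=fp16_profile.json
```

Comparing the two exported profiles layer by layer shows whether the int8 layers are actually faster and how much time standalone Quantize/Dequantize/Reformat layers add.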

DDDaar avatar Aug 19 '25 12:08 DDDaar

I think the extra time is due to the introduced quantization and dequantization layers. [@DDDaar](https://github.com/DDDaar)

So what methods are available to achieve acceleration? If the conversion between int8 and fp32 (or fp16) takes excessively long, the goal of acceleration will not be achieved. Thanks.

Hello, you can check this solution; it is the only one I know of for now: https://github.com/NVIDIA/Model-Optimizer/issues/207

Floriangit12 avatar Dec 10 '25 15:12 Floriangit12