Engine built from QDQ ONNX (INT8 + FP16) is not faster than an FP16 engine built from ONNX without QDQ
Hello! I used mtq.quantize to quantize the RFDETR model and exported a QDQ ONNX. However, after converting it to an engine, the model exported with --int8 and --fp16, although smaller in size than the original FP16 engine file, has almost the same inference speed. Could you please suggest any solutions? Thank you.
Can you provide the verbose logs to see if INT8 tactics are being chosen by TensorRT?
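In case it helps, here is one rough way to inspect a saved `--verbose` build log: count how often each precision is mentioned. The exact wording of tactic/precision lines varies across TensorRT versions, so the patterns below are assumptions, and the counts are only a coarse signal, not a definitive layer-by-layer report:

```python
import re

def count_precision_mentions(log_text: str) -> dict:
    """Count verbose-log lines mentioning each precision.

    TensorRT verbose logs tend to label chosen tactics / layer
    precisions with strings like "Int8", "Half"/"FP16", and
    "Float"/"FP32"; the exact wording is version-dependent.
    """
    patterns = {
        "int8": re.compile(r"\bint8\b", re.IGNORECASE),
        "fp16": re.compile(r"\b(fp16|half)\b", re.IGNORECASE),
        "fp32": re.compile(r"\b(fp32|float)\b", re.IGNORECASE),
    }
    counts = {name: 0 for name in patterns}
    for line in log_text.splitlines():
        for name, pat in patterns.items():
            if pat.search(line):
                counts[name] += 1
    return counts
```

If INT8 barely shows up among the chosen tactics, the builder has likely fallen back to FP16/FP32 kernels, which would explain a smaller engine with no speedup.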
> Can you provide the verbose logs to see if INT8 tactics are being chosen by TensorRT?
Thank you for your reply. Here are the verbose logs for two kinds of engines:
1. QAT model using --int8 and --fp16
2. Original model using --fp16
Supplement: as the input image size becomes larger, the QAT (--fp16) model becomes even slower than the original model (FP16), and only a little faster than the FP32 engine. All experiments were run on an RTX 4090 GPU.
> Can you provide the verbose logs to see if INT8 tactics are being chosen by TensorRT?
Sorry to bother you, but I was wondering if you're still available. The issue that the inference speed of the INT8-quantized engine does not improve (and even decreases) compared to converting the original model directly to FP16 has really been bothering me. Could you please explain the reason behind this? If any additional information is needed, I can provide it at any time. Thank you.
I have the same question.
I think the extra time is due to the introduced quantization and dequantization (Q/DQ) layers.
@DDDaar
> I think the extra time is due to the introduced quantization and dequantization (Q/DQ) layers.
[@DDDaar](https://github.com/DDDaar)
So what methods are available to actually achieve acceleration? If the overhead of converting between INT8 and FP32 (or FP16) is excessive, the goal of acceleration will not be reached. Thanks.
Hello, you can check this solution; it is the only one I know of for now: https://github.com/NVIDIA/Model-Optimizer/issues/207
[@DDDaar](https://github.com/DDDaar)