
Why does int8 quantization occupy more GPU memory than float16 with TensorRT quantization?

Open nameli0722 opened this issue 2 years ago • 8 comments


nameli0722 avatar May 29 '23 07:05 nameli0722

I'm doing the quantization with tiny-tensorrt.

nameli0722 avatar May 30 '23 04:05 nameli0722

It's expected: the int8 quantization process requires FP32 inference to compute the calibration scales.

zerollzeng avatar May 30 '23 15:05 zerollzeng
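To illustrate the point above, here is a minimal sketch of symmetric max-abs int8 calibration. This is not tiny-tensorrt's actual calibrator; the function names and the max-abs policy are assumptions for illustration. The key idea is that the calibrator must observe the FP32 activation values to pick a scale, which is why calibration runs FP32 inference under the hood:

```python
# Hypothetical sketch of symmetric ("max-abs") int8 calibration.
# The scale maps the observed FP32 activation range onto [-127, 127].

def int8_scale(fp32_activations):
    """Derive a per-tensor scale from FP32 activations seen during calibration."""
    amax = max(abs(v) for v in fp32_activations)
    return amax / 127.0

def quantize(value, scale):
    """Quantize one FP32 value to int8, clamping to the representable range."""
    q = round(value / scale)
    return max(-127, min(127, q))

acts = [-0.8, 0.1, 2.54, -1.3]   # FP32 activations collected from calibration data
s = int8_scale(acts)             # 2.54 / 127 = 0.02
print(quantize(2.54, s))         # -> 127 (the observed max maps to the int8 max)
```

Real calibrators (e.g. TensorRT's entropy calibrator) choose the clipping threshold more carefully than plain max-abs, but the dependency on FP32 activations is the same.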

I don't understand what you mean. I used int8 quantization with a calibration set, and the inference results are correct, but the GPU memory usage is larger than with float16.

nameli0722 avatar May 31 '23 01:05 nameli0722

@zerollzeng Thank you very much!

nameli0722 avatar May 31 '23 01:05 nameli0722

Hello, could you please share the GPU usage and inference speed with int8 and FP16?

QiangZhangCV avatar Jun 01 '23 05:06 QiangZhangCV

Hello, could you please share the GPU usage and inference speed with int8 and FP16?

thank you!

original PyTorch (.pt) model: GPU usage 5099 MB, inference time 1.7 s

tiny-tensorrt float16: GPU usage 3993 MB, inference time 0.4 s

tiny-tensorrt int8: GPU usage 4509 MB, inference time 0.4 s

All results are correct.

nameli0722 avatar Jun 01 '23 10:06 nameli0722

How about building the engine first and then loading it? I think that can save some memory.

Anyway, I'll try to improve this.

zerollzeng avatar Jun 01 '23 14:06 zerollzeng
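The suggestion above (build and serialize the engine once offline, then only deserialize it at inference time) can be sketched with the TensorRT Python API. This is a generic illustration, not tiny-tensorrt's code; the file name is hypothetical and the builder setup is elided:

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)

# Offline, once: build and serialize the engine. This is where
# calibration and tactic search run, and where the extra GPU memory
# is spent (builder/network/config setup elided):
#   plan = builder.build_serialized_network(network, config)
#   with open("unet.trt", "wb") as f:
#       f.write(plan)

# At inference time: only deserialize the prebuilt engine, so the
# builder's calibration/tactic workspace never occupies GPU memory.
with open("unet.trt", "rb") as f:
    engine = trt.Runtime(logger).deserialize_cuda_engine(f.read())
```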

How about building the engine first and then loading it? I think that can save some memory.

Anyway, I'll try to improve this.

./tinyexec --onnx /data/sdb/manager/RX0249_liming/coronary_model/onnx_model/unet.onnx --mode 2 --batch_size 1 --save_engine /data/sdb/manager/RX0249_liming/coronary_model/tiny_trt_model/float16_int8_calib/unet.trt --int8 --calibrate_data /data/sdb/manager/RX0249_liming/calib_data/tinyrt_data/

thank you!

nameli0722 avatar Jun 02 '23 01:06 nameli0722