
Why does int8 quantization occupy more GPU memory than float16 with TensorRT quantization?

Open nameli0722 opened this issue 2 years ago • 8 comments


nameli0722 avatar May 29 '23 07:05 nameli0722

I'm doing the quantization with tiny-tensorrt.

nameli0722 avatar May 30 '23 04:05 nameli0722

It's expected: the int8 quantization process requires FP32 inference to compute the calibration scales.

zerollzeng avatar May 30 '23 15:05 zerollzeng
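To illustrate the point above, here is a minimal sketch of symmetric max-abs int8 calibration. This is not tiny-tensorrt's actual calibrator; the function names and the max-abs policy are assumptions for illustration. The key idea is that the calibrator must observe the FP32 activation values to pick a scale, which is why calibration runs FP32 inference under the hood:

```python
# Hypothetical sketch of symmetric ("max-abs") int8 calibration.
# The scale maps the observed FP32 activation range onto [-127, 127].

def int8_scale(fp32_activations):
    """Derive a per-tensor scale from FP32 activations seen during calibration."""
    amax = max(abs(v) for v in fp32_activations)
    return amax / 127.0

def quantize(value, scale):
    """Quantize one FP32 value to int8, clamping to the representable range."""
    q = round(value / scale)
    return max(-127, min(127, q))

acts = [-0.8, 0.1, 2.54, -1.3]   # FP32 activations collected from calibration data
s = int8_scale(acts)             # 2.54 / 127 = 0.02
print(quantize(2.54, s))         # -> 127 (the observed max maps to the int8 max)
```

Real calibrators (e.g. TensorRT's entropy calibrator) choose the clipping threshold more carefully than plain max-abs, but the dependency on FP32 activations is the same.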

I don't understand what you mean. I used int8 quantization with a calibration set, and the inference results are correct, but the GPU memory usage is larger than with float16.

nameli0722 avatar May 31 '23 01:05 nameli0722

@zerollzeng Thank you very much!

nameli0722 avatar May 31 '23 01:05 nameli0722

Hello, could you please share the GPU usage and inference speed with int8 and FP16?

QiangZhangCV avatar Jun 01 '23 05:06 QiangZhangCV

Hello, could you please share the GPU usage and inference speed with int8 and FP16?

thank you!

original PyTorch (.pt) model: GPU usage 5099 MB, inference time 1.7 s

tiny-tensorrt float16: GPU usage 3993 MB, inference time 0.4 s

tiny-tensorrt int8: GPU usage 4509 MB, inference time 0.4 s

All results are correct.

nameli0722 avatar Jun 01 '23 10:06 nameli0722

How about building the engine first and then loading it? I think that can save some memory.

Anyway, I'll try to improve this.

zerollzeng avatar Jun 01 '23 14:06 zerollzeng
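The suggestion above (build and serialize the engine once offline, then only deserialize it at inference time) can be sketched with the TensorRT Python API. This is a generic illustration, not tiny-tensorrt's code; the file name is hypothetical and the builder setup is elided:

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)

# Offline, once: build and serialize the engine. This is where
# calibration and tactic search run, and where the extra GPU memory
# is spent (builder/network/config setup elided):
#   plan = builder.build_serialized_network(network, config)
#   with open("unet.trt", "wb") as f:
#       f.write(plan)

# At inference time: only deserialize the prebuilt engine, so the
# builder's calibration/tactic workspace never occupies GPU memory.
with open("unet.trt", "rb") as f:
    engine = trt.Runtime(logger).deserialize_cuda_engine(f.read())
```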

How about building the engine first and then loading it? I think that can save some memory.

Anyway, I'll try to improve this.

./tinyexec --onnx /data/sdb/manager/RX0249_liming/coronary_model/onnx_model/unet.onnx --mode 2 --batch_size 1 --save_engine /data/sdb/manager/RX0249_liming/coronary_model/tiny_trt_model/float16_int8_calib/unet.trt --int8 --calibrate_data /data/sdb/manager/RX0249_liming/calib_data/tinyrt_data/

thank you!

nameli0722 avatar Jun 02 '23 01:06 nameli0722