tiny-tensorrt
Why does INT8 quantization occupy more GPU memory than FP16 with TensorRT quantization?
I performed the quantization with tiny-tensorrt.
It's expected: the INT8 quantization process requires FP32 inference during calibration to compute the scales.
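For reference, here is a minimal sketch of how such an INT8 build with entropy calibration is typically wired up with the TensorRT C++ API (TensorRT 8 signatures assumed). This is illustrative only, not tiny-tensorrt's actual implementation, and the calibrator is a skeleton with no real data source:

```cpp
#include <NvInfer.h>
#include <cstdint>
#include <cstddef>

// Skeleton calibrator: feeds preprocessed calibration batches to TensorRT.
// During calibration TensorRT runs the network in FP32 to collect activation
// statistics, which is why calibration itself needs full-precision inference.
class EntropyCalibrator : public nvinfer1::IInt8EntropyCalibrator2 {
public:
    int32_t getBatchSize() const noexcept override { return 1; }

    bool getBatch(void* bindings[], char const* names[], int32_t nbBindings) noexcept override {
        // Copy the next preprocessed batch to device memory, point the
        // bindings at it, and return true; return false when data runs out.
        return false;  // placeholder: no real data source in this sketch
    }

    void const* readCalibrationCache(std::size_t& length) noexcept override {
        length = 0;
        return nullptr;  // no cache: always recalibrate in this sketch
    }

    void writeCalibrationCache(void const* cache, std::size_t length) noexcept override {
        // Persist the cache here so later builds can skip calibration.
    }
};

nvinfer1::IHostMemory* buildInt8Engine(nvinfer1::IBuilder& builder,
                                       nvinfer1::INetworkDefinition& network,
                                       EntropyCalibrator& calibrator) {
    nvinfer1::IBuilderConfig* config = builder.createBuilderConfig();
    config->setFlag(nvinfer1::BuilderFlag::kINT8);
    // Also allowing FP16 lets TensorRT fall back to half precision for layers
    // that have no (or a slower) INT8 implementation; the resulting mix of
    // precisions is one reason an "INT8" engine's memory footprint can differ
    // from a pure-FP16 build.
    config->setFlag(nvinfer1::BuilderFlag::kFP16);
    config->setInt8Calibrator(&calibrator);
    return builder.buildSerializedNetwork(network, *config);
}
```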
I can't understand what you mean. I used INT8 quantization with a calibration set, and the inference results are correct, but the GPU memory usage is larger than with FP16.
@zerollzeng Thank you very much!
Hello, could you please provide the GPU memory usage and inference speed for INT8 and FP16?
thank you!
- original .pt model: GPU usage 5099 MB, inference time 1.7 s
- tiny-tensorrt FP16: GPU usage 3993 MB, inference time 0.4 s
- tiny-tensorrt INT8: GPU usage 4509 MB, inference time 0.4 s

All results are correct.
How about building the engine first and then loading it? I think that can save some memory.
Anyway, I'll try to improve this.
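A minimal sketch of that load-only path with the TensorRT C++ API (TensorRT 8 assumed; the engine filename is a placeholder for the saved engine path). Deserializing a pre-built engine avoids keeping the builder, ONNX parser, and calibration state alive at inference time:

```cpp
#include <NvInfer.h>
#include <cstdio>
#include <fstream>
#include <iterator>
#include <vector>

// Minimal logger required by the TensorRT runtime.
class Logger : public nvinfer1::ILogger {
    void log(Severity severity, char const* msg) noexcept override {
        if (severity <= Severity::kWARNING) std::printf("%s\n", msg);
    }
};

int main() {
    Logger logger;

    // Read the engine that was serialized during the build step.
    std::ifstream file("unet.trt", std::ios::binary);
    std::vector<char> blob((std::istreambuf_iterator<char>(file)),
                           std::istreambuf_iterator<char>());

    // Deserialize and create an execution context; only the engine's own
    // weights and activation workspace are allocated here, not the builder.
    nvinfer1::IRuntime* runtime = nvinfer1::createInferRuntime(logger);
    nvinfer1::ICudaEngine* engine =
        runtime->deserializeCudaEngine(blob.data(), blob.size());
    nvinfer1::IExecutionContext* context = engine->createExecutionContext();

    // ... copy inputs to device, call context->enqueueV2(...), copy outputs back ...
    return 0;
}
```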
```
./tinyexec --onnx /data/sdb/manager/RX0249_liming/coronary_model/onnx_model/unet.onnx --mode 2 --batch_size 1 --save_engine /data/sdb/manager/RX0249_liming/coronary_model/tiny_trt_model/float16_int8_calib/unet.trt --int8 --calibrate_data /data/sdb/manager/RX0249_liming/calib_data/tinyrt_data/
```
thank you!