torch_quantizer
int8 model is slower than fp16
Hello, I tried to run this project on SDXL, and the inference speed of the int8 model is slower than that of fp16. In an experiment on an A10 GPU, the original float16 model ran at 3.44 iter/s, while the int8 model achieved only 2.5 iter/s. Is this reasonable? torch 2.4 + CUDA 12.
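For reference, a minimal sketch of how such a throughput number could be measured (the warmup/iteration counts and the `measure_iters_per_sec` helper are assumptions for illustration, not the original benchmark):

```python
import time
import torch

def measure_iters_per_sec(model, example_input, warmup=10, iters=50):
    """Rough GPU throughput measurement in iterations per second."""
    model.eval()
    with torch.no_grad():
        # Warm up so kernel compilation/caching does not skew the timing.
        for _ in range(warmup):
            model(example_input)
        torch.cuda.synchronize()
        start = time.time()
        for _ in range(iters):
            model(example_input)
        torch.cuda.synchronize()
    return iters / (time.time() - start)
```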
Hello alumni. I think it is plausible, since torch_quantizer only provides a basic framework for running quantized models within PyTorch and is not well optimized; for instance, the quant/dequant steps are not fused with the GEMM kernels.
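To illustrate the fusion point, here is a minimal sketch of an unfused int8 linear path (not torch_quantizer's actual code; `quantize_per_tensor` and `unfused_int8_linear` are hypothetical names, and `torch._int_mm` is PyTorch's low-level int8 matmul op, which has its own shape constraints on CUDA):

```python
import torch

def quantize_per_tensor(x, scale):
    # Separate kernel: fp16 -> int8 quantization (round and clamp).
    return torch.clamp(torch.round(x / scale), -128, 127).to(torch.int8)

def unfused_int8_linear(x_fp16, w_int8, x_scale, w_scale):
    """Unfused int8 linear: quantize, int8 GEMM, dequantize as three steps.

    Each step is a separate kernel launch with its own trip through global
    memory, which is where the overhead relative to a plain fp16 matmul
    comes from when the GEMM itself is not the bottleneck.
    """
    x_int8 = quantize_per_tensor(x_fp16, x_scale)               # kernel 1: quantize activations
    acc_int32 = torch._int_mm(x_int8, w_int8)                   # kernel 2: int8 GEMM, int32 accumulate
    return acc_int32.to(torch.float16) * (x_scale * w_scale)    # kernel 3: dequantize output
```

A fused kernel would quantize the activations in registers, run the int8 GEMM, and write dequantized fp16 output in a single pass, avoiding the extra launches and memory traffic; without that kind of optimization an int8 path can end up slower than fp16, as observed here.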