torch_quantizer
int8 model is slower than fp16
Hello, I tried to run this project on SDXL, and the inference speed of the int8 model is slower than that of fp16. In an experiment on an A10 GPU, the original float16 model ran at 3.44 iter/s, while the int8 model achieved only 2.5 iter/s. Is this reasonable? torch 2.4 + CUDA 12.
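For reference, a minimal sketch of how such a throughput number could be measured (the warmup/iteration counts and the `measure_iters_per_sec` helper are assumptions for illustration, not the original benchmark):

```python
import time
import torch

def measure_iters_per_sec(model, example_input, warmup=10, iters=50):
    """Rough GPU throughput measurement in iterations per second."""
    model.eval()
    with torch.no_grad():
        # Warm up so kernel compilation/caching does not skew the timing.
        for _ in range(warmup):
            model(example_input)
        torch.cuda.synchronize()
        start = time.time()
        for _ in range(iters):
            model(example_input)
        torch.cuda.synchronize()
    return iters / (time.time() - start)
```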
Hello alumni. I think it is plausible, since torch_quantizer only provides a basic framework for running quantized models within PyTorch and is not well optimized; for instance, the quant/dequant steps are not fused with the GEMM kernels.
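To illustrate the fusion point, here is a minimal sketch of an unfused int8 linear path (not torch_quantizer's actual code; `quantize_per_tensor` and `unfused_int8_linear` are hypothetical names, and `torch._int_mm` is PyTorch's low-level int8 matmul op, which has its own shape constraints on CUDA):

```python
import torch

def quantize_per_tensor(x, scale):
    # Separate kernel: fp16 -> int8 quantization (round and clamp).
    return torch.clamp(torch.round(x / scale), -128, 127).to(torch.int8)

def unfused_int8_linear(x_fp16, w_int8, x_scale, w_scale):
    """Unfused int8 linear: quantize, int8 GEMM, dequantize as three steps.

    Each step is a separate kernel launch with its own trip through global
    memory, which is where the overhead relative to a plain fp16 matmul
    comes from when the GEMM itself is not the bottleneck.
    """
    x_int8 = quantize_per_tensor(x_fp16, x_scale)               # kernel 1: quantize activations
    acc_int32 = torch._int_mm(x_int8, w_int8)                   # kernel 2: int8 GEMM, int32 accumulate
    return acc_int32.to(torch.float16) * (x_scale * w_scale)    # kernel 3: dequantize output
```

A fused kernel would quantize the activations in registers, run the int8 GEMM, and write dequantized fp16 output in a single pass, avoiding the extra launches and memory traffic; without that kind of optimization an int8 path can end up slower than fp16, as observed here.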