
quantization for performance vs for memory

Open aifartist opened this issue 1 year ago • 3 comments

I push SD performance to the maximum. Currently I can generate 200 images per second on my 4090 when using 1-step sd-turbo, the onediff compiler, the stable-fast compiler, and my own optimizations. This is batch size 12 at 512x512.

I've been trying to deepen my knowledge of quantization, and I have managed to get your code to work. However, I had to fix a few things, such as making tensors contiguous, and the fact that there is no 'in_features' attribute on the layers in the models I'm using; I have to use shape[0] for it and shape[1] for out_features.
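
In case it helps, a minimal sketch of the workaround I mean, assuming the layer exposes a weight tensor. `features_from_weight` is just a made-up helper name, and the axis order should be checked against the actual weight layout of the layer being quantized:

```python
import torch
import torch.nn as nn

def features_from_weight(layer: nn.Module):
    """Hypothetical helper: recover (in_features, out_features) for a layer
    that lacks those attributes by falling back to its weight shape.
    Which axis is which depends on the layer type; a standard nn.Linear
    stores its weight as (out_features, in_features)."""
    if hasattr(layer, "in_features"):
        return layer.in_features, layer.out_features
    w = layer.weight
    return w.shape[0], w.shape[1]   # axis order assumed; verify per layer type

# Many quantization kernels also require contiguous inputs:
x = torch.randn(4, 8).t()   # a transposed view is non-contiguous
x = x.contiguous()          # make it contiguous before the quantized op
```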

Once I got it working, I was surprised it was so slow. I believe it might be because qconv2d_8bit and qlinear_8bit both seem to quantize and dequantize around every single operation, instead of doing it once when there is a sequence of consecutive operations that all support quantization. I don't even know the correct terminology to express this.
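
Roughly what I mean, in toy Python (quantize/dequantize here are simple stand-ins, not your kernels, and the ops run in float just to keep the sketch runnable):

```python
import torch

def quantize(x, scale=0.05):
    return torch.clamp((x / scale).round(), -128, 127).to(torch.int8)

def dequantize(x_q, scale=0.05):
    return x_q.to(torch.float32) * scale

# Per-op pattern: every quantized module converts its input and output,
# so a chain of N modules pays roughly 2*N conversions.
def chain_per_op(x, ops):
    for op in ops:
        x = dequantize(quantize(x))   # conversion overhead around every op
        x = op(x)
    return x

# Boundary pattern: convert once on the way into a run of consecutive
# quantized ops and once on the way out.
def chain_boundary(x, ops):
    x = dequantize(quantize(x))
    for op in ops:
        x = op(x)
    return dequantize(quantize(x))

ops = [torch.relu, torch.relu, torch.relu]   # stand-ins for the conv layers
y = chain_per_op(torch.randn(1, 4, 8, 8), ops)
```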

The use case I'm focusing on right now is the "TinyVAE", which is a necessity for 200 images per second and also for real-time video with LCM at 25+ frames per second. I won't show the complete model tree, but one module, the AutoencoderTinyBlock, occurs 10 times and looks like:

      (17): AutoencoderTinyBlock(
        (conv): Sequential(
          (0): qconv2d_8bit()
          (1): ReLU()
          (2): qconv2d_8bit()
          (3): ReLU()
          (4): qconv2d_8bit()
        )
        (skip): Identity()
        (fuse): ReLU()
      )

While I don't think you are quantizing the ReLU yet, I have found that there is a built-in fused ConvReLU2d operator for quint8. Unfortunately, after I got the entire model converted to quint8 correctly, I found that it currently only works on the CPU.
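
For reference, this is roughly how that built-in fusion is driven in eager-mode PyTorch quantization (the channel counts below are made-up stand-ins for the real block, and as noted the fused quint8 kernels currently run only on the CPU):

```python
import torch
from torch.ao.quantization import fuse_modules

# Float version of the conv stack shown above (channel sizes are placeholders).
block = torch.nn.Sequential(
    torch.nn.Conv2d(64, 64, 3, padding=1),
    torch.nn.ReLU(),
    torch.nn.Conv2d(64, 64, 3, padding=1),
    torch.nn.ReLU(),
    torch.nn.Conv2d(64, 64, 3, padding=1),
).eval()

# Fuse each Conv+ReLU pair; PyTorch replaces them with ConvReLU2d modules,
# which the quint8 backends lower to a single fused kernel (CPU-only today).
fused = fuse_modules(block, [["0", "1"], ["2", "3"]])
```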

If you could somehow quantize the ReLU and fuse your conv2d with it, we might be able to get a seamless end-to-end VAE in qint8. In that case I would not be surprised if it were quite a bit faster.

Just a thought given that I'm far from a quantization expert.

aifartist — Feb 26 '24 04:02

Hi! You are right, I haven't done any operation fusion yet (neither Conv+ReLU nor Dequant+Quant). Another reason for the slow inference speed is that my dequantization CUDA kernel is slow, which I am trying to optimize. The Conv+ReLU fusion should be easier to implement; I will see what I can do.

ThisisBillhe — Feb 26 '24 06:02

Thanks! No hurry but let me know. I am willing to run any tests on my 4090 once you have something.

aifartist — Feb 26 '24 07:02

Also, even if your dequant were slow, it wouldn't be too bad if it were only done once instead of once for each of the 100+ individual layers/modules in the TinyVAE.

Even though I could not execute the complete quantized VAE model I created with FX tracing on CUDA, because it only supports the CPU, I did see in the graph that the conversion to and from qint8 was only done at the beginning and end of the entire VAE forward pass.
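
A minimal sketch of the kind of FX graph-mode workflow I'm referring to, with a toy model standing in for the TinyVAE (the fbgemm qconfig keeps it quint8 and CPU-only):

```python
import torch
from torch.ao.quantization import get_default_qconfig_mapping
from torch.ao.quantization.quantize_fx import prepare_fx, convert_fx

# Toy stand-in for the TinyVAE; channel sizes are placeholders.
model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 64, 3, padding=1),
    torch.nn.ReLU(),
    torch.nn.Conv2d(64, 64, 3, padding=1),
).eval()

example_inputs = (torch.randn(1, 3, 512, 512),)
qconfig_mapping = get_default_qconfig_mapping("fbgemm")   # quint8, CPU backend

prepared = prepare_fx(model, qconfig_mapping, example_inputs)
prepared(*example_inputs)                                  # calibration pass
quantized = convert_fx(prepared)

# In the converted graph, quantize/dequantize nodes appear only at the
# boundaries of each quantized region, not around every individual op.
print(quantized.graph)
```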

aifartist — Feb 26 '24 07:02