
quantization for performance vs for memory

Open aifartist opened this issue 1 year ago • 3 comments

I push SD performance to the maximum. Currently I can generate 200 images per second on my 4090 when using 1-step sd-turbo, the onediff compiler, the stable-fast compiler, and my own optimizations. This is batch size 12 at 512x512.

I've been trying to deepen my knowledge of quantization, and I have managed to get your code to work. However, I had to fix a few things, such as making tensors contiguous, and the fact that there is no 'in_features' attribute on the layers in the models I'm using; I have to use shape[0] for it and shape[1] for out_features.
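
In case it helps, a minimal sketch of the workaround I mean, assuming the layer exposes a weight tensor. `features_from_weight` is just a made-up helper name, and the axis order should be checked against the actual weight layout of the layer being quantized:

```python
import torch
import torch.nn as nn

def features_from_weight(layer: nn.Module):
    """Hypothetical helper: recover (in_features, out_features) for a layer
    that lacks those attributes by falling back to its weight shape.
    Which axis is which depends on the layer type; a standard nn.Linear
    stores its weight as (out_features, in_features)."""
    if hasattr(layer, "in_features"):
        return layer.in_features, layer.out_features
    w = layer.weight
    return w.shape[0], w.shape[1]   # axis order assumed; verify per layer type

# Many quantization kernels also require contiguous inputs:
x = torch.randn(4, 8).t()   # a transposed view is non-contiguous
x = x.contiguous()          # make it contiguous before the quantized op
```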

Once I got it working, I was surprised it was so slow. I believe it might be because qconv2d_8bit and qlinear_8bit both seem to quantize and dequantize around every single operation, instead of doing it once when there is a sequence of consecutive operations that all support quantization. I don't even know the correct terminology to express this.
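
Roughly what I mean, in toy Python (quantize/dequantize here are simple stand-ins, not your kernels, and the ops run in float just to keep the sketch runnable):

```python
import torch

def quantize(x, scale=0.05):
    return torch.clamp((x / scale).round(), -128, 127).to(torch.int8)

def dequantize(x_q, scale=0.05):
    return x_q.to(torch.float32) * scale

# Per-op pattern: every quantized module converts its input and output,
# so a chain of N modules pays roughly 2*N conversions.
def chain_per_op(x, ops):
    for op in ops:
        x = dequantize(quantize(x))   # conversion overhead around every op
        x = op(x)
    return x

# Boundary pattern: convert once on the way into a run of consecutive
# quantized ops and once on the way out.
def chain_boundary(x, ops):
    x = dequantize(quantize(x))
    for op in ops:
        x = op(x)
    return dequantize(quantize(x))

ops = [torch.relu, torch.relu, torch.relu]   # stand-ins for the conv layers
y = chain_per_op(torch.randn(1, 4, 8, 8), ops)
```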

The use case I'm focusing on right now is the "TinyVAE", which is a necessity for 200 images per second and also for real-time video with LCM at 25+ frames per second. I won't show the complete model tree, but one module, the AutoencoderTinyBlock, occurs 10 times and looks like:

      (17): AutoencoderTinyBlock(
        (conv): Sequential(
          (0): qconv2d_8bit()
          (1): ReLU()
          (2): qconv2d_8bit()
          (3): ReLU()
          (4): qconv2d_8bit()
        )
        (skip): Identity()
        (fuse): ReLU()
      )

While I don't think you are quantizing the ReLU yet, I have found that there is a built-in fused ConvReLU2d operator for quint8. Unfortunately, after I got the entire model converted to quint8 correctly, I found that it currently only works on the CPU.
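
For reference, this is roughly how that built-in fusion is driven in eager-mode PyTorch quantization (the channel counts below are made-up stand-ins for the real block, and as noted the fused quint8 kernels currently run only on the CPU):

```python
import torch
from torch.ao.quantization import fuse_modules

# Float version of the conv stack shown above (channel sizes are placeholders).
block = torch.nn.Sequential(
    torch.nn.Conv2d(64, 64, 3, padding=1),
    torch.nn.ReLU(),
    torch.nn.Conv2d(64, 64, 3, padding=1),
    torch.nn.ReLU(),
    torch.nn.Conv2d(64, 64, 3, padding=1),
).eval()

# Fuse each Conv+ReLU pair; PyTorch replaces them with ConvReLU2d modules,
# which the quint8 backends lower to a single fused kernel (CPU-only today).
fused = fuse_modules(block, [["0", "1"], ["2", "3"]])
```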

If you could somehow quantize the ReLU and fuse your conv2d with it, we might be able to get a seamless end-to-end VAE in qint8. In that case I would not be surprised if it were quite a bit faster.

Just a thought given that I'm far from a quantization expert.

aifartist — Feb 26 '24 04:02

Hi! You are right, I haven't done any operation fusion yet (neither Conv+ReLU nor Dequant+Quant). Another reason for the slow inference speed is that my dequantization CUDA kernel is slow, which I am trying to optimize. The Conv+ReLU fusion should be easier to implement; I will see what I can do.

ThisisBillhe — Feb 26 '24 06:02

Thanks! No hurry but let me know. I am willing to run any tests on my 4090 once you have something.

aifartist — Feb 26 '24 07:02

Also, even if your dequant were slow, it wouldn't be too bad if it were only done once instead of once for each of the 100+ individual layers/modules in the TinyVAE.

Even though I could not execute the complete quantized VAE model I created with FX tracing on CUDA, because it only supports the CPU, I did see in the graph that the conversion to and from qint8 was only done at the beginning and end of the entire VAE forward pass.
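
A minimal sketch of the kind of FX graph-mode workflow I'm referring to, with a toy model standing in for the TinyVAE (the fbgemm qconfig keeps it quint8 and CPU-only):

```python
import torch
from torch.ao.quantization import get_default_qconfig_mapping
from torch.ao.quantization.quantize_fx import prepare_fx, convert_fx

# Toy stand-in for the TinyVAE; channel sizes are placeholders.
model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 64, 3, padding=1),
    torch.nn.ReLU(),
    torch.nn.Conv2d(64, 64, 3, padding=1),
).eval()

example_inputs = (torch.randn(1, 3, 512, 512),)
qconfig_mapping = get_default_qconfig_mapping("fbgemm")   # quint8, CPU backend

prepared = prepare_fx(model, qconfig_mapping, example_inputs)
prepared(*example_inputs)                                  # calibration pass
quantized = convert_fx(prepared)

# In the converted graph, quantize/dequantize nodes appear only at the
# boundaries of each quantized region, not around every individual op.
print(quantized.graph)
```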

aifartist — Feb 26 '24 07:02