
The inference speed of the int8 quantization version of SDXL is much slower than that of fp16

theNefelibata opened this issue 1 year ago · 42 comments

The inference speed of the int8-quantized version of SDXL is much slower than that of fp16. I am running the TRT 9.3 SDXL demo and here is the result (I changed the shape to 768x1344 manually).

fp16:

python3 demo_txt2img_xl.py "a photo of an astronaut riding a horse on mars" --version xl-1.0 --onnx-dir onnx-sdxl --engine-dir engine-sdxl --width 1344 --height 768

[I] Warming up ..
[I] Running StableDiffusionXL pipeline

|-----------------|--------------|
|     Module      |   Latency    |
|-----------------|--------------|
|      CLIP       |      2.45 ms |
|    UNet x 30    |   2616.81 ms |
|     VAE-Dec     |    222.92 ms |
|-----------------|--------------|
|    Pipeline     |   2851.01 ms |
|-----------------|--------------|
Throughput: 0.35 image/s

int8:

python3 demo_txt2img_xl.py "a photo of an astronaut riding a horse on mars" --version xl-1.0 --onnx-dir onnx-sdxl --engine-dir engine-sdxl --int8 --quantization-level 3 --width 1344 --height 768

[I] Warming up ..
[I] Running StableDiffusionXL pipeline

|-----------------|--------------|
|     Module      |   Latency    |
|-----------------|--------------|
|      CLIP       |      2.39 ms |
|    UNet x 30    |   5550.13 ms |
|     VAE-Dec     |    223.70 ms |
|-----------------|--------------|
|    Pipeline     |   5785.81 ms |
|-----------------|--------------|
Throughput: 0.17 image/s

theNefelibata avatar Mar 19 '24 02:03 theNefelibata

Tested on an A800 GPU.

theNefelibata avatar Mar 19 '24 03:03 theNefelibata

Have you found that the ONNX model for int8 is much bigger than the fp16 one? (screenshot: 2024-03-19 13:49:45)

ApolloRay avatar Mar 19 '24 05:03 ApolloRay

> Have you found that the ONNX model for int8 is much bigger than the fp16 one?

Yes, the ONNX model seems to be based on fp32, but the .plan file size is reasonable. (image)

theNefelibata avatar Mar 19 '24 05:03 theNefelibata

> > Have you found that the ONNX model for int8 is much bigger than the fp16 one?
>
> Yes, the ONNX model seems to be based on fp32, but the .plan file size is reasonable.

Me too. And during inference, GPU utilization is higher than with fp16. I hope this can be solved.

ApolloRay avatar Mar 19 '24 06:03 ApolloRay

Updated to PyTorch 2.0; it works, but the speed improvement is very small.

[I] Running StableDiffusionXL pipeline

|-----------------|--------------|
|     Module      |   Latency    |
|-----------------|--------------|
|      CLIP       |      2.59 ms |
|    UNet x 30    |   2315.72 ms |
|     VAE-Dec     |    216.47 ms |
|-----------------|--------------|
|    Pipeline     |   2545.01 ms |
|-----------------|--------------|
Throughput: 0.39 image/s

theNefelibata avatar Mar 20 '24 06:03 theNefelibata

> Updated to PyTorch 2.0; it works, but the speed improvement is very small. […]

This most likely isn't actually quantized, right? After 30 denoising steps, the time barely changed.

ApolloRay avatar Mar 20 '24 07:03 ApolloRay

> > Updated to PyTorch 2.0; it works, but the speed improvement is very small. […]
>
> This most likely isn't actually quantized, right? After 30 denoising steps, the time barely changed.

Actually, each step is about 10 ms faster...

theNefelibata avatar Mar 20 '24 07:03 theNefelibata

> > > Updated to PyTorch 2.0; it works, but the speed improvement is very small. […]
> >
> > This most likely isn't actually quantized, right? After 30 denoising steps, the time barely changed.
>
> Actually, each step is about 10 ms faster...

For me, torch is at 2.0.1, but nothing changed.

ApolloRay avatar Mar 20 '24 08:03 ApolloRay

@nvpohanh is it expected?

zerollzeng avatar Mar 22 '24 05:03 zerollzeng

Please try --quantization-level 2.5: TRT currently does not support INT8 MHA fusion for SeqLen > 512 for accuracy reasons, so with --quantization-level 2.5 the MHA part is left unquantized.

nvpohanh avatar Mar 22 '24 11:03 nvpohanh
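For context on why SeqLen exceeds 512 at this resolution: the UNet's self-attention runs over flattened latent feature maps, so the sequence length scales with image size. A rough back-of-the-envelope sketch (the 8x VAE downsampling is standard; the further 2x/4x attention-stage factors are an assumption about SDXL's architecture):

```python
# Rough estimate of self-attention sequence lengths in the SDXL UNet
# for a 1344x768 image. The VAE downsamples 8x to the latent; attention
# blocks run on further-downsampled feature maps (assumed 2x and 4x here).
width, height = 1344, 768
latent_w, latent_h = width // 8, height // 8  # 168 x 96

for down in (2, 4):  # assumed downsample factors of the attention stages
    seq_len = (latent_w // down) * (latent_h // down)
    print(f"downsample {down}x: SeqLen = {seq_len}")
```

Both stages come out well above 512, which is consistent with the MHA blocks falling back to the unquantized path at this resolution.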

> Please try --quantization-level 2.5: TRT currently does not support INT8 MHA fusion for SeqLen > 512 for accuracy reasons, so with --quantization-level 2.5 the MHA part is left unquantized.

(screenshot: 2024-03-25 11:00:41)

ApolloRay avatar Mar 25 '24 03:03 ApolloRay

Level 2.5 is not supported (the script rejects it), so I tried level 2. (image)

theNefelibata avatar Mar 28 '24 01:03 theNefelibata

Could you remove `choices=range(1,4)` here? https://github.com/NVIDIA/TensorRT/blob/release/9.3/demo/Diffusion/utilities.py#L502

We will fix this in the next version.

cc @rajeevsrao

nvpohanh avatar Apr 01 '24 02:04 nvpohanh
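For anyone hitting the same rejection: the suggested change is to drop the `choices` constraint on the `--quantization-level` argument so that non-integer levels like 2.5 are accepted. A minimal sketch of the fixed parser (a simplified hypothetical excerpt, not the exact utilities.py code):

```python
import argparse

parser = argparse.ArgumentParser()
# Before: choices=range(1, 4) only admits the integers 1, 2, 3,
# so --quantization-level 2.5 is rejected at parse time.
# After: drop the choices constraint and keep type=float.
parser.add_argument("--quantization-level", type=float, default=3.0,
                    help="int8 quantization level; higher quantizes more layer groups")

args = parser.parse_args(["--quantization-level", "2.5"])
print(args.quantization_level)  # 2.5
```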

> Could you remove `choices=range(1,4)` here? https://github.com/NVIDIA/TensorRT/blob/release/9.3/demo/Diffusion/utilities.py#L502
>
> We will fix this in the next version.
>
> cc @rajeevsrao

Even slower...

python demo_txt2img_xl.py "a photo of an astronaut riding a horse on mars" --version xl-1.0 --onnx-dir onnx-sdxl-int8 --engine-dir engine-sdxl-int8 --int8 --quantization-level 2.5 --width 1344 --height 768 --seed 123456

[I] Warming up ..
[I] Running StableDiffusionXL pipeline

|-----------------|--------------|
|     Module      |   Latency    |
|-----------------|--------------|
|      CLIP       |      2.60 ms |
|    UNet x 30    |   7193.33 ms |
|     VAE-Dec     |    221.53 ms |
|-----------------|--------------|
|    Pipeline     |   7427.78 ms |
|-----------------|--------------|
Throughput: 0.13 image/s

theNefelibata avatar Apr 01 '24 08:04 theNefelibata

Thanks for trying. That's surprising because it doesn't match our internal measurements... let me check internally first.

nvpohanh avatar Apr 01 '24 09:04 nvpohanh

We will have an update for int8 quantization including new AMMO version and latest calibration scripts in the upcoming GA release, which should resolve this issue. Will test internally and circle back when that is released.

akhilg-nv avatar Apr 03 '24 21:04 akhilg-nv

> We will have an update for int8 quantization including new AMMO version and latest calibration scripts in the upcoming GA release, which should resolve this issue.

OK, thanks. When will the next GA version be released?

theNefelibata avatar Apr 08 '24 01:04 theNefelibata

The release will likely be in May, but we may try to push the fix for this earlier than that.

akhilg-nv avatar Apr 09 '24 17:04 akhilg-nv

> The release will likely be in May, but we may try to push the fix for this earlier than that.

The blog claims that FP8 is faster. How do we quantize the model to FP8? Will you provide a demo?

theNefelibata avatar Apr 11 '24 06:04 theNefelibata

Yes, it is planned to add FP8 quantization to the demo. I'm not sure of the exact timeline for when full support will be added.

akhilg-nv avatar Apr 11 '24 06:04 akhilg-nv

> Yes, it is planned to add FP8 quantization to the demo. I'm not sure of the exact timeline for when full support will be added.

OK, thank you for your reply.

theNefelibata avatar Apr 11 '24 06:04 theNefelibata

Hi, please check the latest TensorRT 10.0 OSS GA release and let us know if that helps!

akhilg-nv avatar Apr 30 '24 19:04 akhilg-nv

@akhilg-nv so we need to use TRT 10.0 to check whether the quantized TRT model is faster? For what it's worth, with TRT 8.6.1 I see the quantized TRT model at twice the speed of the non-quantized TRT model.

ecilay avatar May 01 '24 05:05 ecilay

You can update the TRT package to 10.0 if you'd like. Great that you are seeing the perf improvement now! Which GPU are you testing on?

akhilg-nv avatar May 01 '24 19:05 akhilg-nv

@akhilg-nv sorry, I meant the quantized TRT model takes twice the time of the non-quantized TRT model with 8.6.1. I am on an A100 80G machine.

ecilay avatar May 01 '24 20:05 ecilay

I see. Could you post the throughput summary (similar to the one seen here) and your repro steps?

Also, could you try the result after upgrading to TRT 10.0? As noted in the readme, you can run `python3 -m pip install --upgrade pip && pip install --pre tensorrt-cu12` to upgrade the TRT package in the container.

akhilg-nv avatar May 01 '24 20:05 akhilg-nv

@akhilg-nv so I have to use the NGC container? If I just install TensorRT 10.0.1 in my own conda Linux environment, it shows `ModuleNotFoundError: No module named 'tensorrt_bindings'`. And if I install tensorrt_bindings directly, by default it downloads the one for 8.6.1.

ecilay avatar May 01 '24 21:05 ecilay

So I just tested TRT 10.0.1, and with AMMO it turns out the quantized TRT model is still slower than the non-quantized TRT model. I modified the code a little to do quantization for an inpainting use case. I can't run SDXL since it OOMs when I try to quantize it on the A100 machine. Here is what I got.

Non-quantized TRT model perf:

|-----------------|--------------|
|     Module      |   Latency    |
|-----------------|--------------|
|     VAE-Enc     |      9.70 ms |
|      CLIP       |      2.29 ms |
|    UNet x 22    |    373.10 ms |
|     VAE-Dec     |     18.85 ms |
|-----------------|--------------|
|    Pipeline     |    411.26 ms |
|-----------------|--------------|
Throughput: 2.43 image/s

Quantized TRT model:

|-----------------|--------------|
|     Module      |   Latency    |
|-----------------|--------------|
|     VAE-Enc     |      9.66 ms |
|      CLIP       |      2.27 ms |
|    UNet x 22    |    377.02 ms |
|     VAE-Dec     |     18.64 ms |
|-----------------|--------------|
|    Pipeline     |    414.32 ms |
|-----------------|--------------|
Throughput: 2.41 image/s

ecilay avatar May 01 '24 21:05 ecilay

Is the A100 a good test GPU? Which AMMO quantization settings work best on which GPUs?

ecilay avatar May 01 '24 22:05 ecilay

> Please try --quantization-level 2.5: TRT currently does not support INT8 MHA fusion for SeqLen > 512 for accuracy reasons, so with --quantization-level 2.5 the MHA part is left unquantized.

I think 2.5 is CNN+FFN+QKV and 3 is CNN+FC. If INT8 MHA fusion for SeqLen > 512 is not available, shouldn't we use quantization level 3?

ecilay avatar May 01 '24 22:05 ecilay
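For reference, the levels discussed in this thread map to quantized layer groups roughly as follows; this mapping is my reading of the comments above and the demo's documentation, so treat the exact sets as an assumption:

```python
# Assumed mapping of demoDiffusion --quantization-level values to the
# layer groups that get int8-quantized (per the discussion above).
QUANT_LEVELS = {
    1.0: {"CNN"},
    2.0: {"CNN", "FFN"},
    2.5: {"CNN", "FFN", "QKV"},
    3.0: {"CNN", "FC"},
}

# Level 2.5 quantizes the attention QKV projections, but (per the earlier
# comment) TRT leaves the fused MHA itself unquantized when SeqLen > 512.
print(sorted(QUANT_LEVELS[2.5]))
```

Under this reading, level 3 quantizes all fully-connected layers (which would include the MHA path), which is why level 2.5 was suggested as the workaround for the SeqLen > 512 restriction.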