
The inference speed of the int8 quantization version of SDXL is much slower than that of fp16

theNefelibata opened this issue 1 year ago · 42 comments

The inference speed of the int8-quantized version of SDXL is much slower than that of fp16. I am running the TRT 9.3 SDXL demo and here is the result (I changed the shape to 768x1344 manually).

fp16:

python3 demo_txt2img_xl.py "a photo of an astronaut riding a horse on mars" --version xl-1.0 --onnx-dir onnx-sdxl --engine-dir engine-sdxl --width 1344 --height 768

[I] Warming up ..
[I] Running StableDiffusionXL pipeline

|-----------------|--------------|
|     Module      |   Latency    |
|-----------------|--------------|
|      CLIP       |      2.45 ms |
|    UNet x 30    |   2616.81 ms |
|     VAE-Dec     |    222.92 ms |
|-----------------|--------------|
|    Pipeline     |   2851.01 ms |
|-----------------|--------------|
Throughput: 0.35 image/s

int8:

python3 demo_txt2img_xl.py "a photo of an astronaut riding a horse on mars" --version xl-1.0 --onnx-dir onnx-sdxl --engine-dir engine-sdxl --int8 --quantization-level 3 --width 1344 --height 768

[I] Warming up ..
[I] Running StableDiffusionXL pipeline

|-----------------|--------------|
|     Module      |   Latency    |
|-----------------|--------------|
|      CLIP       |      2.39 ms |
|    UNet x 30    |   5550.13 ms |
|     VAE-Dec     |    223.70 ms |
|-----------------|--------------|
|    Pipeline     |   5785.81 ms |
|-----------------|--------------|
Throughput: 0.17 image/s

theNefelibata avatar Mar 19 '24 02:03 theNefelibata

Tested on an A800 GPU.

theNefelibata avatar Mar 19 '24 03:03 theNefelibata

Have you found that the ONNX model for int8 is much bigger than the fp16 one? (screenshot: 2024-03-19 13:49:45)

ApolloRay avatar Mar 19 '24 05:03 ApolloRay

> Have you found that the ONNX model for int8 is much bigger than the fp16 one?

Yes, the ONNX model seems to be based on fp32, but the .plan file size is reasonable. (image)

theNefelibata avatar Mar 19 '24 05:03 theNefelibata

> > Have you found that the ONNX model for int8 is much bigger than the fp16 one?
>
> Yes, the ONNX model seems to be based on fp32, but the .plan file size is reasonable.

Me too. And during inference, GPU utilization is higher than with fp16. I hope this can be solved.

ApolloRay avatar Mar 19 '24 06:03 ApolloRay

Updated to PyTorch 2.0; it works, but the speed improvement is very small.

[I] Running StableDiffusionXL pipeline

|-----------------|--------------|
|     Module      |   Latency    |
|-----------------|--------------|
|      CLIP       |      2.59 ms |
|    UNet x 30    |   2315.72 ms |
|     VAE-Dec     |    216.47 ms |
|-----------------|--------------|
|    Pipeline     |   2545.01 ms |
|-----------------|--------------|
Throughput: 0.39 image/s

theNefelibata avatar Mar 20 '24 06:03 theNefelibata

> Updated to PyTorch 2.0; it works, but the speed improvement is very small. […]

This most likely isn't actually quantized, right? After 30 denoising steps, the time barely changed.

ApolloRay avatar Mar 20 '24 07:03 ApolloRay

> > Updated to PyTorch 2.0; it works, but the speed improvement is very small. […]
>
> This most likely isn't actually quantized, right? After 30 denoising steps, the time barely changed.

Actually, each step is about 10 ms faster...

theNefelibata avatar Mar 20 '24 07:03 theNefelibata

> > > Updated to PyTorch 2.0; it works, but the speed improvement is very small. […]
> >
> > This most likely isn't actually quantized, right? After 30 denoising steps, the time barely changed.
>
> Actually, each step is about 10 ms faster...

For me, torch is at 2.0.1, but nothing changed.

ApolloRay avatar Mar 20 '24 08:03 ApolloRay

@nvpohanh is it expected?

zerollzeng avatar Mar 22 '24 05:03 zerollzeng

Please try --quantization-level 2.5: TRT currently does not support INT8 MHA fusion for SeqLen > 512 for accuracy reasons, so with --quantization-level 2.5 the MHA part is left unquantized.

nvpohanh avatar Mar 22 '24 11:03 nvpohanh
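For context on why SeqLen exceeds 512 at this resolution: the UNet's self-attention runs over flattened latent feature maps, so the sequence length scales with image size. A rough back-of-the-envelope sketch (the 8x VAE downsampling is standard; the further 2x/4x attention-stage factors are an assumption about SDXL's architecture):

```python
# Rough estimate of self-attention sequence lengths in the SDXL UNet
# for a 1344x768 image. The VAE downsamples 8x to the latent; attention
# blocks run on further-downsampled feature maps (assumed 2x and 4x here).
width, height = 1344, 768
latent_w, latent_h = width // 8, height // 8  # 168 x 96

for down in (2, 4):  # assumed downsample factors of the attention stages
    seq_len = (latent_w // down) * (latent_h // down)
    print(f"downsample {down}x: SeqLen = {seq_len}")
```

Both stages come out well above 512, which is consistent with the MHA blocks falling back to the unquantized path at this resolution.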

> Please try --quantization-level 2.5: TRT currently does not support INT8 MHA fusion for SeqLen > 512 for accuracy reasons, so with --quantization-level 2.5 the MHA part is left unquantized.

(screenshot: 2024-03-25 11:00:41)

ApolloRay avatar Mar 25 '24 03:03 ApolloRay

Level 2.5 is not supported (the script rejects it), so I tried level 2. (image)

theNefelibata avatar Mar 28 '24 01:03 theNefelibata

Could you remove `choices=range(1,4)` here? https://github.com/NVIDIA/TensorRT/blob/release/9.3/demo/Diffusion/utilities.py#L502

We will fix this in the next version.

cc @rajeevsrao

nvpohanh avatar Apr 01 '24 02:04 nvpohanh
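For anyone hitting the same rejection: the suggested change is to drop the `choices` constraint on the `--quantization-level` argument so that non-integer levels like 2.5 are accepted. A minimal sketch of the fixed parser (a simplified hypothetical excerpt, not the exact utilities.py code):

```python
import argparse

parser = argparse.ArgumentParser()
# Before: choices=range(1, 4) only admits the integers 1, 2, 3,
# so --quantization-level 2.5 is rejected at parse time.
# After: drop the choices constraint and keep type=float.
parser.add_argument("--quantization-level", type=float, default=3.0,
                    help="int8 quantization level; higher quantizes more layer groups")

args = parser.parse_args(["--quantization-level", "2.5"])
print(args.quantization_level)  # 2.5
```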

> Could you remove `choices=range(1,4)` here? https://github.com/NVIDIA/TensorRT/blob/release/9.3/demo/Diffusion/utilities.py#L502
>
> We will fix this in the next version.
>
> cc @rajeevsrao

Even slower...

python demo_txt2img_xl.py "a photo of an astronaut riding a horse on mars" --version xl-1.0 --onnx-dir onnx-sdxl-int8 --engine-dir engine-sdxl-int8 --int8 --quantization-level 2.5 --width 1344 --height 768 --seed 123456

[I] Warming up ..
[I] Running StableDiffusionXL pipeline

|-----------------|--------------|
|     Module      |   Latency    |
|-----------------|--------------|
|      CLIP       |      2.60 ms |
|    UNet x 30    |   7193.33 ms |
|     VAE-Dec     |    221.53 ms |
|-----------------|--------------|
|    Pipeline     |   7427.78 ms |
|-----------------|--------------|
Throughput: 0.13 image/s

theNefelibata avatar Apr 01 '24 08:04 theNefelibata

Thanks for trying. That's surprising because it doesn't match our internal measurements... let me check internally first.

nvpohanh avatar Apr 01 '24 09:04 nvpohanh

We will have an update for int8 quantization including new AMMO version and latest calibration scripts in the upcoming GA release, which should resolve this issue. Will test internally and circle back when that is released.

akhilg-nv avatar Apr 03 '24 21:04 akhilg-nv

> We will have an update for int8 quantization including new AMMO version and latest calibration scripts in the upcoming GA release, which should resolve this issue.

OK, thanks. When will the next GA version be released?

theNefelibata avatar Apr 08 '24 01:04 theNefelibata

The release will likely be in May, but we may try to push the fix for this earlier than that.

akhilg-nv avatar Apr 09 '24 17:04 akhilg-nv

> The release will likely be in May, but we may try to push the fix for this earlier than that.

The blog claims that FP8 is faster. How do we quantize the model to FP8? Will you provide a demo?

theNefelibata avatar Apr 11 '24 06:04 theNefelibata

Yes, it is planned to add FP8 quantization to the demo. I'm not sure of the exact timeline for when full support will be added.

akhilg-nv avatar Apr 11 '24 06:04 akhilg-nv

> Yes, it is planned to add FP8 quantization to the demo. I'm not sure of the exact timeline for when full support will be added.

OK, thank you for your reply.

theNefelibata avatar Apr 11 '24 06:04 theNefelibata

Hi, please check the latest TensorRT 10.0 OSS GA release and let us know if that helps!

akhilg-nv avatar Apr 30 '24 19:04 akhilg-nv

@akhilg-nv so we need to use TRT 10.0 to check whether the quantized TRT model is faster? For what it's worth, with TRT 8.6.1 I see the quantized TRT model at twice the speed of the non-quantized TRT model.

ecilay avatar May 01 '24 05:05 ecilay

You can update the TRT package to 10.0 if you'd like. Great that you are seeing the perf improvement now! Which GPU are you testing on?

akhilg-nv avatar May 01 '24 19:05 akhilg-nv

@akhilg-nv sorry, I meant the quantized TRT model takes twice the time of the non-quantized TRT model with 8.6.1. I am on an A100 80G machine.

ecilay avatar May 01 '24 20:05 ecilay

I see. Could you post the throughput summary (similar to the one seen here) and your repro steps?

Also, could you try the result after upgrading to TRT 10.0? As noted in the readme, you can run `python3 -m pip install --upgrade pip && pip install --pre tensorrt-cu12` to upgrade the TRT package in the container.

akhilg-nv avatar May 01 '24 20:05 akhilg-nv

@akhilg-nv so I have to use the NGC container? If I just install TensorRT 10.0.1 in my own conda Linux environment, it shows `ModuleNotFoundError: No module named 'tensorrt_bindings'`. And if I install tensorrt_bindings directly, by default it downloads the one for 8.6.1.

ecilay avatar May 01 '24 21:05 ecilay

So I just tested TRT 10.0.1, and with AMMO it turns out the quantized TRT model is still slower than the non-quantized TRT model. I modified the code a little to do quantization for an inpainting use case. I can't run SDXL since it OOMs when I try to quantize it on the A100 machine. Here is what I got.

Non-quantized TRT model perf:

|-----------------|--------------|
|     Module      |   Latency    |
|-----------------|--------------|
|     VAE-Enc     |      9.70 ms |
|      CLIP       |      2.29 ms |
|    UNet x 22    |    373.10 ms |
|     VAE-Dec     |     18.85 ms |
|-----------------|--------------|
|    Pipeline     |    411.26 ms |
|-----------------|--------------|
Throughput: 2.43 image/s

Quantized TRT model:

|-----------------|--------------|
|     Module      |   Latency    |
|-----------------|--------------|
|     VAE-Enc     |      9.66 ms |
|      CLIP       |      2.27 ms |
|    UNet x 22    |    377.02 ms |
|     VAE-Dec     |     18.64 ms |
|-----------------|--------------|
|    Pipeline     |    414.32 ms |
|-----------------|--------------|
Throughput: 2.41 image/s

ecilay avatar May 01 '24 21:05 ecilay

Is the A100 a good test GPU? Which AMMO quantization settings work best on which GPUs?

ecilay avatar May 01 '24 22:05 ecilay

> Please try --quantization-level 2.5: TRT currently does not support INT8 MHA fusion for SeqLen > 512 for accuracy reasons, so with --quantization-level 2.5 the MHA part is left unquantized.

I think 2.5 is CNN+FFN+QKV and 3 is CNN+FC. If INT8 MHA fusion for SeqLen > 512 is not available, shouldn't we use quantization level 3?

ecilay avatar May 01 '24 22:05 ecilay
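For reference, the levels discussed in this thread map to quantized layer groups roughly as follows; this mapping is my reading of the comments above and the demo's documentation, so treat the exact sets as an assumption:

```python
# Assumed mapping of demoDiffusion --quantization-level values to the
# layer groups that get int8-quantized (per the discussion above).
QUANT_LEVELS = {
    1.0: {"CNN"},
    2.0: {"CNN", "FFN"},
    2.5: {"CNN", "FFN", "QKV"},
    3.0: {"CNN", "FC"},
}

# Level 2.5 quantizes the attention QKV projections, but (per the earlier
# comment) TRT leaves the fused MHA itself unquantized when SeqLen > 512.
print(sorted(QUANT_LEVELS[2.5]))
```

Under this reading, level 3 quantizes all fully-connected layers (which would include the MHA path), which is why level 2.5 was suggested as the workaround for the SeqLen > 512 restriction.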