TensorRT
The inference speed of the int8 quantization version of SDXL is much slower than that of fp16
The inference speed of the int8 quantized version of SDXL is much slower than that of fp16. I am running the TRT 9.3 SDXL demo; here is the result (I changed the shape to 768x1344 manually).

fp16: `python3 demo_txt2img_xl.py "a photo of an astronaut riding a horse on mars" --version xl-1.0 --onnx-dir onnx-sdxl --engine-dir engine-sdxl --width 1344 --height 768`

[I] Warming up ..
[I] Running StableDiffusionXL pipeline

| Module | Latency |
|---|---|
| CLIP | 2.45 ms |
| UNet x 30 | 2616.81 ms |
| VAE-Dec | 222.92 ms |
| Pipeline | 2851.01 ms |

Throughput: 0.35 image/s
int8: `python3 demo_txt2img_xl.py "a photo of an astronaut riding a horse on mars" --version xl-1.0 --onnx-dir onnx-sdxl --engine-dir engine-sdxl --int8 --quantization-level 3 --width 1344 --height 768`

[I] Warming up ..
[I] Running StableDiffusionXL pipeline

| Module | Latency |
|---|---|
| CLIP | 2.39 ms |
| UNet x 30 | 5550.13 ms |
| VAE-Dec | 223.70 ms |
| Pipeline | 5785.81 ms |

Throughput: 0.17 image/s
Tested on an A800 GPU.
Have you found that the ONNX model for int8 is much bigger than the fp16 one?

Yes, the ONNX model seems to be based on fp32, but the .plan file size is reasonable.
Me too. And during inference, GPU utilization is higher than with fp16. I hope this can be solved.
Updating to PyTorch 2.0 works, but the speed improvement is very small.

[I] Running StableDiffusionXL pipeline

| Module | Latency |
|---|---|
| CLIP | 2.59 ms |
| UNet x 30 | 2315.72 ms |
| VAE-Dec | 216.47 ms |
| Pipeline | 2545.01 ms |

Throughput: 0.39 image/s
Most likely it is not actually quantized, right? After 30 denoising steps, the time barely changed.
Actually, each step is about 10 ms faster...
For me, torch is at 2.0.1, but nothing happened.
@nvpohanh is it expected?
Please try `--quantization-level 2.5`, because TRT currently does not support INT8 MHA fusion for SeqLen > 512 for accuracy reasons. With `--quantization-level 2.5`, the MHA part is not quantized.
Level 2.5 is not supported; I tried level 2.
Could you remove `choices=range(1,4)` here? https://github.com/NVIDIA/TensorRT/blob/release/9.3/demo/Diffusion/utilities.py#L502
We will fix this in the next version.
cc @rajeevsrao
Even slower...

`python demo_txt2img_xl.py "a photo of an astronaut riding a horse on mars" --version xl-1.0 --onnx-dir onnx-sdxl-int8 --engine-dir engine-sdxl-int8 --int8 --quantization-level 2.5 --width 1344 --height 768 --seed 123456`

[I] Warming up ..
[I] Running StableDiffusionXL pipeline

| Module | Latency |
|---|---|
| CLIP | 2.60 ms |
| UNet x 30 | 7193.33 ms |
| VAE-Dec | 221.53 ms |
| Pipeline | 7427.78 ms |

Throughput: 0.13 image/s
Thanks for trying. That's surprising, because it doesn't match our internal measurements... Let me check internally first.
We will have an update for int8 quantization including new AMMO version and latest calibration scripts in the upcoming GA release, which should resolve this issue. Will test internally and circle back when that is released.
OK, thanks. I would also like to know when the next GA version will be released.
The release will likely be in May, but we may try to push the fix for this earlier than that.
The blog claims that FP8 is faster. How do we quantize the model to FP8? Will you provide a demo?
Yes, it is planned to add FP8 quantization to the demo. I'm not sure of the exact timeline for when full support will be added.
OK, thank you for your reply.
Hi, please check the latest TensorRT 10.0 OSS GA release and let us know if that helps!
@akhilg-nv so we need to use TRT 10.0 to check whether the quantized TRT model is faster? Agreed; with TRT 8.6.1 I see the quantized TRT model at twice the speed of the non-quantized TRT model.
You can update the TRT package to 10.0 if you'd like. Great that you are seeing the perf improvement now! Which GPU are you testing on?
@akhilg-nv sorry, I meant the quantized TRT model takes twice the time of the non-quantized TRT model with 8.6.1. I am on an A100 80G machine.
I see, could you post the throughput summary (e.g. similar to as seen here) and your repro steps?
Also, could you share the result after upgrading to TRT 10.0? As noted in the readme, you can run `python3 -m pip install --upgrade pip && pip install --pre tensorrt-cu12` to upgrade the TRT package in the container.
@akhilg-nv so I have to use the NGC container? If I just install TensorRT 10.0.1 in my own conda Linux environment, it shows `ModuleNotFoundError: No module named 'tensorrt_bindings'`. But if I download tensorrt_bindings, by default it downloads the one for 8.6.1.
So I just tested TRT 10.0.1 with AMMO, and it turns out the quantized TRT model is still slower than the non-quantized TRT model. I modified the code a little to run quantization for an inpainting use case. I can't run SDXL, since it OOMs when I try to quantize it on the A100 machine. Here is what I got:
Non-quantized TRT model perf:

| Module | Latency |
|---|---|
| VAE-Enc | 9.70 ms |
| CLIP | 2.29 ms |
| UNet x 22 | 373.10 ms |
| VAE-Dec | 18.85 ms |
| Pipeline | 411.26 ms |

Throughput: 2.43 image/s

Quantized TRT model:

| Module | Latency |
|---|---|
| VAE-Enc | 9.66 ms |
| CLIP | 2.27 ms |
| UNet x 22 | 377.02 ms |
| VAE-Dec | 18.64 ms |
| Pipeline | 414.32 ms |

Throughput: 2.41 image/s
Is the A100 a good test GPU? Which GPUs does AMMO quantization work best on?
> Please try `--quantization-level 2.5`, because TRT currently does not support INT8 MHA fusion for SeqLen > 512 for accuracy reasons. With `--quantization-level 2.5`, the MHA part is not quantized.
I think 2.5 quantizes CNN+FFN+QKV and 3 quantizes CNN+FC. If INT8 MHA fusion for SeqLen > 512 is not available, shouldn't we use quantization-level 3?

