
WOQ is not giving any performance speedup in Whisper


System Info

  • CPU: x86_64
  • CPU memory size: 96GB
  • GPU: A100
  • TensorRT-LLM tag: latest (https://github.com/NVIDIA/TensorRT-LLM/commit/5d8ca2faf74c494f220c8f71130340b513eea9a9)
  • TensorRT version: built from source for compute capability 80-real
  • Nvidia-driver version: 525.125.06
  • OS: Debian 4.19.304-1 (2024-01-09) x86_64 GNU/Linux

Who can help?

@kaiyux

Information

  • [X] The official example scripts
  • [ ] My own modified scripts

Tasks

  • [X] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [ ] My own task or dataset (give details below)

Reproduction

To reproduce this issue, build the Whisper medium model in two versions: one as normal and the other using weight-only quantization (WOQ).
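
For reference, a rough sketch of how the two engine variants can be produced with the scripts under examples/whisper. The script name and flags shown here (build.py, --use_weight_only, --weight_only_precision) are assumptions based on that example's workflow and may differ between TensorRT-LLM releases:

```python
# Hypothetical sketch only: build Whisper medium twice with the examples/whisper
# build script -- once in plain FP16 and once with int8 weight-only quantization
# (WOQ). Script path and flag names are assumptions and may vary per release.
import subprocess

common = [
    "python3", "build.py",
    "--model_name", "medium",
    "--max_batch_size", "10",
    "--max_beam_width", "5",
    "--dtype", "float16",
]

# Plain FP16 engines (encoder + decoder).
subprocess.run(common + ["--output_dir", "whisper_medium_bs10_bw5_FP16"], check=True)

# Weight-only int8 (WOQ) engines.
subprocess.run(
    common + [
        "--output_dir", "whisper_medium_woq_bs10_bw5_FP16",
        "--use_weight_only",
        "--weight_only_precision", "int8",
    ],
    check=True,
)
```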

Expected Behavior

The version built with WOQ is expected to have better performance.

Actual Behavior

********
engine_dir: /app/tensorrt_llm/examples/ts-whisper/ts_whisper/benchmark/whisper_models/whisper_medium_bs10_bw5_FP16
RTF: 0.0174
total_duration: 481.035 seconds (0.13 hours)
processing time: 8.357 seconds (0.00 hours)
batch size: 10
num_beams: 5
total error rate: 4.62%
********
engine_dir: /app/tensorrt_llm/examples/ts-whisper/ts_whisper/benchmark/whisper_models/whisper_medium_woq_bs10_bw5_FP16
RTF: 0.0174
total_duration: 481.035 seconds (0.13 hours)
processing time: 8.365 seconds (0.00 hours)
batch size: 10
num_beams: 5
total error rate: 4.70%
********
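
For context on how to read these numbers: RTF (real-time factor) is simply processing time divided by total audio duration, so both engines land at essentially the same value. A minimal sketch of the arithmetic, using the figures logged above:

```python
# Real-time factor (RTF) = processing time / total audio duration,
# using the numbers logged above for the FP16 and WOQ int8 engines.
total_duration = 481.035  # seconds of audio in the benchmark set

for name, processing_time in [("FP16", 8.357), ("WOQ int8", 8.365)]:
    rtf = processing_time / total_duration
    print(f"{name}: RTF = {rtf:.4f}")  # ~0.0174 for both engines
```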

Additional Notes

I think this issue may be related to GPU architecture. Is there another method for using quantization with the Whisper model?

robosina (May 22, 2024)

I think this issue may be related to GPU architecture. Is there another method for using quantization with the Whisper model?

Yes, WOQ int8 cannot significantly improve inference speed, but it can reduce memory usage while keeping the same speed. For example, the memory usage of Whisper Large can be reduced by 1.5 GB. In the future we will support the FP8 quantization scheme, which can improve inference speed. On the Ampere architecture, SmoothQuant int8 can also speed up inference, but its support is a lower priority than FP8.

Additionally, you can try Whisper Large v3: on an A100 it has better accuracy and is not much slower than the Medium model. @robosina

yuekaizhang (May 23, 2024)

@yuekaizhang Thanks for the answer, but please check the following information:

**********************************************************
running benchmark for whisper_medium_bs10_bw5_FP16
**********************************************************
engine_dir: /app/tensorrt_llm/examples/ts-whisper/ts_whisper/benchmark/whisper_models/whisper_medium_bs10_bw5_FP16
RTF: 0.0172
total_duration: 481.035 seconds (0.13 hours)
processing time: 8.285 seconds (0.00 hours)
batch size: 10
num_beams: 5
total error rate: 4.62%
total 1.6G
-rw-r--r-- 1 root root 1.7K May 22 20:06 decoder_config.json
-rw-r--r-- 1 root root 1.4K May 22 20:06 encoder_config.json
-rw-r--r-- 1 root root 978M May 22 20:06 whisper_decoder_float16_tp1_rank0.engine
-rw-r--r-- 1 root root 589M May 22 20:06 whisper_encoder_float16_tp1_rank0.engine
**********************************************************
running benchmark for whisper_medium_woq_bs10_bw5_FP16
**********************************************************
engine_dir: /app/tensorrt_llm/examples/ts-whisper/ts_whisper/benchmark/whisper_models/whisper_medium_woq_bs10_bw5_FP16
RTF: 0.0178
total_duration: 481.035 seconds (0.13 hours)
processing time: 8.556 seconds (0.00 hours)
batch size: 10
num_beams: 5
total error rate: 4.70%
total 897M
-rw-r--r-- 1 root root 1.7K May 22 20:16 decoder_config.json
-rw-r--r-- 1 root root 1.4K May 22 20:16 encoder_config.json
-rw-r--r-- 1 root root 595M May 22 20:16 whisper_decoder_float16_tp1_rank0.engine
-rw-r--r-- 1 root root 302M May 22 20:16 whisper_encoder_float16_tp1_rank0.engine
**********************************************************
running benchmark for whisper_medium_woq4_bs10_bw5_FP16
**********************************************************
engine_dir: /app/tensorrt_llm/examples/ts-whisper/ts_whisper/benchmark/whisper_models/whisper_medium_woq4_bs10_bw5_FP16
RTF: 0.0181
total_duration: 481.035 seconds (0.13 hours)
processing time: 8.697 seconds (0.00 hours)
batch size: 10
num_beams: 5
total error rate: 4.70%
total 561M
-rw-r--r-- 1 root root 1.7K May 23 11:06 decoder_config.json
-rw-r--r-- 1 root root 1.4K May 23 11:05 encoder_config.json
-rw-r--r-- 1 root root 403M May 23 11:06 whisper_decoder_float16_tp1_rank0.engine
-rw-r--r-- 1 root root 158M May 23 11:05 whisper_encoder_float16_tp1_rank0.engine

Based on the above information, I'm sure the second and third models are using WOQ, since their engines take up less disk space (see the ls output above). However, please check the memory consumption.

[Figure: GPU memory usage plot comparing the three engine builds]

In this case, the second one uses the WOQ method and the third one uses WOQ 4-bit. As you can see, the memory consumption is not significantly better. Is this usual, or am I encountering a bug here? Thanks in advance.
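
One way to double-check the plot independently is to sample device memory with NVML while each engine is loaded. A minimal sketch, assuming pynvml is installed and the GPU is otherwise idle:

```python
# Minimal sketch: sample GPU memory via NVML to compare the FP16, WOQ int8,
# and WOQ int4 engines. Measures whole-device usage, so load and run one
# engine at a time on an otherwise idle GPU.
import pynvml

def gpu_memory_used_mib(device_index: int = 0) -> float:
    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
        info = pynvml.nvmlDeviceGetMemoryInfo(handle)
        return info.used / (1024 ** 2)
    finally:
        pynvml.nvmlShutdown()

# Call once after the engine is loaded and again at peak decoding,
# then compare the readings across the three engine builds.
print(f"GPU memory in use: {gpu_memory_used_mib():.0f} MiB")
```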

robosina (May 23, 2024)

processing time: 8.697 seconds (0.00 hours) batch size: 10 num_beams: 5 total error rate: 4.70%

@robosina Hi, thanks for your investigation. I have done some perf runs and indeed found that the Whisper encoder is faster in fp16 than with int8 WOQ. We will work on the issue and update here.

For the Whisper decoder, WOQ int8 should be faster than fp16.

As for VRAM usage: Whisper medium has about 0.7B parameters, so WOQ int8 should reduce VRAM usage by roughly 700 MB, and WOQ int4 by roughly another 350 MB.
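
As a rough sanity check of those figures, the savings follow directly from the parameter count and the bytes stored per weight; a back-of-the-envelope sketch:

```python
# Back-of-the-envelope weight-memory estimate for Whisper medium (~0.7B params).
# Activations, KV cache, and runtime workspace come on top of this, so total
# VRAM does not shrink by the full weight-size difference.
params = 0.7e9
bytes_per_weight = {"fp16": 2.0, "int8 WOQ": 1.0, "int4 WOQ": 0.5}

for name, b in bytes_per_weight.items():
    print(f"{name}: ~{params * b / 1e9:.2f} GB of weights")
# fp16 ~1.40 GB, int8 ~0.70 GB (saves ~0.7 GB), int4 ~0.35 GB (saves ~0.35 GB more)
```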

yuekaizhang (May 27, 2024)

This issue is stale because it has been open 30 days with no activity. Remove the stale label or comment, or this will be closed in 15 days.

github-actions[bot] (Jun 27, 2024)

Hi @robosina, do you still have any further issue or question now? If not, we'll close it soon.

nv-guomingz (Nov 14, 2024)

Hi @nv-guomingz, yes, this issue has been resolved in the recent releases, so I will close it myself. Thanks for the support.

robosina (Nov 14, 2024)