TensorRT-LLM
Whisper build fails with `--remove_input_padding` option
System Info
- GPU: V100, A100
- Docker image: nvidia/cuda:12.1.0-devel-ubuntu22.04
- tensorrt-llm: 0.9.0.dev2024020600
Who can help?
@byshiue
Information
- [X] The official example scripts
- [ ] My own modified scripts
Tasks
- [X] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
Running the build.py script from examples/whisper with the --remove_input_padding option:
python3 build.py --output_dir whisper_large_v3_no_pad --use_gpt_attention_plugin --use_gemm_plugin --use_bert_attention_plugin --model_dir /assets/ --model_name large-v3 --remove_input_padding
Expected behavior
A serialized engine is expected, without the need to pad each batch to 30-second samples of shape [batch_size, n_mels, 3000].
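For context, the standard Whisper pipeline right-pads (or truncates) every log-mel spectrogram to 3000 frames, i.e. 30 seconds at a 10 ms hop. A minimal NumPy sketch of that padding step (the `pad_mel` helper and the 128-mel-bin count for large-v3 are illustrative, not from the TensorRT-LLM example):

```python
import numpy as np

N_MELS = 128     # Whisper large-v3 uses 128 mel bins
N_FRAMES = 3000  # 30 s of audio at a 10 ms hop

def pad_mel(mel: np.ndarray) -> np.ndarray:
    """Right-pad (or truncate) a [n_mels, T] mel spectrogram to 3000 frames."""
    n_mels, t = mel.shape
    if t >= N_FRAMES:
        return mel[:, :N_FRAMES]
    # zero-pad only along the time axis
    return np.pad(mel, ((0, 0), (0, N_FRAMES - t)))

# e.g. 10 s of audio -> 1000 frames, padded up to 3000
mel = np.random.randn(N_MELS, 1000).astype(np.float32)
padded = pad_mel(mel)
print(padded.shape)  # (128, 3000)
```

Avoiding this padding for short clips is exactly what --remove_input_padding is expected to enable.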
Actual behavior
The build.py script fails with an assertion in the encoder attention layer:
Traceback (most recent call last):
File "/TensorRT-LLM/examples/whisper/build.py", line 365, in <module>
run_build(args)
File "/TensorRT-LLM/examples/whisper/build.py", line 359, in run_build
build_encoder(model, args)
File "/TensorRT-LLM/examples/whisper/build.py", line 228, in build_encoder
tensorrt_llm_whisper_encoder(*inputs)
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/module.py", line 40, in __call__
output = self.forward(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/enc_dec/model.py", line 1406, in forward
hidden_states = encoder_layer(hidden_states,
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/module.py", line 40, in __call__
output = self.forward(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/enc_dec/model.py", line 239, in forward
attention_output = self.attention(hidden_states,
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/module.py", line 40, in __call__
output = self.forward(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/layers/attention.py", line 1174, in forward
assert qkv.ndim() == 2
AssertionError
Additional notes
Code snippet from tensorrt_llm/layers/attention.py:
if default_net().plugin_config.remove_input_padding:
assert qkv.ndim() == 2
Actual shape of qkv:
BertAttention.forward.qkv.shape = (-1, 1500, 3840)
Without the --remove_input_padding option, everything works as expected.
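The assertion boils down to a layout mismatch: the encoder still produces a padded 3-D tensor [batch, seq_len, 3*hidden], while the attention plugin, when remove_input_padding is on, expects all sequences packed into a single 2-D token axis. A NumPy sketch of the two layouts (the shapes follow the traceback above; the example batch and sequence lengths are illustrative):

```python
import numpy as np

HIDDEN = 1280     # Whisper large-v3 d_model
QKV = 3 * HIDDEN  # fused Q/K/V projection width -> 3840, as in the traceback

# Padded layout: [batch, seq_len, 3*hidden] -- what the encoder produces.
batch, seq_len = 4, 1500
qkv_padded = np.zeros((batch, seq_len, QKV), dtype=np.float32)
assert qkv_padded.ndim == 3  # trips the plugin's `qkv.ndim() == 2` check

# Packed layout: sequences of (possibly different) lengths concatenated
# along one token axis -- what remove_input_padding expects.
seq_lens = [1500, 900, 1200, 300]
qkv_packed = np.concatenate(
    [np.zeros((n, QKV), dtype=np.float32) for n in seq_lens], axis=0)
assert qkv_packed.ndim == 2  # shape: [sum(seq_lens), 3*hidden]
print(qkv_packed.shape)      # (3900, 3840)
```

So until the Whisper encoder emits the packed 2-D layout, the assertion in BertAttention will fire whenever the flag is set.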
@lightbooster Hi, Whisper does not support this option yet. I will update here when it works or when we can remove the 30s restriction.
Is this supported now?
Currently, for Distil-Whisper or fine-tuned Whisper models, it is possible to configure audio lengths other than 30 seconds. The --remove_input_padding option is also supported, but it does not actually remove the padding internally; it only supports the input and output format. Support for arbitrary-length audio input has not yet been implemented.
We are using Whisper for streaming speech recognition. Will this padding increase the amount of computation at the beginning of the audio stream, and will it affect inference speed?
It will increase computation, but not by much, because a large part of the model's runtime is determined by the number of autoregressive decoder steps, and padding does not increase that number.
By the way, because training and inference must be consistent, the accuracy of the native Whisper model will be compromised if the input is audio other than 30 seconds.
Thanks for the answer. We removed padding during training.