TensorRT-LLM
Whisper build fails with `--remove_input_padding` option
System Info
- GPU: V100, A100
- Docker image: nvidia/cuda:12.1.0-devel-ubuntu22.04
- tensorrt-llm: 0.9.0.dev2024020600
Who can help?
@byshiue
Information
- [X] The official example scripts
- [ ] My own modified scripts
Tasks
- [X] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
Running the build.py script from examples/whisper with the --remove_input_padding option:
python3 build.py --output_dir whisper_large_v3_no_pad --use_gpt_attention_plugin --use_gemm_plugin --use_bert_attention_plugin --model_dir /assets/ --model_name large-v3 --remove_input_padding
Expected behavior
A serialized engine is expected, without the need to pad each batch to 30-second samples of shape [batch_size, n_mels, 3000].
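For context, the standard Whisper pipeline right-pads (or truncates) every log-mel spectrogram to 3000 frames, i.e. 30 seconds at a 10 ms hop. A minimal NumPy sketch of that padding step (the `pad_mel` helper and the 128-mel-bin count for large-v3 are illustrative, not from the TensorRT-LLM example):

```python
import numpy as np

N_MELS = 128     # Whisper large-v3 uses 128 mel bins
N_FRAMES = 3000  # 30 s of audio at a 10 ms hop

def pad_mel(mel: np.ndarray) -> np.ndarray:
    """Right-pad (or truncate) a [n_mels, T] mel spectrogram to 3000 frames."""
    n_mels, t = mel.shape
    if t >= N_FRAMES:
        return mel[:, :N_FRAMES]
    # zero-pad only along the time axis
    return np.pad(mel, ((0, 0), (0, N_FRAMES - t)))

# e.g. 10 s of audio -> 1000 frames, padded up to 3000
mel = np.random.randn(N_MELS, 1000).astype(np.float32)
padded = pad_mel(mel)
print(padded.shape)  # (128, 3000)
```

Avoiding this padding for short clips is exactly what --remove_input_padding is expected to enable.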
Actual behavior
The build.py script fails with an assertion in the encoder attention layer:
Traceback (most recent call last):
File "/TensorRT-LLM/examples/whisper/build.py", line 365, in <module>
run_build(args)
File "/TensorRT-LLM/examples/whisper/build.py", line 359, in run_build
build_encoder(model, args)
File "/TensorRT-LLM/examples/whisper/build.py", line 228, in build_encoder
tensorrt_llm_whisper_encoder(*inputs)
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/module.py", line 40, in __call__
output = self.forward(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/enc_dec/model.py", line 1406, in forward
hidden_states = encoder_layer(hidden_states,
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/module.py", line 40, in __call__
output = self.forward(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/enc_dec/model.py", line 239, in forward
attention_output = self.attention(hidden_states,
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/module.py", line 40, in __call__
output = self.forward(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/layers/attention.py", line 1174, in forward
assert qkv.ndim() == 2
AssertionError
Additional notes
Code snippet from tensorrt_llm/layers/attention.py:
if default_net().plugin_config.remove_input_padding:
assert qkv.ndim() == 2
Actual shape of qkv:
BertAttention.forward.qkv.shape = (-1, 1500, 3840)
Without the --remove_input_padding option, everything works as expected.
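The assertion boils down to a layout mismatch: the encoder still produces a padded 3-D tensor [batch, seq_len, 3*hidden], while the attention plugin, when remove_input_padding is on, expects all sequences packed into a single 2-D token axis. A NumPy sketch of the two layouts (the shapes follow the traceback above; the example batch and sequence lengths are illustrative):

```python
import numpy as np

HIDDEN = 1280     # Whisper large-v3 d_model
QKV = 3 * HIDDEN  # fused Q/K/V projection width -> 3840, as in the traceback

# Padded layout: [batch, seq_len, 3*hidden] -- what the encoder produces.
batch, seq_len = 4, 1500
qkv_padded = np.zeros((batch, seq_len, QKV), dtype=np.float32)
assert qkv_padded.ndim == 3  # trips the plugin's `qkv.ndim() == 2` check

# Packed layout: sequences of (possibly different) lengths concatenated
# along one token axis -- what remove_input_padding expects.
seq_lens = [1500, 900, 1200, 300]
qkv_packed = np.concatenate(
    [np.zeros((n, QKV), dtype=np.float32) for n in seq_lens], axis=0)
assert qkv_packed.ndim == 2  # shape: [sum(seq_lens), 3*hidden]
print(qkv_packed.shape)      # (3900, 3840)
```

So until the Whisper encoder emits the packed 2-D layout, the assertion in BertAttention will fire whenever the flag is set.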
@lightbooster Hi, Whisper does not support this option yet. I will update here when it works or when we can remove the 30s restriction.
Is this supported now?
Currently, for Distil-Whisper or fine-tuned Whisper models, it is possible to configure audio lengths other than 30 seconds. The --remove_input_padding option is also supported, but it does not actually remove the padding internally; it only supports the input and output format. Support for arbitrary-length audio input has not yet been implemented.
We are using Whisper for streaming speech recognition. Will this padding increase the amount of computation at the beginning of the audio stream, and will it affect inference speed?
It will increase computation, but not by much, because a large part of the model's runtime is determined by the number of autoregressive decoder steps, and padding does not increase that number.
By the way, because training and inference must be consistent, the accuracy of the native Whisper model will be compromised if the input is audio other than 30 seconds.
Thanks for the answer. We removed padding during training.