TensorRT-LLM
100% WER on distil-whisper/distil-large-v2
System Info
DGX V100 and DGX A100
Who can help?
@ncomly-nvidia to add more folks.
Information
- [ ] The official example scripts
- [ ] My own modified scripts
Tasks
- [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
Followed the whisper example. Got example engines working on A100 80GB and V100-16GB.
To save the HF model in bin format I did:
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq, pipeline
import torch
from datasets import load_dataset, load_from_disk

# Use fp16 on GPU, fp32 on CPU
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "distil-whisper/distil-large-v2"

# Load the HF checkpoint and re-save it as a .bin checkpoint (not safetensors)
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, use_safetensors=False
)
model.save_pretrained('./distil-whisper/distil-large-v2', safe_serialization=False)
I had to download the mel_filters.npz and gpt2.tiktoken separately per the directions.
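For completeness, here is a minimal Python sketch of that download step; the gpt2.tiktoken URL is the one from the example README, while the mel_filters.npz URL is an assumption that the file sits in the same openai/whisper assets directory:

import os
import urllib.request

assets_dir = "assets"
os.makedirs(assets_dir, exist_ok=True)

asset_urls = [
    # URL taken from the whisper example README
    "https://raw.githubusercontent.com/openai/whisper/main/whisper/assets/gpt2.tiktoken",
    # Assumed to live alongside gpt2.tiktoken in the openai/whisper repo
    "https://raw.githubusercontent.com/openai/whisper/main/whisper/assets/mel_filters.npz",
]

for url in asset_urls:
    dest = os.path.join(assets_dir, os.path.basename(url))
    if not os.path.exists(dest):
        urllib.request.urlretrieve(url, dest)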
Example build and run cmds:
output_dir=distil_whisper_large_v2
python3 build.py --model_dir /workspace/models/whisper/assets/ --model_name distil-large-v2 --output_dir $output_dir --dtype float16 --enable_context_fmha --use_gpt_attention_plugin --use_gemm_plugin --use_bert_attention_plugin float16
python3 run.py --engine_dir $output_dir --dataset hf-internal-testing/librispeech_asr_dummy --name librispeech_dummy_output --tokenizer_name gpt2 --assets_dir /models/whisper/assets/ --dataset librispeech_asr --results_dir /models/whisper/results
Expected behavior
Not getting >100% WER on librispeech_asr :)
Actual behavior
in errs-librispeech.txt
%WER = 150.73
Errors: 28722 insertions, 3162 deletions, 50714 substitutions, over 54798 reference words (922 correct)
Search below for sections starting with PER-UTT DETAILS:, SUBSTITUTIONS:, DELETIONS:, INSERTIONS:, PER-WORD STATS:
in rtf-librispeech.txt
RTF: 0.0098
total_duration: 19396.121 seconds (5.39 hours)
processing time: 189.115 seconds (0.05 hours)
batch size: 4
num_beams: 1
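As a quick sanity check on those two logs, redoing the arithmetic (plain Python, nothing TensorRT-LLM specific) reproduces both reported figures:

# Recompute WER and RTF from the counts reported above
insertions, deletions, substitutions = 28722, 3162, 50714
reference_words = 54798
wer = 100.0 * (insertions + deletions + substitutions) / reference_words
print(f"%WER = {wer:.2f}")  # 150.73 -- dominated by insertions and substitutions

processing_time_s = 189.115
total_audio_s = 19396.121
print(f"RTF = {processing_time_s / total_audio_s:.4f}")  # 0.0098 -- fast, but the transcripts are wrong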
Additional notes
n/a
Did you first use this file: https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/whisper/distil_whisper/convert_from_distil_whisper.py ?
See https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/whisper#distil-whisper; you may need to convert the Hugging Face checkpoint first.
@esnvidia
Yes, here's the exact steps I ran:
https://github.com/esnvidia/distil_whisper_hf2_triton
The test step:
python run.py --engine_dir $output_dir --name librispeech_dummy_output --tokenizer_name gpt2 --assets_dir ./assets/ --dataset librispeech_asr --results_dir ./results
Needs a little tweak to the cmd but should be simple for you to figure out.
Oh, I see. For distil-large-v2, you should use the default multilingual tokenizer rather than gpt2. @esnvidia
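(A small sketch of why the wrong tokenizer yields >100% WER, assuming the openai-whisper package is available: the model emits token IDs from the multilingual vocabulary, and decoding those IDs with the gpt2 vocabulary produces unrelated text.)

# Sketch: decode the same IDs with the two Whisper tokenizers
# (get_tokenizer comes from the openai-whisper package).
from whisper.tokenizer import get_tokenizer

multilingual = get_tokenizer(multilingual=True)    # backed by multilingual.tiktoken
english_only = get_tokenizer(multilingual=False)   # backed by gpt2.tiktoken

ids = multilingual.encode(" hello world")
print(multilingual.decode(ids))   # " hello world"
print(english_only.decode(ids))   # same IDs, different vocabulary -> garbled text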
Yes, here's the exact steps I ran: https://github.com/esnvidia/distil_whisper_hf2_triton
Also, you are welcome to contribute this Triton model_repo for Distil-Whisper to sherpa/triton/whisper if you have some free time.
@yuekaizhang Are you sure it's multilingual? The step in the example shows gpt2:
here is the cmd
python3 run.py --engine_dir $output_dir --dataset hf-internal-testing/librispeech_asr_dummy --name librispeech_dummy_${output_dir} --tokenizer_name gpt2
as well as this step:
# download the gpt2.tiktoken
wget --directory-prefix=assets https://raw.githubusercontent.com/openai/whisper/main/whisper/assets/gpt2.tiktoken
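(If the multilingual tokenizer is what's needed, the asset to fetch would presumably be multilingual.tiktoken instead; the URL below is an assumption that it sits next to gpt2.tiktoken in the openai/whisper repo.)

import urllib.request

# Assumed location, mirroring the gpt2.tiktoken link above
url = "https://raw.githubusercontent.com/openai/whisper/main/whisper/assets/multilingual.tiktoken"
urllib.request.urlretrieve(url, "assets/multilingual.tiktoken")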
@yuekaizhang confirmed the need for multilingual. This needs to be updated in the docs.
@yuekaizhang confirmed the need for multilingual. This needs to be updated in the docs.
Updated it. Now users don't need to specify tokenizer_name by themselves.
Awesome, but I still don't see the change reflected in the main branch. I'm looking here: https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/whisper#distil-whisper
Is there a PR tied to this?
Also getting 100% WER using the Triton-ASR-Client, by the way. Let me know if you want me to file an issue there. I think the fix simply involves copying the functions from the run.py here, since I was able to get 3% WER with that.
I can contribute to sherpa etc once this works E2E. :)
Is there a PR tied to this?
Yes. I have updated it in GitLab. It will sync to GitHub in several days.
Also getting 100% WER using the Triton-ASR-Client by the way. Let me know if you want me to file an issue there. I think it simply involves copying the functions from the run.py here since I was able to get the 3% WER with that.
https://github.com/k2-fsa/sherpa/tree/master/triton/whisper#benchmark-using-dataset Could you try --whisper-prompt "<|startoftranscript|><|en|><|transcribe|><|notimestamps|>"? If it doesn't work, you may file an issue under sherpa and attach more details. I will investigate there.
I can contribute to sherpa etc once this works E2E. :)
That sounds great. @esnvidia
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 15 days.