
TensorRT - avg_logprob

Open OValery16 opened this issue 1 year ago • 4 comments

Thanks for your really impressive work.

I was wondering how to extract the token probabilities with the TensorRT backend (a bit like what you did in this example with CTranslate2):

import whisper_s2t

model = whisper_s2t.load_model(model_identifier="large-v2", backend='CTranslate2')

files = ['data/KINCAID46/audio/1.wav']
lang_codes = ['en']
tasks = ['transcribe']
initial_prompts = [None]

out = model.transcribe_with_vad(files,
                                lang_codes=lang_codes,
                                tasks=tasks,
                                initial_prompts=initial_prompts,
                                batch_size=32)

print(out[0][0])
"""
[Console Output]

{'text': "Let's bring in Phil Mackie who is there at the palace. We're looking at Teresa and Philip May. Philip, can you see how he's being transferred from the helicopters? It looks like, as you said, the beast. It's got its headlights on because the sun is beginning to set now, certainly sinking behind some clouds. It's about a quarter of a mile away down the Grand Drive",
 'avg_logprob': -0.25426941679184695,
 'no_speech_prob': 8.147954940795898e-05,
 'start_time': 0.0,
 'end_time': 24.8}
"""

OValery16 · Feb 20 '24 13:02

I'm asking because having access to these scores would let us implement a language detection method, along the lines of the sketch below.
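Something like this is what I have in mind, reusing the transcribe_with_vad call from above (untested sketch; the candidate language set is made up, and it assumes avg_logprob values are comparable across lang_codes):

import whisper_s2t

model = whisper_s2t.load_model(model_identifier="large-v2", backend='CTranslate2')

files = ['data/KINCAID46/audio/1.wav']
candidate_langs = ['en', 'fr', 'de']  # hypothetical candidate set

# Score the same file under each candidate language and keep the
# language whose transcription gets the highest avg_logprob.
scores = {}
for lang in candidate_langs:
    out = model.transcribe_with_vad(files,
                                    lang_codes=[lang],
                                    tasks=['transcribe'],
                                    initial_prompts=[None],
                                    batch_size=32)
    scores[lang] = out[0][0]['avg_logprob']

detected = max(scores, key=scores.get)
print(detected, scores)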

OValery16 · Feb 20 '24 21:02

@OValery16 https://github.com/NVIDIA/TensorRT-LLM/issues/1127

shashikg · Feb 21 '24 17:02

> Thanks for your really impressive work. I was wondering how to extract the token probabilities with the TensorRT backend (a bit like what you did in this example with CTranslate2) […]

You may check this https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/runtime/generation.py#L2267. You may also need to register the logits as one of the output tensors.
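Getting the logits out of the engine is the TensorRT-LLM-specific part (hence registering them as an output tensor and rebuilding the engine). Once you have a [num_tokens, vocab_size] logits tensor for the generated tokens, the scoring itself is backend-independent. A rough sketch of that second half (the function name and shapes are illustrative, not the actual WhisperS2T or TensorRT-LLM API):

import torch
import torch.nn.functional as F

def avg_logprob_from_logits(logits: torch.Tensor, token_ids: torch.Tensor) -> float:
    # logits: [num_tokens, vocab_size] raw decoder outputs for the generated tokens.
    # token_ids: [num_tokens] ids of the tokens that were actually decoded.
    log_probs = F.log_softmax(logits.float(), dim=-1)
    # Pick out the log probability of each decoded token, then average.
    token_log_probs = log_probs.gather(1, token_ids.unsqueeze(1)).squeeze(1)
    return token_log_probs.mean().item()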

yuekaizhang · Feb 22 '24 05:02

Thanks for your tips. I don't really see how to do it without modifying the TensorRT-LLM library. Do you know how to do it?

OValery16 · Mar 01 '24 10:03