
Support inference from fine-tuned 🤗 transformers Whisper models

Open Vaibhavs10 opened this issue 3 years ago • 7 comments

Hi @m-bain,

This is a very cool repository and definitely useful for getting more reliable and accurate timestamps for the generated transcriptions. I was wondering if you'd like to extend the current transcription codebase to also support transformers fine-tuned Whisper checkpoints as well.

For context, we recently ran a Whisper fine-tuning event powered by 🤗 transformers and over the course of the event we managed to fine-tune 650+ Whisper checkpoints, across 112 languages. You can find the leaderboard here: https://huggingface.co/spaces/whisper-event/leaderboard

In almost all cases, the fine-tuned models beat the original Whisper model's zero-shot performance by a huge margin.

I think it'll be of huge benefit to the community to be able to use these models with your repo. Happy to support you if you have any questions on the 🤗 transformers side. :)

Cheers, VB

Vaibhavs10 avatar Jan 08 '23 10:01 Vaibhavs10

I see. What do the Hugging Face Whisper models output? If it's a list of dictionaries with "text", "start", and "end", it can be fed straight into whisperx.align.

See:

import whisperx

device = "cuda" 
audio_file = "audio.mp3"

# transcribe with original whisper / or huggingface finetuned
model = whisperx.load_model("large", device)
result = model.transcribe(audio_file)
# where result["segments"] is  List[Dict{"text": str, "start": float (seconds), "end": float (seconds)}]

# load alignment model and metadata
model_a, metadata = whisperx.load_align_model(language_code=result["language"], device=device)

# align whisper output
result_aligned = whisperx.align(result["segments"], model_a, metadata, audio_file, device)

print(result_aligned["segments"]) # after alignment
print(result_aligned["word_segments"]) # after alignment

Or are you thinking of CLI support with a Hugging Face model?

m-bain avatar Jan 08 '23 14:01 m-bain


Hugging Face Whisper only outputs the transcribed text from the file, without timestamps.

stephenasuncionDEV avatar Jan 11 '23 07:01 stephenasuncionDEV

Thanks for the detailed information @m-bain. As @stephenasuncionDEV mentioned, the Whisper implementation in transformers does not currently support timestamps. However, we are working on adding support; you can check the PR here: https://github.com/huggingface/transformers/pull/20620

I'll ping back once we have this merged! Thanks again for your support.
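Once timestamps land in transformers, bridging to whisperx should be a small mapping step: the ASR pipeline (with `return_timestamps=True`) emits "chunks" carrying `(start, end)` timestamp tuples, which can be reshaped into the segment dicts `whisperx.align` expects. A minimal sketch of that glue; the helper name is mine, and the chunk values are made up for illustration:

```python
def chunks_to_segments(chunks):
    """Map transformers ASR 'chunks' (each with a (start, end) 'timestamp'
    tuple) to whisperx-style segment dicts."""
    return [
        {"text": c["text"], "start": c["timestamp"][0], "end": c["timestamp"][1]}
        for c in chunks
    ]

# Example output in the shape the transformers pipeline produces with
# return_timestamps=True (values here are illustrative, not real):
chunks = [
    {"timestamp": (0.0, 2.5), "text": " Hello there."},
    {"timestamp": (2.5, 5.0), "text": " This is a test."},
]
segments = chunks_to_segments(chunks)
# segments can then go to whisperx.align(segments, model_a, metadata, audio_file, device)
print(segments)
```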

Vaibhavs10 avatar Jan 12 '23 13:01 Vaibhavs10

v3 uses the faster-whisper backend, which can load fine-tuned Whisper weights: https://github.com/guillaumekln/faster-whisper/blob/d889345e071de21a83bdae60ba4b07110cfd0696/README.md?plain=1#L142

Feel free to open a pull request to add this functionality; it would require passing a custom model_path.
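For a checkpoint already converted with `ct2-transformers-converter`, the glue code is small: transcribe with faster-whisper, then reshape its `Segment` objects into the dicts `whisperx.align` expects. A sketch under that assumption; the model path and helper name are placeholders, and a namedtuple stands in for faster-whisper's `Segment` so the example runs without a model download:

```python
from collections import namedtuple

# Stand-in for faster_whisper's Segment (it exposes .start, .end, .text);
# defined here only so this sketch runs without downloading a model.
Segment = namedtuple("Segment", ["start", "end", "text"])

def fw_segments_to_whisperx(segments):
    """Map faster-whisper Segment objects to whisperx-style segment dicts."""
    return [{"text": s.text, "start": s.start, "end": s.end} for s in segments]

# With a real converted fine-tuned model, it would look like:
#   from faster_whisper import WhisperModel
#   model = WhisperModel("path/to/converted-model", device="cuda")
#   segments, info = model.transcribe("audio.mp3")
#   aligned_input = fw_segments_to_whisperx(segments)

demo = [Segment(0.0, 1.8, " Hello."), Segment(1.8, 4.2, " World.")]
print(fw_segments_to_whisperx(demo))
```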

m-bain avatar May 01 '23 11:05 m-bain

@Vaibhavs10 I see the PR is merged already just FYI :)))

sabuhigr avatar Sep 13 '23 08:09 sabuhigr

How do I load a fine-tuned Whisper model after the model conversion? Can someone provide an example of how to do this?

imashoksundar avatar Mar 10 '24 14:03 imashoksundar