whisper.cpp
[Distil-Whisper] Add support for Distil-Whisper
Hey,
We've recently released two Distil-Whisper checkpoints:
- Large-v2-32-2 which is a 32-encoder layer, 2-decoder layer distilled large-v2 checkpoint
- Medium-24-2.en which is a 24-encoder layer, 2-decoder layer distilled medium.en checkpoint
On GPU, we achieve speed-ups of up to 6x compared to the teacher models, with relatively minimal degradation in performance. More information here: https://twitter.com/sanchitgandhi99/status/1719409022246220184
Using your conversion scripts, we've already converted the checkpoints to the .cpp format, see:
We'd love to collaborate on supporting the checkpoints in this repository, as we're really excited about the potential speed-ups that can be achieved with optimized C++ code.
It looks like some changes to whisper.cpp will be necessary to support this (e.g. we should probably define a new model type here?)
@ggerganov would you be interested in adding Distil-Whisper?
Linking for visibility: https://github.com/ggerganov/whisper.cpp/discussions/1414
Hi @patrickvonplaten - congrats on the release!
I believe I have successfully added initial support for the distilled models in the following PR: https://github.com/ggerganov/whisper.cpp/pull/1424
However, I'm worried that for optimal quality, AFAICT, these models require an alternative decoding strategy with overlapping chunks for long-form transcriptions. This could take more time to implement, and I am not sure yet how to fit it into the existing implementation.
Could you point me to the reference implementation?
I will give it some thought and see if I can come up with a solution in the following days. For the moment, #1424 should hopefully work as an initial version.
Hey @ggerganov,
The implementation we're using in Transformers actually uses overlapping chunks. We overlap each chunk by 2.5 seconds. Essentially, we follow the strategy described here: https://huggingface.co/blog/asr-chunking using a chunk length of 15 seconds and a chunk_stride of 2.5 seconds (the default).
It's all implemented here: https://github.com/huggingface/transformers/blob/ac5d4cf6de24b4f7fa92996e92d1d71dd5411a6a/src/transformers/pipelines/automatic_speech_recognition.py#L135 and the code to run inference for debugging should be this one: https://github.com/huggingface/distil-whisper/tree/main#long-form-transcription
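For concreteness, a minimal sketch of this chunked strategy using the Transformers pipeline might look like the following (the checkpoint name and the local sample.wav file are just illustrative assumptions):

```python
# Sketch of chunked long-form transcription with the Transformers ASR pipeline.
# Assumes: pip install transformers torch, and a local audio file "sample.wav".
import torch
from transformers import pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if torch.cuda.is_available() else torch.float32

pipe = pipeline(
    "automatic-speech-recognition",
    model="distil-whisper/distil-large-v2",
    torch_dtype=dtype,
    device=device,
)

# chunk_length_s=15 splits the audio into 15-second windows; the pipeline's
# stride defaults to chunk_length_s / 6 = 2.5 s of overlap on each side, and
# the decoded chunks are merged at the boundaries.
result = pipe("sample.wav", chunk_length_s=15, batch_size=16)
print(result["text"])
```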
The other option is to just use openai's codebase: https://github.com/openai/whisper using distil-whisper checkpoints converted into the original format: https://huggingface.co/distil-whisper/distil-large-v2/blob/main/original-model.fp32.bin
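And a rough sketch of that second option, loading the converted checkpoint directly with openai-whisper (the local file paths are assumptions; download original-model.fp32.bin from the distil-large-v2 repo first):

```python
# Sketch of running a converted distil-whisper checkpoint through openai/whisper.
# Assumes: pip install openai-whisper, ffmpeg available, and the converted
# checkpoint downloaded locally as "original-model.fp32.bin".
import whisper

model = whisper.load_model("original-model.fp32.bin")  # path to the converted checkpoint
result = model.transcribe("sample.wav")
print(result["text"])
```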
Does this help? I'm also working on adding OAI's approach natively to Transformers for easier debugging, but this might take until next week.
Thanks for the links. Will probably look into chunking after I make the v1.5.0 release of whisper.cpp.
i would like to weigh in from the "end user peanut gallery" that i believe the full implementation of the chunking for distil-whisper would be a major inflection point for the widespread adoption of whisper.cpp. qualitatively, the recent speed improvements were able to help products like MacWhisper get to a point where consumer hardware (M1) can now transcribe short audio faster than you can upload/transcribe/download via a cloud service like Otter or Happyscribe. if we can get the extra 5-6x from distil-whisper, then even hours-long transcriptions of meetings, podcasts, etc. could be transcribed in minutes to tens of minutes on consumer hardware, with respectable accuracy (medium or large).
of course everyone would rather transcribe locally for privacy and cost reasons. you have the power to make this practical. everyone will have their own private transcriptionist. we don't need another 10x to make this a UX inflection point; just another 5x will seriously change the game.
thank you for the important work that you do!
I haven't managed to run the conversion scripts myself (see #1711).
Is there any chance you could release additional versions, using the GGUF format with the recent quantization options?
any chance for this to also support https://huggingface.co/Aspik101/distil-whisper-large-v3-pl ?
I'd love to see this as well. The distil models run so much faster, but unfortunately, for anything longer than 10-20 seconds, they start cutting out words/phrases. I tested a distil model through regular Whisper here https://huggingface.co/spaces/distil-whisper/whisper-vs-distil-whisper with the same audio file, and it works nearly flawlessly. But for some reason, using it through whisper.cpp produces a large number of errors and words that are cut off or misspelled (I'm assuming it's because it's chunking oddly). Would love to see this fixed.
@patrickvonplaten, with the latest release of distil-large-v3, my understanding is that the distilled model is no longer exclusively tied to the chunked algorithm: https://huggingface.co/distil-whisper/distil-large-v3 https://huggingface.co/distil-whisper/distil-large-v3-ggml
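For reference, a rough sketch of running it without chunking in Transformers (sequential, OpenAI-style long-form decoding) could look like this, assuming a recent Transformers version with sequential long-form generation support and a local meeting.wav file purely as an example:

```python
# Sketch of sequential (non-chunked) long-form transcription with distil-large-v3.
import torch
from transformers import pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if torch.cuda.is_available() else torch.float32

pipe = pipeline(
    "automatic-speech-recognition",
    model="distil-whisper/distil-large-v3",
    torch_dtype=dtype,
    device=device,
)

# Without chunk_length_s, audio longer than 30 s is transcribed sequentially,
# window by window; return_timestamps=True is needed so the model can predict
# the timestamp tokens used to locate segment boundaries.
result = pipe("meeting.wav", return_timestamps=True)
print(result["text"])
```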
So maybe this ticket could be closed? I suppose it mainly remained open to address the chunking?