distil-whisper
Tiny model?
Hi, are there any plans to train a tiny distilled Whisper model? It would be very interesting to see how fast it would go, as I'd like to use it on phones.
Thanks for your interest! We'll start with the small model and work our way down!
Cool thanks, looking forward to it and seeing the results!
Feel free to follow along with the progress here: https://wandb.ai/sanchit-gandhi/distil-whisper?workspace=user-sanchit-gandhi
Still ongoing - we had some difficulties streaming data from the HF Hub over the past week. We're training a 2-layer and a 4-layer variant of the small model, and will then move on to the base model.
`distil-small.en` is released here: https://huggingface.co/distil-whisper/distil-small.en
It's quite hard to compress further than this without losing WER performance: https://huggingface.co/distil-whisper/distil-small.en#why-is-distil-smallen-slower-than-distil-large-v2
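For anyone who wants to try it straight away, here's a minimal inference sketch using the standard Transformers ASR pipeline; the `"audio.wav"` path is a placeholder for your own file:

```python
# Minimal sketch: short-form English transcription with distil-small.en via
# the Transformers ASR pipeline. "audio.wav" is a placeholder path.
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="distil-whisper/distil-small.en",
)
result = asr("audio.wav")
print(result["text"])
```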
That's great! It's still faster than the normal small.en, right? Also, will distilling base.en and/or tiny.en still be tried? Thanks.
> `distil-small.en` is released here: https://huggingface.co/distil-whisper/distil-small.en
> It's quite hard to compress further than this without losing WER performance: https://huggingface.co/distil-whisper/distil-small.en#why-is-distil-smallen-slower-than-distil-large-v2
Is there any way we can access the small 2-layer decoder variant?
Does it make sense to have a 2-layer decoder version for speculative decoding in combination with distil-large?
Yep - `distil-small.en` is about 2x faster than `small.en` on short-form evaluation. I personally won't try distilling `base.en` or `tiny.en`, since it's quite hard to retain performance for these smaller models, but I would encourage you to try by leveraging the training code if this is of interest to you!
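If anyone does want to attempt `base.en` or `tiny.en`, the core objective is, roughly, a weighted sum of cross-entropy on the teacher's pseudo-labels and a KL term between the student's and teacher's distributions. A minimal sketch below; the weights and temperature are illustrative placeholders, not the repository's actual hyperparameters:

```python
# Hedged sketch of a Whisper distillation objective: cross-entropy on the
# teacher's pseudo-labels plus KL divergence between the student's and
# teacher's token distributions. Weights/temperature are placeholders.
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, pseudo_labels,
                      ce_weight=1.0, kl_weight=1.0, temperature=2.0):
    vocab_size = student_logits.size(-1)
    # Cross-entropy against the teacher's pseudo-labelled transcript
    # (-100 marks padding positions to ignore).
    ce = F.cross_entropy(
        student_logits.view(-1, vocab_size),
        pseudo_labels.view(-1),
        ignore_index=-100,
    )
    # KL divergence between temperature-softened distributions.
    kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature**2
    return ce_weight * ce + kl_weight * kl
```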
> Is there any way we can access the small 2-layer decoder variant?

Yes, cf. https://huggingface.co/distil-whisper/distil-small.en
> Does it make sense to have a 2-layer decoder version for speculative decoding in combination with distil-large?

I would think not: we want our main model to be as accurate as possible so that speculative decoding gives the lowest possible WER (i.e. use the teacher Whisper large-v2 model). It doesn't really matter how fast the main model is, since we only do validation forward passes with it. The auto-regressive bottleneck is handled by the assistant model, so there's little gain from using a faster main model. So we should pick the most accurate main model, and an assistant model that is much faster and predicts the correct token ids 70-80% of the time.
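As a concrete illustration of that split between main and assistant model, here's a minimal sketch using Transformers' assisted generation (the `assistant_model` argument to `generate()`); the dummy LibriSpeech sample is just for demonstration:

```python
# Sketch: speculative decoding with whisper-large-v2 as the (accurate) main
# model and distil-large-v2 as the (fast) assistant/draft model.
import torch
from datasets import load_dataset
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained("openai/whisper-large-v2")
model = AutoModelForSpeechSeq2Seq.from_pretrained("openai/whisper-large-v2").to(device)
assistant = AutoModelForSpeechSeq2Seq.from_pretrained("distil-whisper/distil-large-v2").to(device)

# A dummy audio sample for demonstration.
sample = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean",
                      split="validation")[0]["audio"]
inputs = processor(sample["array"], sampling_rate=sample["sampling_rate"],
                   return_tensors="pt").to(device)

# The assistant drafts tokens auto-regressively; the main model verifies them
# in parallel forward passes, so output quality matches the main model.
generated = model.generate(**inputs, assistant_model=assistant)
print(processor.batch_decode(generated, skip_special_tokens=True)[0])
```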
> Is there any way we can access the small 2-layer decoder variant?
>
> Yes, cf. https://huggingface.co/distil-whisper/distil-small.en
@sanchit-gandhi From https://huggingface.co/distil-whisper/distil-small.en:
> While distil-medium.en and distil-large-v2 use two decoder layers each, distil-small.en uses four.
It sounds like the version there is the 4-layer version. Am I missing a way to get the 2-layer version from that?