distil-whisper
Tiny model?
Hi, are there any plans to train a tiny distilled Whisper model? It would be very interesting to see how fast it would go, as I'd like to use it on phones.
Thanks for your interest! We'll start with the small model and work our way down!
Cool thanks, looking forward to it and seeing the results!
Feel free to follow along with the progress here: https://wandb.ai/sanchit-gandhi/distil-whisper?workspace=user-sanchit-gandhi
Still ongoing - we had some difficulties streaming data from the HF Hub over the past week. We're training a 2-layer and a 4-layer variant of the small model, and will then move on to the base model.
`distil-small.en` is released here: https://huggingface.co/distil-whisper/distil-small.en
It's quite hard to compress further than this without losing WER performance: https://huggingface.co/distil-whisper/distil-small.en#why-is-distil-smallen-slower-than-distil-large-v2
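For anyone who wants to try it straight away, here's a minimal inference sketch using the standard Transformers ASR pipeline; the `"audio.wav"` path is a placeholder for your own file:

```python
# Minimal sketch: short-form English transcription with distil-small.en via
# the Transformers ASR pipeline. "audio.wav" is a placeholder path.
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="distil-whisper/distil-small.en",
)
result = asr("audio.wav")
print(result["text"])
```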
That's great! It's still faster than the normal small.en, right? Also, will distilling base.en and/or tiny.en still be tried? Thanks.
> `distil-small.en` is released here: https://huggingface.co/distil-whisper/distil-small.en
> It's quite hard to compress further than this without losing WER performance: https://huggingface.co/distil-whisper/distil-small.en#why-is-distil-smallen-slower-than-distil-large-v2
Is there any way we can access the small 2-layer decoder variant?
Does it make sense to have a 2-layer decoder version for speculative decoding in combination with distil-large?
Yep - `distil-small.en` is about 2x faster than `small.en` on short-form evaluation. I personally won't try distilling `base.en` or `tiny.en`, since it's quite hard to retain performance for these smaller models, but I would encourage you to try by leveraging the training code if this is of interest to you!
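If anyone does want to attempt `base.en` or `tiny.en`, the core objective is, roughly, a weighted sum of cross-entropy on the teacher's pseudo-labels and a KL term between the student's and teacher's distributions. A minimal sketch below; the weights and temperature are illustrative placeholders, not the repository's actual hyperparameters:

```python
# Hedged sketch of a Whisper distillation objective: cross-entropy on the
# teacher's pseudo-labels plus KL divergence between the student's and
# teacher's token distributions. Weights/temperature are placeholders.
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, pseudo_labels,
                      ce_weight=1.0, kl_weight=1.0, temperature=2.0):
    vocab_size = student_logits.size(-1)
    # Cross-entropy against the teacher's pseudo-labelled transcript
    # (-100 marks padding positions to ignore).
    ce = F.cross_entropy(
        student_logits.view(-1, vocab_size),
        pseudo_labels.view(-1),
        ignore_index=-100,
    )
    # KL divergence between temperature-softened distributions.
    kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature**2
    return ce_weight * ce + kl_weight * kl
```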
> Is there any way we can access the small 2-layer decoder variant?

Yes, cf. https://huggingface.co/distil-whisper/distil-small.en
> Does it make sense to have a 2-layer decoder version for speculative decoding in combination with distil-large?

I would think not: we want our main model to be as accurate as possible so that speculative decoding gives the lowest possible WER (i.e. use the teacher Whisper large-v2 model). It doesn't really matter how fast the main model is, since we only do validation forward passes with it. The auto-regressive bottleneck is handled by the assistant model, so there's little gain from using a faster main model. So we should pick the most accurate main model, and an assistant model that is much faster and predicts the correct token ids 70-80% of the time.
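As a concrete illustration of that split between main and assistant model, here's a minimal sketch using Transformers' assisted generation (the `assistant_model` argument to `generate()`); the dummy LibriSpeech sample is just for demonstration:

```python
# Sketch: speculative decoding with whisper-large-v2 as the (accurate) main
# model and distil-large-v2 as the (fast) assistant/draft model.
import torch
from datasets import load_dataset
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained("openai/whisper-large-v2")
model = AutoModelForSpeechSeq2Seq.from_pretrained("openai/whisper-large-v2").to(device)
assistant = AutoModelForSpeechSeq2Seq.from_pretrained("distil-whisper/distil-large-v2").to(device)

# A dummy audio sample for demonstration.
sample = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean",
                      split="validation")[0]["audio"]
inputs = processor(sample["array"], sampling_rate=sample["sampling_rate"],
                   return_tensors="pt").to(device)

# The assistant drafts tokens auto-regressively; the main model verifies them
# in parallel forward passes, so output quality matches the main model.
generated = model.generate(**inputs, assistant_model=assistant)
print(processor.batch_decode(generated, skip_special_tokens=True)[0])
```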
> Is there any way we can access the small 2-layer decoder variant?
>
> Yes, cf. https://huggingface.co/distil-whisper/distil-small.en
@sanchit-gandhi From https://huggingface.co/distil-whisper/distil-small.en:
> While distil-medium.en and distil-large-v2 use two decoder layers each, distil-small.en uses four.
It sounds like the version there is the 4-layer version. Am I missing a way to get the 2-layer version from that?