
Tiny model?

Open soupslurpr opened this issue 1 year ago • 12 comments

Hi, are there any plans to train a tiny distilled Whisper model? It would be very interesting to see how fast it would be, as I'd like to use it on phones.

soupslurpr avatar Nov 03 '23 09:11 soupslurpr

Thanks for your interest! We'll start with the small model and work our way down!

sanchit-gandhi avatar Nov 03 '23 14:11 sanchit-gandhi

Cool, thanks! Looking forward to it and to seeing the results!

soupslurpr avatar Nov 03 '23 17:11 soupslurpr

Feel free to follow along with the progress here: https://wandb.ai/sanchit-gandhi/distil-whisper?workspace=user-sanchit-gandhi

sanchit-gandhi avatar Nov 03 '23 18:11 sanchit-gandhi

Still ongoing - we had some difficulties streaming data from the HF Hub over the past week. We're training 2-layer and 4-layer variants of the small model! We'll then move on to the base model.

sanchit-gandhi avatar Nov 13 '23 14:11 sanchit-gandhi

distil-small.en is released here: https://huggingface.co/distil-whisper/distil-small.en

It's quite hard to compress further than this without losing WER performance: https://huggingface.co/distil-whisper/distil-small.en#why-is-distil-smallen-slower-than-distil-large-v2

sanchit-gandhi avatar Dec 07 '23 19:12 sanchit-gandhi
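For anyone who wants to try the released checkpoint right away, here is a minimal sketch using the transformers pipeline API; the model ID comes from the link above, and the audio file path is a placeholder rather than anything from this thread:

```python
# Minimal sketch: transcribe a local audio file with the released
# distil-small.en checkpoint. Requires `pip install transformers torch`;
# "audio.wav" is a placeholder path.
import torch
from transformers import pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"

pipe = pipeline(
    "automatic-speech-recognition",
    model="distil-whisper/distil-small.en",
    device=device,
)

result = pipe("audio.wav")  # swap in your own audio file
print(result["text"])
```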

That's great! It's still faster than the normal small.en, right? Also, will distilling base.en and/or tiny.en still be tried, or not? Thanks.

soupslurpr avatar Dec 08 '23 00:12 soupslurpr

distil-small.en is released here: https://huggingface.co/distil-whisper/distil-small.en

It's quite hard to compress further than this without losing WER performance: https://huggingface.co/distil-whisper/distil-small.en#why-is-distil-smallen-slower-than-distil-large-v2

Is there any way we can access the small 2-layer decoder variant?

mitchelldehaven avatar Dec 19 '23 00:12 mitchelldehaven

Does it make sense to have a 2-layer decoder version for speculative decoding in combination with distil-large?

hidoba avatar Dec 20 '23 16:12 hidoba

Yep - distil-small.en is about 2x faster than small.en on short-form evaluation. I personally won't try distilling base.en or tiny.en, since it's quite hard to retain performance for these smaller models, but would encourage you to try by leveraging the training code if this is of interest to you!

sanchit-gandhi avatar Jan 17 '24 15:01 sanchit-gandhi
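If you want to sanity-check that speed-up locally, here is a rough, hedged timing sketch; the audio path is a placeholder, and the exact ratio will depend on your hardware, audio length, and generation settings:

```python
# Sketch: compare wall-clock transcription time of small.en vs distil-small.en.
# "audio.wav" is a placeholder; install transformers + torch first.
import time
import torch
from transformers import pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"

def time_model(model_id: str, audio_path: str = "audio.wav") -> float:
    pipe = pipeline("automatic-speech-recognition", model=model_id, device=device)
    start = time.perf_counter()
    pipe(audio_path)
    return time.perf_counter() - start

t_small = time_model("openai/whisper-small.en")
t_distil = time_model("distil-whisper/distil-small.en")
print(f"small.en: {t_small:.2f}s, distil-small.en: {t_distil:.2f}s "
      f"(speed-up ~{t_small / t_distil:.1f}x)")
```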

Is there any way we can access the small 2-layer decoder variant?

Yes, see https://huggingface.co/distil-whisper/distil-small.en

sanchit-gandhi avatar Jan 17 '24 15:01 sanchit-gandhi

Does it make sense to have a 2-layer decoder version for speculative decoding in combination with distil-large?

I would think not: we want our main model to be as accurate as possible for speculative decoding to get the lowest possible WER results (i.e. use the teacher Whisper large-v2 model). It doesn't really matter how fast the main model is, since we only do verification forward passes with it. The auto-regressive bottleneck is handled by the assistant model, so there's little gain from using a faster main model. So we should pick the most accurate main model, and an assistant model that is much faster and predicts the correct token ids 70-80% of the time.

sanchit-gandhi avatar Jan 17 '24 15:01 sanchit-gandhi
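To make that concrete, below is a hedged sketch of assisted generation with transformers, pairing an accurate main model with a fast assistant. The checkpoint pairing (whisper-large-v2 + distil-large-v2) and the silent dummy audio are illustrative assumptions, not a prescription from this thread:

```python
# Sketch: speculative (assisted) decoding - accurate main model + fast assistant.
# Checkpoints and the dummy audio below are illustrative assumptions.
import torch
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq

device = "cuda:0" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained("openai/whisper-large-v2")
model = AutoModelForSpeechSeq2Seq.from_pretrained("openai/whisper-large-v2").to(device)
assistant = AutoModelForSpeechSeq2Seq.from_pretrained("distil-whisper/distil-large-v2").to(device)

# 5 seconds of silence at 16 kHz stands in for real audio.
dummy_audio = torch.zeros(16000 * 5)
inputs = processor(dummy_audio.numpy(), sampling_rate=16000, return_tensors="pt").to(device)

# The assistant drafts tokens; the main model only verifies them in parallel.
generated_ids = model.generate(inputs.input_features, assistant_model=assistant)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```

Because the main model verifies every drafted token, the transcription with greedy decoding matches what the main model would produce on its own; the assistant only changes the speed, which is the point made above about accuracy being set entirely by the main model.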

Is there any way we can access the small 2-layer decoder variant?

Yes, see https://huggingface.co/distil-whisper/distil-small.en

@sanchit-gandhi From https://huggingface.co/distil-whisper/distil-small.en:

While distil-medium.en and distil-large-v2 use two decoder layers each, distil-small.en uses four.

It sounds like the version there is the 4-layer version. Am I missing a way to get the 2-layer version from that?

mitchelldehaven avatar Jan 17 '24 20:01 mitchelldehaven
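For reference, a quick way to check the decoder depth of any published checkpoint without downloading the weights is to read its config; a small sketch assuming the transformers library:

```python
# Sketch: inspect decoder depth of released checkpoints from the config alone.
from transformers import AutoConfig

for model_id in ["distil-whisper/distil-small.en", "distil-whisper/distil-large-v2"]:
    config = AutoConfig.from_pretrained(model_id)
    print(model_id, "->", config.decoder_layers, "decoder layers")
```

For distil-whisper/distil-small.en this should report four decoder layers, consistent with the model-card quote above.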