Fix previous text prepending
Hi 👋,
Thank you for continuously adding more features to the Whisper distillation code!
While reviewing the section that prepends previous text during training-data preparation, I made the following adjustments based on my interpretation:
- Moved the prepending of `decoder_prev_token_id` to the end to ensure it is always triggered, even when `prev_ids` aren't cut by the previous two conditions
- Updated the total length check to `len(prev_ids + token_ids) + 1`, which now includes `decoder_prev_token_id` since it is always added
- Removed `prev_ids` from the `trim_length` calculation. For instance, with 3 `prev_ids`, 3 `token_ids`, and a `max_label_length` of 6, we should retain only the last 2 tokens in `prev_ids`, calculated as `max_label_length - len(token_ids) - 1 = 6 - 3 - 1 = 2` (see the sketch after this list)
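
For clarity, here is a minimal sketch of how I read the adjusted flow, collapsed into a standalone function. The function name, the placeholder token id, and the example values are mine for illustration only; `prev_ids`, `token_ids`, `decoder_prev_token_id`, `max_label_length`, and `trim_length` follow the names in the existing code, and the other cutoff condition from the original preprocessing is omitted here:

```python
def prepend_prev_text(prev_ids, token_ids, decoder_prev_token_id, max_label_length):
    # If the prompt plus the current tokens plus decoder_prev_token_id would
    # exceed max_label_length, keep only the last tokens of prev_ids that fit.
    # The +1 accounts for decoder_prev_token_id, which is now always prepended.
    if len(prev_ids + token_ids) + 1 > max_label_length:
        trim_length = max_label_length - len(token_ids) - 1
        prev_ids = prev_ids[-trim_length:] if trim_length > 0 else []

    # Prepend decoder_prev_token_id last, so it is applied even when prev_ids
    # were not cut by the earlier conditions.
    return [decoder_prev_token_id] + prev_ids + token_ids


# Worked example from above: 3 prev_ids, 3 token_ids, max_label_length = 6
labels = prepend_prev_text(
    [11, 12, 13], [21, 22, 23],
    decoder_prev_token_id=50361,  # placeholder id, for illustration only
    max_label_length=6,
)
assert labels == [50361, 12, 13, 21, 22, 23]  # 6 tokens total
```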