Fix previous text prepending
Hi 👋,
Thank you for continuously adding more features to the Whisper distillation code!
While reviewing the section that prepends previous text during training-data preparation, I made the following adjustments based on my interpretation:
- Moved the prepending of `decoder_prev_token_id` to the end to ensure it is always triggered, even when `prev_ids` aren't cut by the previous two conditions
- Updated the total length check to `len(prev_ids + token_ids) + 1`, which now includes `decoder_prev_token_id` since it is always added
- Removed `prev_ids` from the `trim_length` calculation. For instance, with 3 `prev_ids`, 3 `token_ids`, and a `max_label_length` of 6, we should retain only the last 2 tokens in `prev_ids`, calculated as `max_label_length - len(token_ids) - 1 = 6 - 3 - 1 = 2` (see the sketch after this list)
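
For clarity, here is a minimal sketch of how I read the adjusted flow, collapsed into a standalone function. The function name, the placeholder token id, and the example values are mine for illustration only; `prev_ids`, `token_ids`, `decoder_prev_token_id`, `max_label_length`, and `trim_length` follow the names in the existing code, and the other cutoff condition from the original preprocessing is omitted here:

```python
def prepend_prev_text(prev_ids, token_ids, decoder_prev_token_id, max_label_length):
    # If the prompt plus the current tokens plus decoder_prev_token_id would
    # exceed max_label_length, keep only the last tokens of prev_ids that fit.
    # The +1 accounts for decoder_prev_token_id, which is now always prepended.
    if len(prev_ids + token_ids) + 1 > max_label_length:
        trim_length = max_label_length - len(token_ids) - 1
        prev_ids = prev_ids[-trim_length:] if trim_length > 0 else []

    # Prepend decoder_prev_token_id last, so it is applied even when prev_ids
    # were not cut by the earlier conditions.
    return [decoder_prev_token_id] + prev_ids + token_ids


# Worked example from above: 3 prev_ids, 3 token_ids, max_label_length = 6
labels = prepend_prev_text(
    [11, 12, 13], [21, 22, 23],
    decoder_prev_token_id=50361,  # placeholder id, for illustration only
    max_label_length=6,
)
assert labels == [50361, 12, 13, 21, 22, 23]  # 6 tokens total
```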