distil-whisper

Fix previous text prepending

Open bofenghuang opened this issue 7 months ago • 0 comments

Hi 👋,

Thank you for continuously adding more features to the Whisper distillation code!

As I reviewed the section on prepending previous text during the preparation of training data, I made the following adjustments based on my interpretation:

  1. Moved the prepending of `decoder_prev_token_id` to the end so that it is always applied, even when `prev_ids` is not trimmed by either of the two preceding conditions.
  2. Updated the total length check to `len(prev_ids + token_ids) + 1`. The extra 1 accounts for `decoder_prev_token_id`, which is now always added.
  3. Removed `prev_ids` from the `trim_length` calculation. For instance, with 3 `prev_ids`, 3 `token_ids`, and a `max_label_length` of 6, we should retain only the last 2 tokens of `prev_ids`, calculated as `max_label_length - len(token_ids) - 1 = 6 - 3 - 1 = 2`. The final sequence then has 1 + 2 + 3 = 6 tokens, which exactly fits `max_label_length`.
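
To make the three adjustments above concrete, here is a minimal sketch of the logic as I understand it. It is not the repository's code: the helper name `prepend_prev_ids`, its signature, and the token id values are made up for illustration, and it covers only the total length check (the other trimming condition is left out). The guard for a non-positive `trim_length` is also my own addition, since `prev_ids[-0:]` would otherwise return the whole list.

```python
def prepend_prev_ids(prev_ids, token_ids, decoder_prev_token_id, max_label_length):
    """Hypothetical helper illustrating the three adjustments described above."""
    # Adjustment 2: the length check counts the extra slot for decoder_prev_token_id.
    if len(prev_ids + token_ids) + 1 > max_label_length:
        # Adjustment 3: the number of prompt tokens to keep depends only on
        # token_ids and the one slot reserved for decoder_prev_token_id.
        trim_length = max_label_length - len(token_ids) - 1
        # Assumed guard: drop the prompt entirely when nothing fits.
        prev_ids = prev_ids[-trim_length:] if trim_length > 0 else []
    # Adjustment 1: prepend last, so it happens whether or not prev_ids was trimmed.
    return [decoder_prev_token_id] + prev_ids + token_ids


# Example from point 3: 3 prev_ids, 3 token_ids, max_label_length of 6.
# The ids (99, 11..13, 21..23) are placeholders, not real Whisper token ids.
labels = prepend_prev_ids([11, 12, 13], [21, 22, 23], 99, 6)
print(labels)  # [99, 12, 13, 21, 22, 23]
```

In the example, the first prompt token is dropped and the result has exactly `max_label_length` tokens. When the combined length already fits, `prev_ids` is left untouched and `decoder_prev_token_id` is still prepended.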

bofenghuang, Jul 09 '24 11:07