WhisperKit
Reduce redundant decoder forward passes by leveraging word-level timestamps
The goal is to leverage the high-quality word-level timestamps added in #38 as anchors to reliably seek the audio buffer forward at a higher frequency compared to current behavior:
- Current behavior is to seek the audio forward if `<|endoftext|>` is generated or `max_tokens` tokens are generated.
- Current behavior results in wasteful compute because each text token is re-decoded until the audio seeks beyond them.
- In the worst case, a token is redundantly decoded up to 29 times, given a 1-second audio refresh rate and Whisper's 30-second audio window.
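
The worst-case figure above can be checked with a small simulation. This is only a sketch under a simplified model of streaming: the buffer grows by one refresh interval per step, every step re-decodes from the current seek point, and the seek point only moves once the full window is buffered. The function names (`passes_over_first_chunk`, `redundant_passes`, `seek_to_last_word`) and the `keep_last` parameter are hypothetical and are not part of WhisperKit's API.

```python
from typing import Sequence


def passes_over_first_chunk(window_s: int, refresh_s: int) -> int:
    """Count decoder passes that cover the first chunk of audio when the
    seek point only advances once the full window has been buffered."""
    passes = 0
    buffered = 0
    while buffered < window_s:
        buffered += refresh_s  # new audio arrives
        passes += 1            # the buffer is decoded again from the seek point
    return passes


def redundant_passes(window_s: int, refresh_s: int) -> int:
    """The first pass is necessary; every later pass repeats the same tokens."""
    return passes_over_first_chunk(window_s, refresh_s) - 1


def seek_to_last_word(word_end_times: Sequence[float], keep_last: int = 1) -> float:
    """Hypothetical anchor rule: return the end time (seconds, relative to the
    current window start) of the last word treated as final. The trailing
    `keep_last` words are held back because they may still change once more
    audio arrives. Returns 0.0 (no seek) if too few words are available."""
    if len(word_end_times) <= keep_last:
        return 0.0
    return word_end_times[-(keep_last + 1)]


if __name__ == "__main__":
    print(redundant_passes(window_s=30, refresh_s=1))      # 29
    print(seek_to_last_word([0.4, 0.9, 1.3], keep_last=1))  # 0.9
```

Seeking to a word boundary after each pass, rather than waiting for `<|endoftext|>` or `max_tokens`, would bound the number of times any token is re-decoded by how many words are held back instead of by the window length.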