WhisperKit

Reduce redundant decoder forward passes by leveraging word-level timestamps

Open atiorh opened this issue 11 months ago • 0 comments

The goal is to use the high-quality word-level timestamps added in #38 as anchors to reliably seek the audio buffer forward more often than the current behavior does:

  • Currently, the audio seeks forward only when <|endoftext|> is generated or max_tokens tokens have been generated.
  • This wastes compute because each text token is re-decoded on every pass until the audio seeks beyond it.
  • In the worst case, with a 1-second audio refresh rate and Whisper's 30-second audio window, the same token is decoded redundantly up to 29 times.
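
To make the numbers above concrete, here is a minimal Python sketch. It is not WhisperKit code: `Word`, `worst_case_redundant_passes`, `next_seek`, and the `margin_s` safety margin are all hypothetical names and assumptions introduced for illustration. The first function reproduces the worst-case count from the list above; the second shows one possible way to pick a new seek position from word-level timestamps.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """A decoded word with its timestamps in seconds (hypothetical type)."""
    text: str
    start: float
    end: float


def worst_case_redundant_passes(window_s: float, refresh_s: float) -> int:
    """Redundant decodes of a token spoken at the very start of the window.

    The token is decoded once per refresh until the window fills up and the
    audio finally seeks past it; every pass after the first is redundant.
    """
    total_passes = int(window_s // refresh_s)
    return max(total_passes - 1, 0)


def next_seek(words, current_seek: float, buffered_end: float,
              margin_s: float = 1.0) -> float:
    """Pick a new seek position anchored on a word boundary.

    Assumption: a word is treated as settled once it ends at least
    `margin_s` seconds before the end of the buffered audio. The seek
    never moves backwards.
    """
    settled_ends = [w.end for w in words if w.end <= buffered_end - margin_s]
    return max(max(settled_ends, default=current_seek), current_seek)


words = [Word("hello", 0.0, 0.4), Word("world", 0.5, 0.9), Word("again", 2.6, 2.9)]
print(worst_case_redundant_passes(window_s=30, refresh_s=1))
print(next_seek(words, current_seek=0.0, buffered_end=3.0))
```

With a policy like this, words that end before the new seek position are no longer re-decoded on later refreshes. The margin trades latency for stability: a larger value re-decodes more text but is less likely to cut off a word whose timestamp is still changing.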

atiorh, Mar 07 '24 21:03