WhisperKit Reduce redundant decoder forward passes by leveraging word-level timestamps

Open atiorh opened this issue 11 months ago • 0 comments

The goal is to use the high-quality word-level timestamps added in #38 as anchors to reliably seek the audio buffer forward more frequently than the current behavior does:

The current behavior is to seek the audio forward when <|endoftext|> is generated or when max_tokens tokens have been generated.
This wastes compute because each text token is re-decoded on every refresh until the audio seeks past it.
In the worst case, with Whisper's 30-second audio window and a 1-second audio refresh rate, a token is re-decoded up to 29 redundant times.
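
To make the arithmetic concrete, here is a rough Python sketch (not WhisperKit code) comparing the two seek strategies on a toy stream. Everything in it is an assumption for illustration: the `WordTiming` class, the `next_seek` and `simulate` helpers, the `holdback_s` margin, and the simplification that one decoder pass is counted per word rather than per token. The baseline models only the worst case, where no <|endoftext|> arrives and the seek advances once the window is full.

```python
from dataclasses import dataclass


@dataclass
class WordTiming:
    # Hypothetical stand-in for a word-level timestamp (times in seconds).
    word: str
    start: float
    end: float


def worst_case_redundant_passes(window_s: float, refresh_s: float) -> int:
    # Audio at the start of the window is decoded once per refresh until the
    # window fills; every decode after the first one is redundant.
    return int(window_s / refresh_s) - 1


def next_seek(words, buffer_end, holdback_s=1.0):
    # Anchor on the end of the last word that finished at least `holdback_s`
    # before the end of the buffer (assumed to be stable). None if no such word.
    stable = [w.end for w in words if w.end <= buffer_end - holdback_s]
    return max(stable) if stable else None


def simulate(words, refresh_s=1.0, window_s=30.0,
             use_word_anchors=False, holdback_s=1.0):
    # Count word decodes (a proxy for decoder forward passes) over the stream.
    duration = max(w.end for w in words)
    seek, t, step, total = 0.0, 0.0, 0, 0
    while t < duration:
        step += 1
        t = min(step * refresh_s, duration)
        # Words fully inside the current buffer [seek, t] get decoded again.
        visible = [w for w in words if w.start >= seek and w.end <= t]
        total += len(visible)
        if use_word_anchors:
            anchor = next_seek(visible, t, holdback_s)
            if anchor is not None:
                seek = anchor  # seek forward on every refresh when possible
        elif t - seek >= window_s:
            seek = t  # baseline: seek only once the window is full
    return total


# Toy stream: 60 back-to-back words of 0.5 s each (30 s of audio).
words = [WordTiming(f"w{i}", 0.5 * i, 0.5 * (i + 1)) for i in range(60)]

print(worst_case_redundant_passes(30, 1))
print(simulate(words))
print(simulate(words, use_word_anchors=True))
```

In this toy setup the baseline performs 930 word decodes for 60 words, while anchoring on word end times performs 118. The exact ratio depends on the assumed holdback margin and word rate, so treat the numbers as an illustration of the trend rather than a benchmark.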

Mar 07 '24 21:03 atiorh
