
VAD audio chunking

Open jkrukowski opened this issue 9 months ago • 5 comments

This PR introduces audio chunking with VAD (voice activity detection). The VAD detects speech segments in the audio file, and the audio is then split into chunks along those segments, with each chunk padded with zeros to match the 30-second length. The chunks are then processed in a batch, which results in a significant speedup.
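
To make the chunking idea concrete, here is a minimal sketch in Python. It is not the code from this PR: the energy-threshold detector, the function names, the frame size, and the greedy merging of segments are all assumptions chosen for illustration. It only shows the general shape of the approach: find speech segments, group them into chunks no longer than 30 seconds, and zero-pad each chunk.

```python
SAMPLE_RATE = 16_000                      # assumption: 16 kHz mono input
WINDOW_SAMPLES = SAMPLE_RATE * 30         # the 30-second length from the description


def frame_is_speech(frame, threshold):
    """Hypothetical energy-based VAD: mean squared amplitude above a threshold."""
    if not frame:
        return False
    energy = sum(s * s for s in frame) / len(frame)
    return energy > threshold


def detect_speech_segments(samples, frame_size=1600, threshold=1e-4):
    """Return (start, end) sample indices for each run of consecutive speech frames."""
    segments = []
    start = None
    for offset in range(0, len(samples), frame_size):
        frame = samples[offset:offset + frame_size]
        if frame_is_speech(frame, threshold):
            if start is None:
                start = offset            # a new speech run begins here
        elif start is not None:
            segments.append((start, offset))
            start = None
    if start is not None:                 # audio ended while still in speech
        segments.append((start, len(samples)))
    return segments


def chunk_and_pad(samples, segments, window=WINDOW_SAMPLES):
    """Greedily group segments into chunks of at most `window` samples, then zero-pad."""
    bounds = []
    current = None                        # (start, end) of the chunk being built
    for seg_start, seg_end in segments:
        # A segment longer than one window is split into window-sized pieces.
        for start in range(seg_start, seg_end, window):
            end = min(start + window, seg_end)
            if current is not None and end - current[0] <= window:
                current = (current[0], end)   # extend across the silence gap
            else:
                if current is not None:
                    bounds.append(current)
                current = (start, end)
    if current is not None:
        bounds.append(current)
    # Pad every chunk with zeros so all chunks have exactly `window` samples.
    return [list(samples[s:e]) + [0.0] * (window - (e - s)) for s, e in bounds]
```

Because every chunk ends up with the same length, the chunks can be stacked and passed through the model together, which is where the batching speedup would come from.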

Some benchmarks (on my MacBook Air M1):

Audio file 12:16 length

  • with VAD:
38.16s user 5.86s system 470% cpu 9.349 total
  • without VAD:
33.25s user 3.55s system 132% cpu 27.678 total

Audio file 40:26 length

  • with VAD:
126.54s user 18.41s system 500% cpu 28.952 total
  • without VAD:
96.55s user 10.47s system 133% cpu 1:20.08 total
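
For reference, the wall-clock speedup implied by these numbers can be worked out with a few lines of Python. The helper name `to_seconds` is made up for this sketch; the timings are copied from the runs above.

```python
def to_seconds(wall_clock):
    """Convert a `time` total such as '9.349' or '1:20.08' into seconds."""
    seconds = 0.0
    for part in wall_clock.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


# (total with VAD, total without VAD), taken from the benchmark output above
runs = {
    "12:16 audio": ("9.349", "27.678"),
    "40:26 audio": ("28.952", "1:20.08"),
}

for name, (with_vad, without_vad) in runs.items():
    speedup = to_seconds(without_vad) / to_seconds(with_vad)
    print(f"{name}: {speedup:.2f}x faster wall-clock with VAD")
# prints 2.96x for the 12:16 file and 2.77x for the 40:26 file
```

Total CPU time (user + system) is higher with VAD in both runs, while CPU utilization rises from about 130% to roughly 470-500%. That is what one would expect if the gain comes from processing chunks in parallel rather than from doing less work.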

To use it in WhisperKitCLI, pass the --chunking-strategy vad flag:

swift run -c release whisperkit-cli transcribe --audio-path /path/to/audio.wav --chunking-strategy vad

jkrukowski, May 06 '24 16:05