Add faster-whisper (ctranslate2) as option for Whisper annotation workflow
This PR adds a second Whisper annotation workflow that uses faster-whisper, a CTranslate2-based reimplementation of Whisper (see https://github.com/entn-at/lhotse/tree/feature/whisper-ctranslate2). It's a lot faster and uses far less memory.
This implementation also obtains word start and end times. I'm still investigating whether they are generally accurate enough to be used as alignments.
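For reference, this is roughly how word timings surface through the faster-whisper Python API; the model name, audio path, and decoding options below are illustrative and not necessarily what the workflow uses:

```python
from faster_whisper import WhisperModel

# Load a CTranslate2-converted Whisper model; float16 keeps GPU memory usage low.
model = WhisperModel("large-v2", device="cuda", compute_type="float16")

# word_timestamps=True attaches per-word start/end times to each segment
# (derived from cross-attention patterns and dynamic time warping).
segments, info = model.transcribe("audio.wav", language="en", word_timestamps=True)

for segment in segments:
    for word in segment.words:
        print(f"{word.start:.2f}\t{word.end:.2f}\t{word.word}")
```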
Would it be possible to combine whisper and faster-whisper into a single CLI/method and add faster-whisper as an optional flag (disabled by default)? The internals of the function calls can be kept separate, but from a user's perspective a single entry point makes more sense since they provide the same functionality. I'm thinking of it as two backends behind the same user-facing wrapper.
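A minimal sketch of that dispatch idea, with hypothetical function names (the actual lhotse wrapper and its arguments may differ):

```python
def _annotate_with_openai_whisper(cuts, model_name, device, **opts):
    ...  # existing OpenAI Whisper path (unchanged)


def _annotate_with_faster_whisper(cuts, model_name, device, **opts):
    ...  # new CTranslate2-backed path


def annotate_with_whisper(cuts, model_name="base", device="cpu", faster_whisper=False, **opts):
    """Single user-facing entry point; the backend is selected by a flag."""
    backend = _annotate_with_faster_whisper if faster_whisper else _annotate_with_openai_whisper
    return backend(cuts, model_name, device, **opts)
```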
Thanks for the quick initial review! I combined whisper and faster-whisper into a single CLI/method with a `--faster-whisper` flag. I also added additional off-by-default feature flags for faster-whisper (a rough sketch of how they map onto the faster-whisper API follows the list):

- `--faster-whisper-add-alignments`: Whether to use faster-whisper's built-in method for obtaining word alignments (using cross-attention patterns and dynamic time warping; generally not as accurate as forced alignment).
- `--faster-whisper-use-vad`: Whether to apply speech activity detection (Silero VAD) before Whisper to reduce repetitions/spurious transcriptions (often referred to as "hallucinations").
- `--faster-whisper-num-workers`: Number of workers for parallelization across multiple GPUs.
Quick benchmark on mini-librispeech dev-clean-2:

OpenAI Whisper, RTX 2080 Ti:

```
$ time lhotse workflows annotate-with-whisper -n large-v2 -l en -m librispeech_recordings_dev-clean-2.jsonl.gz --device "cuda" librispeech_cuts_dev-clean-2.jsonl.gz

real    44m31.647s
user    46m5.540s
sys     0m10.869s
```

faster-whisper/CTranslate2, float16, on RTX 2080 Ti:

```
$ time lhotse workflows annotate-with-whisper --faster-whisper -n large-v2 -l en -m librispeech_recordings_dev-clean-2.jsonl.gz --device "cuda" librispeech_cuts_dev-clean-2.jsonl.gz

real    18m15.743s
user    34m47.594s
sys     30m18.775s
```

faster-whisper allows parallelization across multiple GPUs (a sketch of the multi-GPU setup follows the timings). With `--faster-whisper-num-workers 4` on 4x RTX 2080 Ti:

```
$ time lhotse workflows annotate-with-whisper --faster-whisper -n large-v2 --faster-whisper-num-workers 4 -l en -m librispeech_recordings_dev-clean-2.jsonl.gz --device "cuda" librispeech_cuts_dev-clean-2.jsonl.gz

real    6m34.545s
user    35m50.779s
sys     25m48.421s
```
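For context, faster-whisper's multi-GPU support boils down to constructing the model over several device indices with multiple workers; a minimal sketch with illustrative device indices and worker count:

```python
from faster_whisper import WhisperModel

# One model spread across four GPUs; with num_workers > 1, concurrent
# transcribe() calls (e.g. from a thread pool) are served in parallel.
model = WhisperModel(
    "large-v2",
    device="cuda",
    device_index=[0, 1, 2, 3],
    compute_type="float16",
    num_workers=4,
)
```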
~~The only incompatibility with the current Whisper method is that faster-whisper doesn't expose a way to set the download location for the models. I submitted a PR to faster-whisper; once that's merged and published in a new version, the currently commented-out line 116 in faster_whisper.py can be changed to enable it.~~ The PR to faster-whisper has since been merged.
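With that change released, the download location can presumably just be passed through; a one-line sketch assuming the parameter is named `download_root` as in recent faster-whisper releases (the path is illustrative):

```python
from faster_whisper import WhisperModel

# download_root controls where the converted model files are downloaded/stored.
model = WhisperModel("large-v2", device="cuda", download_root="/path/to/models")
```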
I quickly compared the results of the old and new Whisper implementations on a 60s clip from AMI. In that clip, I noticed that faster-whisper tends to skip short, isolated, and noisy utterances such as "Okay" or "Thank you", probably due to VAD (which is OK, I guess). However, the time boundaries seem off compared to the original implementation; please see the screenshot. Do you think it's possible to fix that? Maybe more accurate information is exposed somewhere in faster-whisper and it's just not being used here? Otherwise there's a lot of silence/non-speech included in the supervisions.
Note: the top plot is from the original Whisper, and the bottom plot is from faster-whisper.

Sorry for the delay, I've been quite busy. I'll pick this up shortly and address the requested changes.
@entn-at any updates on this?