Add faster-whisper (ctranslate2) as option for Whisper annotation workflow
This PR adds a second Whisper annotation workflow that uses faster-whisper, a CTranslate2-based reimplementation of Whisper (see https://github.com/entn-at/lhotse/tree/feature/whisper-ctranslate2). It's a lot faster and uses far less memory.
This implementation also obtains word start and end times. I'm still investigating whether they are generally accurate enough to be used as alignments.
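For reference, this is roughly how word timings surface through the faster-whisper Python API; the model name, audio path, and decoding options below are illustrative and not necessarily what the workflow uses:

```python
from faster_whisper import WhisperModel

# Load a CTranslate2-converted Whisper model; float16 keeps GPU memory usage low.
model = WhisperModel("large-v2", device="cuda", compute_type="float16")

# word_timestamps=True attaches per-word start/end times to each segment
# (derived from cross-attention patterns and dynamic time warping).
segments, info = model.transcribe("audio.wav", language="en", word_timestamps=True)

for segment in segments:
    for word in segment.words:
        print(f"{word.start:.2f}\t{word.end:.2f}\t{word.word}")
```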
Would it be possible to combine whisper and faster-whisper into a single CLI/method and add faster-whisper as an optional flag (disabled by default)? The internals of the function calls can be kept separate, but from a user's perspective a single entry point makes more sense since they provide the same functionality. I'm thinking of it as two backends behind the same user-facing wrapper.
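A minimal sketch of that dispatch idea, with hypothetical function names (the actual lhotse wrapper and its arguments may differ):

```python
def _annotate_with_openai_whisper(cuts, model_name, device, **opts):
    ...  # existing OpenAI Whisper path (unchanged)


def _annotate_with_faster_whisper(cuts, model_name, device, **opts):
    ...  # new CTranslate2-backed path


def annotate_with_whisper(cuts, model_name="base", device="cpu", faster_whisper=False, **opts):
    """Single user-facing entry point; the backend is selected by a flag."""
    backend = _annotate_with_faster_whisper if faster_whisper else _annotate_with_openai_whisper
    return backend(cuts, model_name, device, **opts)
```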
Thanks for the quick initial review! I combined whisper and faster-whisper into a single CLI/method with a `--faster-whisper` flag. I also added additional off-by-default feature flags for faster-whisper (a rough sketch of how they map onto the faster-whisper API follows the list):

- `--faster-whisper-add-alignments`: Whether to use faster-whisper's built-in method for obtaining word alignments (using cross-attention patterns and dynamic time warping; generally not as accurate as forced alignment).
- `--faster-whisper-use-vad`: Whether to apply speech activity detection (Silero VAD) before Whisper to reduce repetitions/spurious transcriptions (often referred to as "hallucinations").
- `--faster-whisper-num-workers`: Number of workers for parallelization across multiple GPUs.
Quick benchmark on mini-librispeech dev-clean-2:

OpenAI Whisper, RTX 2080 Ti:

```
$ time lhotse workflows annotate-with-whisper -n large-v2 -l en -m librispeech_recordings_dev-clean-2.jsonl.gz --device "cuda" librispeech_cuts_dev-clean-2.jsonl.gz

real    44m31.647s
user    46m5.540s
sys     0m10.869s
```

faster-whisper/CTranslate2, float16, on RTX 2080 Ti:

```
$ time lhotse workflows annotate-with-whisper --faster-whisper -n large-v2 -l en -m librispeech_recordings_dev-clean-2.jsonl.gz --device "cuda" librispeech_cuts_dev-clean-2.jsonl.gz

real    18m15.743s
user    34m47.594s
sys     30m18.775s
```

faster-whisper allows parallelization across multiple GPUs (a sketch of the multi-GPU setup follows the timings). With `--faster-whisper-num-workers 4` on 4x RTX 2080 Ti:

```
$ time lhotse workflows annotate-with-whisper --faster-whisper -n large-v2 --faster-whisper-num-workers 4 -l en -m librispeech_recordings_dev-clean-2.jsonl.gz --device "cuda" librispeech_cuts_dev-clean-2.jsonl.gz

real    6m34.545s
user    35m50.779s
sys     25m48.421s
```
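For context, faster-whisper's multi-GPU support boils down to constructing the model over several device indices with multiple workers; a minimal sketch with illustrative device indices and worker count:

```python
from faster_whisper import WhisperModel

# One model spread across four GPUs; with num_workers > 1, concurrent
# transcribe() calls (e.g. from a thread pool) are served in parallel.
model = WhisperModel(
    "large-v2",
    device="cuda",
    device_index=[0, 1, 2, 3],
    compute_type="float16",
    num_workers=4,
)
```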
~~The only incompatibility with the current Whisper method is that faster-whisper doesn't expose a way to set the download location for the models. I submitted a PR to faster-whisper; once that's merged and published in a new version, the currently commented-out line 116 in faster_whisper.py can be changed to enable it.~~ The PR to faster-whisper has since been merged.
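With that change released, the download location can presumably just be passed through; a one-line sketch assuming the parameter is named `download_root` as in recent faster-whisper releases (the path is illustrative):

```python
from faster_whisper import WhisperModel

# download_root controls where the converted model files are downloaded/stored.
model = WhisperModel("large-v2", device="cuda", download_root="/path/to/models")
```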
I quickly compared the results of the old and new Whisper implementations on a 60s clip from AMI. In that clip, I noticed that faster-whisper tends to skip short, isolated, and noisy utterances such as "Okay" or "Thank you", probably due to VAD (which is OK, I guess). However, the time boundaries seem off compared to the original implementation; please see the screenshot. Do you think it's possible to fix that? Maybe more accurate information is exposed somewhere in faster-whisper and it's just not being used here? Otherwise there's a lot of silence/non-speech included in the supervisions.
Note: the top plot is from the original Whisper, and the bottom plot is from faster-whisper.

Sorry for the delay, I've been quite busy. I'll pick this up shortly and address the requested changes.
@entn-at any updates on this?