faster-whisper
Benchmark faster whisper turbo v3
#WIP
Benchmark with faster-whisper-large-v3-turbo-ct2
For reference, here are the time and memory usage required to transcribe 13 minutes of audio using different implementations:
Large-v3 model on GPU
| Implementation | Precision | Beam size | Time | Max. GPU memory | Max. CPU memory | WER % |
|---|---|---|---|---|---|---|
| openai/whisper-large-v3 | fp16 | 5 | 2m23s | MB | MB | |
| openai/whisper-turbo | fp16 | 5 | 39s | MB | MB | |
| faster-whisper | fp16 | 5 | 52.023s | 4521MB | 901MB | 2.883 |
| faster-whisper | int8 | 5 | 52.639s | 2953MB | 2261MB | 4.594 |
| faster-distil-large-v3 | fp16 | 5 | 26.126s | 2409MB | 900MB | 2.392 |
| faster-distil-large-v3 | int8 | 5 | 22.537s | 1481MB | 1468MB | 2.392 |
| faster-large-v3-turbo | fp16 | 5 | 19.155s | 2537MB | 899MB | 1.919 |
| faster-large-v3-turbo | int8 | 5 | 19.591s | 1545MB | 1526MB | 1.919 |
WER measured on the LibriSpeech clean validation split.
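For anyone wanting to reproduce the time column, here is a minimal sketch of how such a run could be measured. It assumes the deepdml CT2 conversion used above, a placeholder audio path, and pynvml for GPU memory (which reports device-wide usage at a single point in time, not a true peak); it is not the exact benchmark script.

```python
# Rough sketch: time one transcription and read GPU memory usage afterwards.
import time

import pynvml
from faster_whisper import WhisperModel

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

model = WhisperModel("deepdml/faster-whisper-large-v3-turbo-ct2",
                     device="cuda", compute_type="float16")

start = time.perf_counter()
segments, info = model.transcribe("audio_13min.wav", beam_size=5)
text = "".join(seg.text for seg in segments)  # segments is a generator; consume it so decoding actually runs
elapsed = time.perf_counter() - start

used_mb = pynvml.nvmlDeviceGetMemoryInfo(handle).used / 1024**2
print(f"time: {elapsed:.1f}s, GPU memory in use: {used_mb:.0f} MB")
```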
We now support the new whisper-large-v3-turbo on Sieve!
Use it via sieve/speech_transcriber: https://www.sievedata.com/functions/sieve/speech_transcriber
Use sieve/whisper directly: https://www.sievedata.com/functions/sieve/whisper
Just set speed_boost to True. The API guide is under the "Usage Guide" tab.
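For illustration only, a call might look roughly like the sketch below. The client pattern (`sieve.function.get(...).run(...)` and `sieve.File`) is an assumption based on typical Sieve usage, not confirmed here; only the `speed_boost` parameter comes from the comment above, so check the linked usage guide for the real signature.

```python
# Hypothetical sketch of invoking sieve/speech_transcriber with speed_boost enabled.
# The exact client API is assumed; see the "Usage Guide" tab for the authoritative version.
import sieve

audio = sieve.File(path="speech.mp3")  # placeholder local file
transcriber = sieve.function.get("sieve/speech_transcriber")
output = transcriber.run(audio, speed_boost=True)  # speed_boost=True selects the turbo model
print(output)
```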
It would be great if medium (or more sizes) were added for comparison! In OpenAI's implementation, turbo is 8x faster than v3 (medium is 2x and base is 7x) while offering a WER similar to large-v2, which sounds surreal. I wonder how that translates to the FW version.
In my test, with the same 10-minute audio, Medium took 52 seconds, and Turbo took 39 seconds.
You may find this discussion helpful: https://github.com/openai/whisper/discussions/2363#discussion-7264254
Compared to Medium, Turbo and large-v3 have a timeline that is shifted earlier. Subtitles generated by Turbo appear earlier but end precisely on time. Medium subtitles also appear early, though to a much lesser extent than Turbo; however, Medium subtitles are delayed in disappearing.
I find that subtitles disappearing late is a better experience than them appearing early. I should still use medium.
Or you could use a forced alignment model after the transcription; it gives much better timings than Whisper.
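One way to do that (a sketch, not necessarily what the commenter uses) is to re-align the Whisper segments with a wav2vec2-based aligner such as WhisperX; the audio path and language code are placeholders, and `segments` is assumed to come from a previous faster-whisper run.

```python
# Sketch: refine Whisper segment timings with WhisperX forced alignment.
import whisperx

device = "cuda"
audio = whisperx.load_audio("audio.wav")

align_model, metadata = whisperx.load_align_model(language_code="en", device=device)
aligned = whisperx.align(
    [{"start": s.start, "end": s.end, "text": s.text} for s in segments],  # faster-whisper output
    align_model, metadata, audio, device,
)
for seg in aligned["segments"]:
    print(f"[{seg['start']:.2f}s -> {seg['end']:.2f}s] {seg['text']}")
```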
Great, thanks for your efforts! I hope turbo will be officially added here soon!
https://github.com/SYSTRAN/faster-whisper/blob/d57c5b40b06e59ec44240d93485a95799548af50/faster_whisper/utils.py#L12-L29
I benchmarked the models on my laptop using the same audio file in both sequential and batched processing. Seems that large-v3-turbo generally performs exceptionally well, offering greater accuracy than the base model while maintaining efficient processing times.
System Specifications
- CPU: Intel Core i7-12650H
- GPU: NVIDIA GeForce RTX 3060 Laptop (6 GB VRAM)
- RAM: SODIMM Samsung DDR4 8x2 GB 3200 MHz
Benchmark Details
- All models were tested with int8 precision.
- WER (Word Error Rate) was calculated by comparing the original French subtitles of a video with the transcriptions generated by the models.
- The language was explicitly set to French to prevent any translation errors or incorrect transcriptions (see the sketch below).
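Roughly, each sequential run looked like the sketch below. This is my reconstruction under the settings listed above, not the author's script; the model id, audio path, and reference-subtitle file are placeholders, and jiwer is one common way to compute WER.

```python
# Sketch of one sequential int8 run with the language pinned to French.
from faster_whisper import WhisperModel
from jiwer import wer

MODEL_ID = "mobiuslabsgmbh/faster-whisper-large-v3-turbo"  # swap in any of the sizes compared below

model = WhisperModel(MODEL_ID, device="cuda", compute_type="int8")
segments, info = model.transcribe("video_fr.wav", language="fr")  # language forced, no auto-detection
hypothesis = " ".join(seg.text.strip() for seg in segments)

reference = open("subtitles_fr.txt", encoding="utf-8").read()  # original French subtitles (placeholder path)
print("WER:", wer(reference, hypothesis))
```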
Sequential Processing Benchmark
| Model | WER (%) | Total Time (s) | Transcribe Time (s) | Model Load Time (s) |
|---|---|---|---|---|
| tiny | 24.1% | 28.95 | 28.44 | 0.51 |
| base | 16.0% | 33.42 | 32.72 | 0.70 |
| small | 10.5% | 55.62 | 53.21 | 2.41 |
| medium | 10.7% | 113.25 | 106.30 | 6.95 |
| large | 17.6% | 240.52 | 227.31 | 13.20 |
| large-v1 | 8.7% | 168.58 | 155.14 | 13.44 |
| large-v2 | 8.5% | 178.28 | 164.74 | 13.53 |
| large-v3 | 17.6% | 230.77 | 217.43 | 13.34 |
| large-v3-turbo | 9.5% | 46.14 | 38.99 | 7.15 |
Observations:
- The `large-v3-turbo` model achieves a WER of 9.5%, which is significantly better than the `base` model and comparable to `large-v2`.
- In terms of speed, `large-v3-turbo` completes transcription in 38.99 seconds, much faster than the other large models.
Batched Processing Benchmark
For batched processing, I used 10 batches for each model. I tried 16 batches, but some models threw out-of-memory (OOM) errors due to the 6 GB VRAM limit.
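For context, a batched run can be expressed roughly as below with the `BatchedInferencePipeline` that recent faster-whisper versions ship. This is a sketch, not the exact benchmark code; the model id and audio path are placeholders.

```python
# Sketch of a batched run with batch_size=10; lower it if you hit OOM on a 6 GB GPU.
from faster_whisper import BatchedInferencePipeline, WhisperModel

model = WhisperModel("mobiuslabsgmbh/faster-whisper-large-v3-turbo",
                     device="cuda", compute_type="int8")
batched = BatchedInferencePipeline(model=model)

segments, info = batched.transcribe("video_fr.wav", language="fr", batch_size=10)
for seg in segments:
    print(f"[{seg.start:.2f}s -> {seg.end:.2f}s] {seg.text}")
```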
| Model | WER (%) | Total Time (s) | Transcribe Time (s) | Model Load Time (s) |
|---|---|---|---|---|
| tiny | 23.6% | 5.48 | 4.56 | 0.92 |
| base | 16.5% | 6.92 | 5.70 | 1.22 |
| small | 9.8% | 12.45 | 9.98 | 2.47 |
| medium | 8.9% | 26.33 | 19.47 | 6.86 |
| large | 7.9% | 35.97 | 29.66 | 6.31 |
| large-v1 | 12.1% | 42.90 | 29.64 | 13.26 |
| large-v2 | 8.8% | 43.17 | 29.71 | 13.46 |
| large-v3 | 7.9% | 42.97 | 29.69 | 13.28 |
| large-v3-turbo | 7.7% | 18.68 | 11.47 | 7.20 |
Observations:
- With batched processing, `large-v3-turbo` achieves the best WER of 7.7%, outperforming all other models in both accuracy and speed.
- The transcribe time for `large-v3-turbo` is 11.47 seconds, making it suitable for real-time applications even on a laptop GPU.
Conclusions
- The `large-v3-turbo` model offers an excellent balance between accuracy and processing speed, especially evident in batched processing scenarios.
- It outperforms the `base` model in terms of WER while maintaining significantly lower processing times compared to other large models.
Just chiming in: I've tried using v3-turbo for streaming and found that it hallucinates more / misses audio more than other faster-whisper models. For example, for this 10-second audio clip of an Obama speech: temp_audio_wav.zip. Using v3-turbo:
from faster_whisper import WhisperModel

model = WhisperModel(model_size_or_path="deepdml/faster-whisper-large-v3-turbo-ct2", device="cuda", compute_type="float16")
segments, info = model.transcribe("temp_audio.wav", initial_prompt="",
                                  max_new_tokens=224,
                                  beam_size=5,
                                  temperature=0,
                                  language="en",
                                  word_timestamps=True,
                                  vad_filter=False)
for seg in segments:
    print("[%.2fs -> %.2fs] %s" % (seg.start, seg.end, seg.text))
produces
[0.00s -> 9.66s] ...to give the president a chance.
Whereas using medium:
from faster_whisper import WhisperModel

model = WhisperModel(model_size_or_path="medium", device="cuda", compute_type="float16")
segments, info = model.transcribe("temp_audio.wav", initial_prompt="",
                                  max_new_tokens=224,
                                  beam_size=5,
                                  temperature=0,
                                  language="en",
                                  word_timestamps=True,
                                  vad_filter=False)
for seg in segments:
    print("[%.2fs -> %.2fs] %s" % (seg.start, seg.end, seg.text))
produces
[0.00s -> 1.70s] So give the president a chance.
[2.00s -> 4.96s] Governor Romney, I'm glad that you recognize that al-Qaeda is a threat.
[5.38s -> 9.66s] Because a few months ago, when you were asked what's the biggest geopolitical threat facing America, you said...
Not sure if any of you are experiencing anything similar? Or maybe an official faster-whisper turbo-v3 release would perform better.
Haven't encountered that. I've tried the same audio and both models return the same transcription. I did notice that the turbo model hallucinates more on noisy data than v3, but that's to be expected considering what we saw with the Common Voice 15 benchmark.
Sharing the benchmarking results for the Turbo model compared to other large Whisper models on one of the biggest open-source long-form ASR evaluation datasets. Our tests were conducted on a subset of YouTube-Commons (youtube-commons-asr-eval).
| Model | WER | Speed |
|---|---|---|
| large v3-turbo | 13.40% | 129.5x |
| large v3 | 13.20% | 55.3x |
| large v2 | 14.10% | 54.6x |
| distil-large-v3 (en) | 15.00% | 142.9x |
The whisper-turbo model achieves a Word Error Rate (WER) similar to large models and excels in speed among multilingual models.
Also, it does not seem to support the translation task, even though it is mentioned. I also tried it with the Transformers large-v3-turbo; same behaviour.
They specifically mentioned the translation task being excluded... https://github.com/openai/whisper/discussions/2363
excluding translation data, on which we don’t expect turbo to perform well.
Oh okay. I looked at the Hugging Face page, and the translation task is mentioned there. But maybe it was just copy-pasted from the original large-v3 model.
Thanks for pointing that out.
Interesting, did you try the audio I provided here? It's actually remarkably consistent in how it's worse on short audio clips for me (with "deepdml/faster-whisper-large-v3-turbo-ct2").
Yeah, I did specifically try it on your clip with the same model. But I also did it with batched processing rather than sequential, so I'm not sure whether this is a specific issue with it or not. Batched processing with 6-10 batches works best on my setup and actually provides more accurate transcriptions, as you can see from my little benchmark earlier in this thread, so I use it for everything.
Btw, this happens with the non-turbo v3 model as well. I've tried this with a lot of audio files of variable length and it happens a lot, so I've rolled back to the v2 model.
Thank you so much for recording these benchmarks, I almost cannot believe the speed and quality of these models.
I have one request - for anyone reading this who is benchmarking.
Could the people performing benchmarks please record the hardware they are benchmarking on if possible (GPU, CPU and RAM)? Such information will help estimate the minimum cost of developing a real-time ASR application, e.g. the smallest hardware budget possible using large-v3-turbo.
Thanks in advance.
It seems that long-duration audio files cannot be processed correctly.
When I tested with an 11-hour MP3 file, the memory usage quickly spiked above 26 GB. After 3 minutes, the CLI displayed "Killed" and then exited.
Is it a problem only with the Turbo model?
No, using “large-v3” also doesn’t work. The memory usage spikes to 27GB, and then it shows “Killed.”
model = WhisperModel("large-v3", device="cuda", compute_type="float16")
segments, info = model.transcribe("11hours.mp3", word_timestamps=True)
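A common workaround (a sketch of my own, not a fix inside faster-whisper) is to split the long file with ffmpeg and transcribe the chunks one by one, so feature extraction never sees the full 11 hours at once. The chunk length and file names are arbitrary, and the printed timestamps are relative to each chunk, so you would need to add the chunk offsets back for global times.

```python
# Sketch: split an 11-hour MP3 into 30-minute pieces, then transcribe each piece.
import glob
import subprocess

from faster_whisper import WhisperModel

subprocess.run(
    ["ffmpeg", "-i", "11hours.mp3", "-f", "segment", "-segment_time", "1800",
     "-c", "copy", "chunk_%03d.mp3"],
    check=True,
)

model = WhisperModel("large-v3", device="cuda", compute_type="float16")
for path in sorted(glob.glob("chunk_*.mp3")):
    segments, info = model.transcribe(path, word_timestamps=True)
    for seg in segments:
        # seg.start / seg.end are relative to this chunk, not the original file
        print(path, "[%.2fs -> %.2fs] %s" % (seg.start, seg.end, seg.text))
```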
Does torch.stft cause the GPU OOM? I know that the old STFT matrix multiplication, `mel_spec = self.mel_filters @ magnitudes`, would use a large amount of memory for long audio files. For this reason, I wrote a batched version before:
https://github.com/ben91lin/faster-whisper/blob/mochi/faster_whisper/feature_extractor.py
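The idea there is simply to apply the mel filterbank to the magnitude spectrogram in slices along the time axis instead of one huge matmul. A rough NumPy sketch of that pattern (my own illustration with assumed shapes, not the linked patch; it only bounds the matmul intermediates, the STFT itself can still dominate memory):

```python
import numpy as np

def mel_spectrogram_chunked(mel_filters: np.ndarray,
                            magnitudes: np.ndarray,
                            chunk_frames: int = 100_000) -> np.ndarray:
    """Apply the mel filterbank in time slices to limit peak memory.

    mel_filters: (n_mels, n_freq), magnitudes: (n_freq, n_frames).
    """
    n_frames = magnitudes.shape[-1]
    out = np.empty((mel_filters.shape[0], n_frames), dtype=magnitudes.dtype)
    for start in range(0, n_frames, chunk_frames):
        stop = min(start + chunk_frames, n_frames)
        out[:, start:stop] = mel_filters @ magnitudes[:, start:stop]
    return out
```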
Why does this repo use mobiuslabsgmbh/faster-whisper-large-v3-turbo and not deepdml/faster-whisper-large-v3-turbo-ct2? And why not something like Systran/faster-whisper-v3-turbo?
See Code: https://github.com/SYSTRAN/faster-whisper/blob/master/faster_whisper/utils.py
HF-Links: https://huggingface.co/deepdml/faster-whisper-large-v3-turbo-ct2 https://huggingface.co/mobiuslabsgmbh/faster-whisper-large-v3-turbo
Because the deepdml conversion has wrong alignment heads and tokenizer config; this mainly affects word timestamps. The Mobius Labs conversion is closer to the official one.
Just chiming in, I've tried using v3-turbo for streaming and found that it hallucinates more/misses audio more than other faster-whisper models. For example for this 10 second audio clip of an obama speech:
@tjongsma I could reproduce your findings (with float32 due to my graphics card, but that shouldn't make a difference).
Every single point of the following fixes the issue in my experiments (a combined sketch follows the list):
- Use `initial_prompt="The following is a speech:"`
- Use `word_timestamps=False`
- Use `mobiuslabsgmbh/faster-whisper-large-v3-turbo` instead of `deepdml/faster-whisper-large-v3-turbo-ct2`
- (Adding 5 seconds of silence at the beginning of the audio file makes it somewhat better, but then "to give the president a chance." is missed)
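Putting the first three points together, here is a sketch of the settings that avoided the dropped segments for me (each point also helps on its own; the clip path is the one from the earlier comment):

```python
# Sketch: mobiuslabs conversion, an initial prompt, and word timestamps disabled.
from faster_whisper import WhisperModel

model = WhisperModel("mobiuslabsgmbh/faster-whisper-large-v3-turbo",
                     device="cuda", compute_type="float32")  # float16 works on newer GPUs
segments, info = model.transcribe(
    "temp_audio.wav",
    initial_prompt="The following is a speech:",
    word_timestamps=False,
    beam_size=5,
    temperature=0,
    language="en",
    vad_filter=False,
)
for seg in segments:
    print("[%.2fs -> %.2fs] %s" % (seg.start, seg.end, seg.text))
```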
(@usergit @NilaierMusic @zxl777) Large-v3 does work on @tjongsma's clip:
[0.00s -> 1.58s] So give the president a chance.
[1.68s -> 4.76s] Governor Romney, I'm glad that you recognize that al-Qaeda is a threat.
[4.92s -> 9.58s] Because a few months ago, when you were asked what's the biggest geopolitical threat facing America, you said
(Due to my graphics card I use float32, but it should probably be the same with float16.) But I see that OpenAI uses large-v2 themselves for their API, see here. That could just be because they didn't test large-v3 enough.
Why wouldn't they use large-v2 for making the turbo instead if they really saw problems with v3?
@MahmoudAshraf97 The third point of the list above confirms what you're saying. But I still wonder why we don't have something like Systran/faster-whisper-v3-turbo as with the other model sizes. Also, is mobiuslabsgmbh trustworthy enough (fewer downloads)?
Edit: Question: what thresholds should be set to get less hallucination in low-volume parts?
The number of downloads doesn't imply trustworthiness; deepdml has many more downloads because it was uploaded first and shared more widely than mobiuslabs. When I chose between the two, I used the OpenAI model as the reference, and mobiuslabs was identical, unlike deepdml. The difference between the two conversions is subtle enough that almost no one will notice any difference in performance except in some edge cases. Systran didn't upload the new model because they are busy with internal projects, so the community took it into their own hands. Having a Systran conversion wouldn't make any difference, though, because converting a model is a single line of code that anyone can execute.
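For reference, that one line is the CTranslate2 converter. A sketch of converting the turbo checkpoint yourself (the output directory name and quantization are up to you; this needs transformers installed and is presumably the same converter the existing community conversions were made with):

```python
# Sketch: convert openai/whisper-large-v3-turbo into a CTranslate2 model for faster-whisper.
from ctranslate2.converters import TransformersConverter

converter = TransformersConverter(
    "openai/whisper-large-v3-turbo",
    copy_files=["tokenizer.json", "preprocessor_config.json"],  # files faster-whisper expects next to the model
)
converter.convert("faster-whisper-large-v3-turbo", quantization="float16")
```

The resulting directory can then be passed straight to `WhisperModel("faster-whisper-large-v3-turbo")`.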
Thanks so much for verifying @DoS007! I was a bit suspicious of the deepdml model at the time, but unfortunately there were no alternatives. Will use mobiuslabsgmbh/faster-whisper-large-v3-turbo going forward!
Hi,
Any idea when Turbo V3 will be available in https://pypi.org/project/faster-whisper/ ?
I am interested in trying it out at https://github.com/runpod-workers/worker-faster_whisper/tree/main.
Thank you for all your effort.
@yccheok hopefully within the next two weeks
Any update on this?