faster-whisper
Benchmark faster whisper turbo v3
#WIP
Benchmark with faster-whisper-large-v3-turbo-ct2
For reference, here are the time and memory usage required to transcribe 13 minutes of audio using different implementations:
Large-v3 model on GPU
| Implementation | Precision | Beam size | Time | Max. GPU memory | Max. CPU memory | WER % |
|---|---|---|---|---|---|---|
| openai/whisper-large-v3 | fp16 | 5 | 2m23s | MB | MB | |
| openai/whisper-turbo | fp16 | 5 | 39s | MB | MB | |
| faster-whisper | fp16 | 5 | 52.023s | 4521MB | 901MB | 2.883 |
| faster-whisper | int8 | 5 | 52.639s | 2953MB | 2261MB | 4.594 |
| faster-distil-large-v3 | fp16 | 5 | 26.126s | 2409MB | 900MB | 2.392 |
| faster-distil-large-v3 | int8 | 5 | 22.537s | 1481MB | 1468MB | 2.392 |
| faster-large-v3-turbo | fp16 | 5 | 19.155s | 2537MB | 899MB | 1.919 |
| faster-large-v3-turbo | int8 | 5 | 19.591s | 1545MB | 1526MB | 1.919 |
WER measured on the LibriSpeech clean validation split.
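For anyone wanting to reproduce the time column, here is a minimal sketch of how such a run could be measured. It assumes the deepdml CT2 conversion used above, a placeholder audio path, and pynvml for GPU memory (which reports device-wide usage at a single point in time, not a true peak); it is not the exact benchmark script.

```python
# Rough sketch: time one transcription and read GPU memory usage afterwards.
import time

import pynvml
from faster_whisper import WhisperModel

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

model = WhisperModel("deepdml/faster-whisper-large-v3-turbo-ct2",
                     device="cuda", compute_type="float16")

start = time.perf_counter()
segments, info = model.transcribe("audio_13min.wav", beam_size=5)
text = "".join(seg.text for seg in segments)  # segments is a generator; consume it so decoding actually runs
elapsed = time.perf_counter() - start

used_mb = pynvml.nvmlDeviceGetMemoryInfo(handle).used / 1024**2
print(f"time: {elapsed:.1f}s, GPU memory in use: {used_mb:.0f} MB")
```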
We now support the new whisper-large-v3-turbo on Sieve!
Use it via sieve/speech_transcriber: https://www.sievedata.com/functions/sieve/speech_transcriber
Use sieve/whisper directly: https://www.sievedata.com/functions/sieve/whisper
Just set speed_boost to True. The API guide is under the "Usage Guide" tab.
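For illustration only, a call might look roughly like the sketch below. The client pattern (`sieve.function.get(...).run(...)` and `sieve.File`) is an assumption based on typical Sieve usage, not confirmed here; only the `speed_boost` parameter comes from the comment above, so check the linked usage guide for the real signature.

```python
# Hypothetical sketch of invoking sieve/speech_transcriber with speed_boost enabled.
# The exact client API is assumed; see the "Usage Guide" tab for the authoritative version.
import sieve

audio = sieve.File(path="speech.mp3")  # placeholder local file
transcriber = sieve.function.get("sieve/speech_transcriber")
output = transcriber.run(audio, speed_boost=True)  # speed_boost=True selects the turbo model
print(output)
```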
It would be great if medium (or more sizes) were added for comparison! In OpenAI's implementation, turbo is 8x faster than v3 (medium is 2x and base is 7x) while offering a WER similar to large-v2, which sounds surreal. I wonder how that translates to the FW version.
In my test, with the same 10-minute audio, Medium took 52 seconds, and Turbo took 39 seconds.
You may find this discussion helpful: https://github.com/openai/whisper/discussions/2363#discussion-7264254
Compared to Medium, Turbo and large-v3 have a timeline that is shifted earlier. Subtitles generated by Turbo appear earlier but end precisely on time. Medium subtitles also appear early, though to a much lesser extent than Turbo; however, Medium subtitles are delayed in disappearing.
I find that subtitles disappearing late is a better experience than them appearing early. I should still use medium.
Or you could use a forced alignment model after the transcription; it gives much better timings than Whisper.
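One way to do that (a sketch, not necessarily what the commenter uses) is to re-align the Whisper segments with a wav2vec2-based aligner such as WhisperX; the audio path and language code are placeholders, and `segments` is assumed to come from a previous faster-whisper run.

```python
# Sketch: refine Whisper segment timings with WhisperX forced alignment.
import whisperx

device = "cuda"
audio = whisperx.load_audio("audio.wav")

align_model, metadata = whisperx.load_align_model(language_code="en", device=device)
aligned = whisperx.align(
    [{"start": s.start, "end": s.end, "text": s.text} for s in segments],  # faster-whisper output
    align_model, metadata, audio, device,
)
for seg in aligned["segments"]:
    print(f"[{seg['start']:.2f}s -> {seg['end']:.2f}s] {seg['text']}")
```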
Great, thanks for your efforts! I hope turbo will be officially added here soon!
https://github.com/SYSTRAN/faster-whisper/blob/d57c5b40b06e59ec44240d93485a95799548af50/faster_whisper/utils.py#L12-L29
I benchmarked the models on my laptop using the same audio file in both sequential and batched processing. Seems that large-v3-turbo generally performs exceptionally well, offering greater accuracy than the base model while maintaining efficient processing times.
System Specifications
- CPU: Intel Core i7-12650H
- GPU: NVIDIA GeForce RTX 3060 Laptop (6 GB VRAM)
- RAM: SODIMM Samsung DDR4 8x2 GB 3200 MHz
Benchmark Details
- All models were tested with int8 precision.
- WER (Word Error Rate) was calculated by comparing the original French subtitles of a video with the transcriptions generated by the models.
- The language was explicitly set to French to prevent any translation errors or incorrect transcriptions (see the sketch below).
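Roughly, each sequential run looked like the sketch below. This is my reconstruction under the settings listed above, not the author's script; the model id, audio path, and reference-subtitle file are placeholders, and jiwer is one common way to compute WER.

```python
# Sketch of one sequential int8 run with the language pinned to French.
from faster_whisper import WhisperModel
from jiwer import wer

MODEL_ID = "mobiuslabsgmbh/faster-whisper-large-v3-turbo"  # swap in any of the sizes compared below

model = WhisperModel(MODEL_ID, device="cuda", compute_type="int8")
segments, info = model.transcribe("video_fr.wav", language="fr")  # language forced, no auto-detection
hypothesis = " ".join(seg.text.strip() for seg in segments)

reference = open("subtitles_fr.txt", encoding="utf-8").read()  # original French subtitles (placeholder path)
print("WER:", wer(reference, hypothesis))
```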
Sequential Processing Benchmark
| Model | WER (%) | Total Time (s) | Transcribe Time (s) | Model Load Time (s) |
|---|---|---|---|---|
| tiny | 24.1% | 28.95 | 28.44 | 0.51 |
| base | 16.0% | 33.42 | 32.72 | 0.70 |
| small | 10.5% | 55.62 | 53.21 | 2.41 |
| medium | 10.7% | 113.25 | 106.30 | 6.95 |
| large | 17.6% | 240.52 | 227.31 | 13.20 |
| large-v1 | 8.7% | 168.58 | 155.14 | 13.44 |
| large-v2 | 8.5% | 178.28 | 164.74 | 13.53 |
| large-v3 | 17.6% | 230.77 | 217.43 | 13.34 |
| large-v3-turbo | 9.5% | 46.14 | 38.99 | 7.15 |
Observations:
- The `large-v3-turbo` model achieves a WER of 9.5%, which is significantly better than the `base` model and comparable to `large-v2`.
- In terms of speed, `large-v3-turbo` completes transcription in 38.99 seconds, much faster than the other large models.
Batched Processing Benchmark
For batched processing, I used 10 batches for each model. I tried 16 batches, but some models threw out-of-memory (OOM) errors due to the 6 GB VRAM limit.
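For context, a batched run can be expressed roughly as below with the `BatchedInferencePipeline` that recent faster-whisper versions ship. This is a sketch, not the exact benchmark code; the model id and audio path are placeholders.

```python
# Sketch of a batched run with batch_size=10; lower it if you hit OOM on a 6 GB GPU.
from faster_whisper import BatchedInferencePipeline, WhisperModel

model = WhisperModel("mobiuslabsgmbh/faster-whisper-large-v3-turbo",
                     device="cuda", compute_type="int8")
batched = BatchedInferencePipeline(model=model)

segments, info = batched.transcribe("video_fr.wav", language="fr", batch_size=10)
for seg in segments:
    print(f"[{seg.start:.2f}s -> {seg.end:.2f}s] {seg.text}")
```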
| Model | WER (%) | Total Time (s) | Transcribe Time (s) | Model Load Time (s) |
|---|---|---|---|---|
| tiny | 23.6% | 5.48 | 4.56 | 0.92 |
| base | 16.5% | 6.92 | 5.70 | 1.22 |
| small | 9.8% | 12.45 | 9.98 | 2.47 |
| medium | 8.9% | 26.33 | 19.47 | 6.86 |
| large | 7.9% | 35.97 | 29.66 | 6.31 |
| large-v1 | 12.1% | 42.90 | 29.64 | 13.26 |
| large-v2 | 8.8% | 43.17 | 29.71 | 13.46 |
| large-v3 | 7.9% | 42.97 | 29.69 | 13.28 |
| large-v3-turbo | 7.7% | 18.68 | 11.47 | 7.20 |
Observations:
- With batched processing, `large-v3-turbo` achieves the best WER of 7.7%, outperforming all other models in both accuracy and speed.
- The transcribe time for `large-v3-turbo` is 11.47 seconds, making it suitable for real-time applications even on a laptop GPU.
Conclusions
- The `large-v3-turbo` model offers an excellent balance between accuracy and processing speed, especially evident in batched processing scenarios.
- It outperforms the `base` model in terms of WER while maintaining significantly lower processing times compared to other large models.
Just chiming in: I've tried using v3-turbo for streaming and found that it hallucinates more / misses audio more than other faster-whisper models. For example, for this 10-second audio clip of an Obama speech: temp_audio_wav.zip. Using v3-turbo:
from faster_whisper import WhisperModel

model = WhisperModel(model_size_or_path="deepdml/faster-whisper-large-v3-turbo-ct2", device="cuda", compute_type="float16")
segments, info = model.transcribe("temp_audio.wav", initial_prompt="",
                                  max_new_tokens=224,
                                  beam_size=5,
                                  temperature=0,
                                  language="en",
                                  word_timestamps=True,
                                  vad_filter=False)
for seg in segments:
    print("[%.2fs -> %.2fs] %s" % (seg.start, seg.end, seg.text))
produces
[0.00s -> 9.66s] ...to give the president a chance.
Whereas using medium:
from faster_whisper import WhisperModel

model = WhisperModel(model_size_or_path="medium", device="cuda", compute_type="float16")
segments, info = model.transcribe("temp_audio.wav", initial_prompt="",
                                  max_new_tokens=224,
                                  beam_size=5,
                                  temperature=0,
                                  language="en",
                                  word_timestamps=True,
                                  vad_filter=False)
for seg in segments:
    print("[%.2fs -> %.2fs] %s" % (seg.start, seg.end, seg.text))
produces
[0.00s -> 1.70s] So give the president a chance.
[2.00s -> 4.96s] Governor Romney, I'm glad that you recognize that al-Qaeda is a threat.
[5.38s -> 9.66s] Because a few months ago, when you were asked what's the biggest geopolitical threat facing America, you said...
Not sure if any of you are experiencing anything similar? Or maybe an official faster-whisper turbo-v3 release would perform better.
Haven't encountered that. I've tried the same audio and both models return the same transcription. I did notice that the turbo model hallucinates more on noisy data than v3, but that's to be expected considering what we saw with the Common Voice 15 benchmark.
Sharing the benchmarking results for the Turbo model compared to other large Whisper models on one of the biggest open-source long-form ASR evaluation datasets. Our tests were conducted on a subset of YouTube-Commons (youtube-commons-asr-eval).
| Model | WER | Speed |
|---|---|---|
| large v3-turbo | 13.40% | 129.5x |
| large v3 | 13.20% | 55.3x |
| large v2 | 14.10% | 54.6x |
| distil-large-v3 (en) | 15.00% | 142.9x |
The whisper-turbo model achieves a Word Error Rate (WER) similar to large models and excels in speed among multilingual models.
Also, it does not seem to support the translation task, even though it is mentioned. I also tried it with the Transformers large-v3-turbo; same behaviour.
They specifically mentioned the translation task being excluded... https://github.com/openai/whisper/discussions/2363
excluding translation data, on which we don’t expect turbo to perform well.
Oh okay. I looked at the Hugging Face page, and the translation task is mentioned there. But maybe it was just copy-pasted from the original large-v3 model.
Thanks for pointing that out.
Interesting, did you try the audio I provided here? It's actually remarkably consistent in how it's worse on short audio clips for me (with "deepdml/faster-whisper-large-v3-turbo-ct2").
Yeah, I did specifically try it on your clip with the same model. But I also did it with batched processing rather than sequential, so I'm not sure whether this is a specific issue with it or not. Batched processing with 6-10 batches works best on my setup and actually provides more accurate transcriptions, as you can see from my little benchmark earlier in this thread, so I use it for everything.
Btw, this happens with the non-turbo v3 model as well. I've tried this with a lot of audio files of variable length and it happens a lot, so I've rolled back to the v2 model.
Thank you so much for recording these benchmarks, I almost cannot believe the speed and quality of these models.
I have one request - for anyone reading this who is benchmarking.
Could the people performing benchmarks please record the hardware they are benchmarking on if possible (GPU, CPU and RAM)? Such information will help estimate the minimum cost of developing a real-time ASR application, e.g. the smallest hardware budget possible using large-v3-turbo.
Thanks in advance.
It seems that long-duration audio files cannot be processed correctly.
When I tested with an 11-hour MP3 file, the memory usage quickly spiked above 26 GB. After 3 minutes, the CLI displayed "Killed" and then exited.
Is it a problem only with the Turbo model?
No, using “large-v3” also doesn’t work. The memory usage spikes to 27GB, and then it shows “Killed.”
model = WhisperModel("large-v3", device="cuda", compute_type="float16")
segments, info = model.transcribe("11hours.mp3", word_timestamps=True)
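A common workaround (a sketch of my own, not a fix inside faster-whisper) is to split the long file with ffmpeg and transcribe the chunks one by one, so feature extraction never sees the full 11 hours at once. The chunk length and file names are arbitrary, and the printed timestamps are relative to each chunk, so you would need to add the chunk offsets back for global times.

```python
# Sketch: split an 11-hour MP3 into 30-minute pieces, then transcribe each piece.
import glob
import subprocess

from faster_whisper import WhisperModel

subprocess.run(
    ["ffmpeg", "-i", "11hours.mp3", "-f", "segment", "-segment_time", "1800",
     "-c", "copy", "chunk_%03d.mp3"],
    check=True,
)

model = WhisperModel("large-v3", device="cuda", compute_type="float16")
for path in sorted(glob.glob("chunk_*.mp3")):
    segments, info = model.transcribe(path, word_timestamps=True)
    for seg in segments:
        # seg.start / seg.end are relative to this chunk, not the original file
        print(path, "[%.2fs -> %.2fs] %s" % (seg.start, seg.end, seg.text))
```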
Does torch.stft cause the GPU OOM? I know that the old STFT matrix multiplication, `mel_spec = self.mel_filters @ magnitudes`, would use a large amount of memory for long audio files. For this reason, I wrote a batched version before:
https://github.com/ben91lin/faster-whisper/blob/mochi/faster_whisper/feature_extractor.py
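The idea there is simply to apply the mel filterbank to the magnitude spectrogram in slices along the time axis instead of one huge matmul. A rough NumPy sketch of that pattern (my own illustration with assumed shapes, not the linked patch; it only bounds the matmul intermediates, the STFT itself can still dominate memory):

```python
import numpy as np

def mel_spectrogram_chunked(mel_filters: np.ndarray,
                            magnitudes: np.ndarray,
                            chunk_frames: int = 100_000) -> np.ndarray:
    """Apply the mel filterbank in time slices to limit peak memory.

    mel_filters: (n_mels, n_freq), magnitudes: (n_freq, n_frames).
    """
    n_frames = magnitudes.shape[-1]
    out = np.empty((mel_filters.shape[0], n_frames), dtype=magnitudes.dtype)
    for start in range(0, n_frames, chunk_frames):
        stop = min(start + chunk_frames, n_frames)
        out[:, start:stop] = mel_filters @ magnitudes[:, start:stop]
    return out
```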
Why does this repo use mobiuslabsgmbh/faster-whisper-large-v3-turbo and not deepdml/faster-whisper-large-v3-turbo-ct2? And why not something like Systran/faster-whisper-v3-turbo?
See Code: https://github.com/SYSTRAN/faster-whisper/blob/master/faster_whisper/utils.py
HF-Links: https://huggingface.co/deepdml/faster-whisper-large-v3-turbo-ct2 https://huggingface.co/mobiuslabsgmbh/faster-whisper-large-v3-turbo
Because the deepdml conversion has wrong alignment heads and tokenizer config; this mainly affects word timestamps. The Mobius Labs conversion is closer to the official one.
Just chiming in, I've tried using v3-turbo for streaming and found that it hallucinates more/misses audio more than other faster-whisper models. For example for this 10 second audio clip of an obama speech:
@tjongsma I could reproduce your findings (with float32 due to my graphics card, but that shouldn't make a difference).
Every single point of the following fixes the issue in my experiments (a combined sketch follows the list):
- Use `initial_prompt="The following is a speech:"`
- Use `word_timestamps=False`
- Use `mobiuslabsgmbh/faster-whisper-large-v3-turbo` instead of `deepdml/faster-whisper-large-v3-turbo-ct2`
- (Adding 5 seconds of silence at the beginning of the audio file makes it somewhat better, but then "to give the president a chance." is missed)
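Putting the first three points together, here is a sketch of the settings that avoided the dropped segments for me (each point also helps on its own; the clip path is the one from the earlier comment):

```python
# Sketch: mobiuslabs conversion, an initial prompt, and word timestamps disabled.
from faster_whisper import WhisperModel

model = WhisperModel("mobiuslabsgmbh/faster-whisper-large-v3-turbo",
                     device="cuda", compute_type="float32")  # float16 works on newer GPUs
segments, info = model.transcribe(
    "temp_audio.wav",
    initial_prompt="The following is a speech:",
    word_timestamps=False,
    beam_size=5,
    temperature=0,
    language="en",
    vad_filter=False,
)
for seg in segments:
    print("[%.2fs -> %.2fs] %s" % (seg.start, seg.end, seg.text))
```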
(@usergit @NilaierMusic @zxl777) Large-v3 does work on @tjongsma's clip:
[0.00s -> 1.58s] So give the president a chance.
[1.68s -> 4.76s] Governor Romney, I'm glad that you recognize that al-Qaeda is a threat.
[4.92s -> 9.58s] Because a few months ago, when you were asked what's the biggest geopolitical threat facing America, you said
(Due to my graphics card I use float32, but it should probably be the same with float16.) But I see that OpenAI uses large-v2 themselves for their API, see here. That could just be because they didn't test large-v3 enough.
Why wouldn't they use large-v2 for making the turbo instead if they really saw problems with v3?
@MahmoudAshraf97 The third point of the list above confirms what you're saying. But I still wonder why we don't have something like Systran/faster-whisper-v3-turbo as with the other model sizes. Also, is mobiuslabsgmbh trustworthy enough (fewer downloads)?
Edit: Question: what thresholds should be set to get less hallucination in low-volume parts?
The number of downloads doesn't imply trustworthiness; deepdml has many more downloads because it was uploaded first and shared more widely than mobiuslabs. When I chose between the two, I used the OpenAI model as the reference, and mobiuslabs was identical, unlike deepdml. The difference between the two conversions is subtle enough that almost no one will notice any difference in performance except in some edge cases. Systran didn't upload the new model because they are busy with internal projects, so the community took it into their own hands. Having a Systran conversion wouldn't make any difference, though, because converting a model is a single line of code that anyone can execute.
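For reference, that one line is the CTranslate2 converter. A sketch of converting the turbo checkpoint yourself (the output directory name and quantization are up to you; this needs transformers installed and is presumably the same converter the existing community conversions were made with):

```python
# Sketch: convert openai/whisper-large-v3-turbo into a CTranslate2 model for faster-whisper.
from ctranslate2.converters import TransformersConverter

converter = TransformersConverter(
    "openai/whisper-large-v3-turbo",
    copy_files=["tokenizer.json", "preprocessor_config.json"],  # files faster-whisper expects next to the model
)
converter.convert("faster-whisper-large-v3-turbo", quantization="float16")
```

The resulting directory can then be passed straight to `WhisperModel("faster-whisper-large-v3-turbo")`.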
Thanks so much for verifying @DoS007! I was a bit suspicious of the deepdml model at the time, but unfortunately there were no alternatives. Will use mobiuslabsgmbh/faster-whisper-large-v3-turbo going forward!
Hi,
Any idea when Turbo V3 will be available in https://pypi.org/project/faster-whisper/ ?
I am interested in trying it out at https://github.com/runpod-workers/worker-faster_whisper/tree/main.
Thank you for all your effort.
@yccheok hopefully within the next two weeks
Any update on this?