tests : add WER benchmarks
It would be nice to start measuring the word error rate (WER) of whisper.cpp across some representative datasets:
- short audio
- long audio
- English
- non-English
- etc.
This will help us catch regressions in the future. I'm not familiar with what is typically used for speech-to-text WER benchmarks, so I'm looking for help from the community.
Hi Grigory, perhaps we can use LibriSpeech for measuring long audio (approximately 1000 hours, but we could trim it to fit the requirements). For short audio, we can use Libri-Light.
Alternatively, there are other audio datasets for measuring WER: https://github.com/jim-schwoebel/voice_datasets
I could start making small sample scripts to see how whisper.cpp fares on these datasets.
Thanks. Yes, I'm not sure what is typically used. But in general, I think any dataset would work. The main goal here is not to compare whisper.cpp numbers with other numbers, but to create a reference set of WER numbers that we track as the development continues. This would allow us to catch regressions when they appear, because the WER scores would get worse in such cases.
Ideally, we can have scripts that perform heavier benchmarks that developers would use locally. But we also need a mode where the scripts run just a few fast benchmarks that can be added to the CI without overloading it, so that these would be computed on every commit.
@harvestingmoon are you working on this?
@foldl Hi, yes, I'm looking at it; I will most likely start after the 12th, as it's currently the Chinese New Year period...
I think we need a tiny dataset (~10MB) just contained in this repo. WER can then be measured on-the-fly.
Sorry, please ignore the WER calculation above; I will develop another script, since the calculations are completely off from what they should be. I will also look for a smaller, lightweight dataset so that audio can be measured on the fly.
I have created a better and more robust lightweight script that meets the requirements, @foldl, @ggerganov.
WER is measured at 0.3.
It uses this lightweight dataset: https://arxiv.org/abs/2104.01497 and is based on NVIDIA's tutorial for calculating WER: https://docs.nvidia.com/deeplearning/riva/user-guide/docs/tutorials/tts-evaluate.html
My script calculates the WER for each individual audio file as well as the overall average; here is the pull request: #2824. For context, WER is measured between 0 and 1; a WER of around 0.33 means transcription accuracy is about 67%. The current measurement corresponds to roughly 70% accuracy, which is fairly good for a lightweight model.
Link for reference: https://huggingface.co/learn/audio-course/en/chapter5/evaluation
The pull request contains the script as well as the full ~10 MB dataset, so it stays fairly lightweight for on-the-fly measurement as well.
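For readers following along, here is a minimal sketch of the standard WER computation, assuming a simple normalization (lowercasing and stripping punctuation); it is not the script from #2824, and the normalization choices are illustrative only: normalize both texts, compute the word-level Levenshtein distance, and divide by the number of reference words.

```python
# Minimal WER sketch: word-level Levenshtein distance / number of reference words.
# The normalization below is an illustrative choice, not necessarily what the
# benchmark scripts in this thread do.
import re

def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation, split into words."""
    return re.sub(r"[^\w\s']", "", text.lower()).split()

def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = normalize(reference), normalize(hypothesis)
    # Classic dynamic-programming edit distance over words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            curr[j] = min(prev[j] + 1,               # deletion
                          curr[j - 1] + 1,           # insertion
                          prev[j - 1] + (r != h))    # substitution or match
        prev = curr
    return prev[-1] / max(len(ref), 1)

print(word_error_rate("the quick brown fox", "the quick brown box"))  # 0.25
```

A WER computed this way is 0 for a perfect transcription and can exceed 1 when the hypothesis contains many insertions.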
Hi @harvestingmoon, thank you for the effort, but I expect more attention to detail. Will close the PR for now and let someone else give this a try.
Hint - in terms of tangible benefit to the user, it's pointless to use an unmodified WER. An even more pointless metric is the CER used in the paper, which suggests the tutorial/article was not actually read.
Hint - never use jiwer; it's missing a few important metrics that would be there if the referenced paper had been understood rather than merely parsed. See the reference below.
Morris, Andrew & Maier, Viktoria & Green, Phil (2004). From WER and RIL to MER and WIL: improved evaluation measures for connected speech recognition.
At the heart of RIL is a desire to "rank" errors (to borrow a phrasing from the mostly obnoxious, occasionally useful, Stack Overflow site).
To summarize succinctly, without any of the commonly employed deceit: no attention to gross details and no understanding are the root cause of the above failure.
But failure is part of the process that leads to remarkable success. Fail often.
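For reference, the measures from the Morris et al. (2004) paper are straightforward once a word-level alignment has produced hit/substitution/deletion/insertion counts. A sketch of the formulas follows; the counts themselves are assumed to come from an aligner such as the meter sketched a bit further below:

```python
def asr_measures(H: int, S: int, D: int, I: int) -> dict:
    """WER, MER and WIL as defined in Morris, Maier & Green (2004).

    H = hits, S = substitutions, D = deletions, I = insertions,
    obtained from a word-level alignment of reference and hypothesis.
    """
    n_ref = H + S + D                      # words in the reference
    n_hyp = H + S + I                      # words in the hypothesis
    wer = (S + D + I) / n_ref              # word error rate, can exceed 1.0
    mer = (S + D + I) / (H + S + D + I)    # match error rate, bounded to [0, 1]
    wil = 1.0 - (H * H) / (n_ref * n_hyp)  # word information lost, in [0, 1]
    return {"WER": wer, "MER": mer, "WIL": wil}

# Example with made-up counts: 90 hits, 5 substitutions, 3 deletions, 4 insertions.
print(asr_measures(H=90, S=5, D=3, I=4))
```

Unlike WER, MER and WIL are bounded to [0, 1], which is part of the paper's argument for preferring them.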
Mostly indirectly related: how about adding germane labels to issues, so that interested contributors/collaborators can make doc contributions (for lack of more congruous words)? Please see https://github.com/ggerganov/whisper.cpp/issues/2848
Thanks
Shouldn't the very first step be to add a minimal edit distance implementation (used to compute WER/TER), perhaps header-only, to measure it? E.g. https://github.com/flashlight/flashlight/blob/f59d770b52ea678b039d9ba44693341ba80cf7c5/flashlight/fl/meter/EditDistanceMeter.h
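Along those lines, here is a rough sketch of what a minimal edit-distance "meter" could look like, written in Python rather than as a C++ header and only loosely modeled on the flashlight class linked above: it aligns reference and hypothesis word sequences with dynamic programming, tracks hit/substitution/deletion/insertion counts, and accumulates them across utterances so that a corpus-level WER (or the MER/WIL above) can be reported.

```python
from dataclasses import dataclass

@dataclass
class EditDistanceMeter:
    """Accumulates word-level edit operations across utterances."""
    hits: int = 0
    subs: int = 0
    dels: int = 0
    ins: int = 0

    def add(self, reference: list[str], hypothesis: list[str]) -> None:
        n, m = len(reference), len(hypothesis)
        # dp[i][j] = (cost, hits, subs, dels, ins) for ref[:i] vs hyp[:j].
        dp = [[None] * (m + 1) for _ in range(n + 1)]
        dp[0][0] = (0, 0, 0, 0, 0)
        for i in range(1, n + 1):
            dp[i][0] = (i, 0, 0, i, 0)        # delete all reference words so far
        for j in range(1, m + 1):
            dp[0][j] = (j, 0, 0, 0, j)        # insert all hypothesis words so far
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                c, h, s, d, k = dp[i - 1][j - 1]
                if reference[i - 1] == hypothesis[j - 1]:
                    diag = (c, h + 1, s, d, k)             # hit
                else:
                    diag = (c + 1, h, s + 1, d, k)         # substitution
                c, h, s, d, k = dp[i - 1][j]
                up = (c + 1, h, s, d + 1, k)               # deletion
                c, h, s, d, k = dp[i][j - 1]
                left = (c + 1, h, s, d, k + 1)             # insertion
                dp[i][j] = min(diag, up, left)             # lowest cost wins; ties broken arbitrarily
        _, h, s, d, k = dp[n][m]
        self.hits += h
        self.subs += s
        self.dels += d
        self.ins += k

    def wer(self) -> float:
        n_ref = self.hits + self.subs + self.dels
        return (self.subs + self.dels + self.ins) / max(n_ref, 1)

# Accumulate over two toy utterances; corpus WER = (1 sub + 1 ins) / 6 ref words ≈ 0.333.
meter = EditDistanceMeter()
meter.add("the quick brown fox".split(), "the quick brown box".split())
meter.add("hello world".split(), "hello there world".split())
print(f"corpus WER: {meter.wer():.3f}")
```

A C++ version of this would be small enough to live as a header next to the test code, which seems to be what the comment above is suggesting.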
@ggerganov I'm not sure there is a reasonably sized dataset containing short audio, long audio, English, & non-English content.
What do you think about an approach like tests/run-tests.sh in the CI script to measure WER for a list of audio URLs in the aforementioned categories? Could print the results for each category/model as a table.
This would be lightweight for each commit.
We could have a script to download a larger dataset for local testing. Smaller datasets usually contain a single language or consistent audio duration.
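As a rough illustration of this idea (not an existing script in the repo), here is a sketch in Python: it downloads a short list of audio files, transcribes them with a local whisper.cpp build, and prints a per-category WER table. The URLs, reference transcripts, and binary/model paths are placeholders; the `--output-txt` and `--no-prints` flags are the ones that show up later in this thread.

```python
# Rough sketch of a lightweight CI WER check (placeholders throughout):
# download a few samples, transcribe them with a local whisper.cpp build,
# and print a per-category table.
import subprocess
import urllib.request

WHISPER_CLI = "./build/bin/whisper-cli"   # adjust to the local build layout
MODEL = "models/ggml-base.en.bin"

SAMPLES = [
    # (category, audio URL, reference transcript) -- placeholders
    ("short-en", "https://example.com/short-en.wav", "reference transcript here"),
    ("long-en",  "https://example.com/long-en.wav",  "reference transcript here"),
]

def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    prev = list(range(len(hyp) + 1))
    for i, rw in enumerate(ref, 1):
        curr = [i] + [0] * len(hyp)
        for j, hw in enumerate(hyp, 1):
            curr[j] = min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (rw != hw))
        prev = curr
    return prev[-1] / max(len(ref), 1)

def transcribe(path: str) -> str:
    # --output-txt writes the transcript to "<path>.txt"; --no-prints keeps CI logs clean.
    subprocess.run([WHISPER_CLI, "-m", MODEL, "-f", path,
                    "--output-txt", "--no-prints"], check=True)
    with open(path + ".txt") as f:
        return f.read()

print(f"{'category':<10} {'WER':>6}")
for category, url, reference in SAMPLES:
    local_path = f"{category}.wav"
    urllib.request.urlretrieve(url, local_path)
    print(f"{category:<10} {word_error_rate(reference, transcribe(local_path)):>6.3f}")
```

The same structure could be extended with a per-model loop so the CI prints one table row per model/category combination.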
> What do you think about an approach like tests/run-tests.sh in the CI script to measure WER for a list of audio URLs in the aforementioned categories? Could print the results for each category/model as a table.

Yes, sounds good. The CI should download audio files with wget or curl and run WER tests on them. We can combine different sources at the start. Later on, we can use the more powerful nodes, such as the CUDA and M1 machines, to run larger datasets.

> We could have a script to download a larger dataset for local testing. Smaller datasets usually contain a single language or consistent audio duration.

Yes, a much bigger dataset for local testing would be useful.
@ggerganov Hi. I have been working on this ticket for a while, and spent the last few days benchmarking whisper.cpp on the LibriSpeech corpus.
Now, here is the summary of my measurement results:
- The following graph shows the recognition accuracy (measured in Word Error Rate) on the LibriSpeech test-clean dataset.
Comparison with OpenAI whisper
To illustrate the result shown above, the following table compares whisper.cpp's performance with OpenAI's official WER scores.
To put it briefly, the performance was pretty much comparable!
| Model | WER [whisper.cpp] | WER [openai-whisper] * |
|---|---|---|
| tiny | 6.90 | 6.7 |
| base | 4.81 | 4.9 |
| small | 3.37 | 3.3 |
| medium | 2.70 | 2.7 |
| large-v1 | 2.67 | 2.8 |
| large-v2 | 2.58 | 2.5 |
| large-v3 | 1.85 | Not published |
| large-v3-turbo | 1.92 | Not published |
\* Official scores, retrieved from https://arxiv.org/abs/2212.04356 (Appendix D.1.2).
How I performed the benchmark test
I submitted the code I wrote for the benchmark test in PR #2999. The code basically follows how OpenAI evaluates their models.
The testing process is fairly automated (driven by a Makefile), and I also attached some documentation on how to use it.
Please tell me if anything is unclear! I hope it's interesting for you.
@fujimotos Thank you, this is very interesting! Will be taking a look in the next few days.
@ggerganov Thank you!
Technical note: how long it took to perform the full benchmark
This time, I rented an EC2 c8g.xlarge instance from AWS to perform the benchmark test.
It took roughly 80 hours to benchmark all eight model sizes. Here is the breakdown of the running time:
| MODEL | WER | TIME [REAL] | Real Time Factor |
|---|---|---|---|
| tiny | 6.90 | 28m | 0.08 |
| base | 4.81 | 56m | 0.17 |
| small | 3.37 | 3h2m | 0.56 |
| medium | 2.70 | 9h20m | 1.72 |
| large-v1 | 2.67 | 17h52m | 3.30 |
| large-v2 | 2.58 | 17h55m | 3.31 |
| large-v3 | 1.85 | 17h46m | 3.29 |
| large-v3-turbo | 1.92 | 14h28m | 2.67 |
Observation: trade-off between speed and accuracy
Looking at it from a different angle, I think this confirms the existence of a trade-off between speed and accuracy across the whisper.cpp models.
The following graph should illustrate the relationship:
- The X-axis ("Real time factor") is computed as (inference time) / (audio length), so lower is better (see the worked example below).
- Note that LibriSpeech test-clean contains 5 hours 24 minutes of speech.
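As a quick worked example of that definition, using the numbers from the table above: the tiny model needed 28 minutes of wall-clock time for the 5 h 24 min (324 min) of audio, which gives a real time factor of roughly 28 / 324 ≈ 0.09, in line with the ≈0.08 reported in the table (the small difference presumably comes from rounding the wall-clock time to whole minutes).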
It would be interesting to perform these benchmarks with Q8_0 quantized models and see how the WER changes. But I think it would be better to run this on a GPU in order to reduce the processing time. Will see how this performs on my M2 Ultra - I think it would be much faster than the AWS instance.
Here are some results on M2 Ultra with Flash Attention enabled:
| MODEL | WER | TIME [REAL] |
|---|---|---|
| base | 4.90 | 13m28s |
| base-q8_0 | 4.89 | 12m32s |
| small | 3.39 | 24m4s |
| small-q8_0 | 3.36 | 20m33s |
The timings might be a bit off because I was using the computer while the computations were running. But overall, there is no degradation in quality when going to the Q8 models, which is expected, but good to confirm.
The WER tests in #2999 are very useful, but all samples from the LibriSpeech dataset are relatively short and don't include non-speech segments. We should add some additional tests with longer audio samples, preferably with silence intervals, which is what usually trips up Whisper Large v3. When we add the VAD support (#3065) we will be able to measure quantitatively how much it improves the quality in such cases.
I guess this dataset has what you need; I'm using it for long-form evaluation in faster-whisper: https://github.com/SYSTRAN/faster-whisper/pull/1101
Actually, I know a couple of public benchmark datasets that can be used for this purpose.

> When we add the VAD support (#3065) we will be able to measure quantitatively how much it improves the quality in such cases.

If you don't mind, I think I'm able to post another PR next week that contains long-form WER benchmark testing.
@ggerganov @danbev I have just created a pull-request #3185 that adds a long-form transcription benchmark test.
Benchmark dataset
This time I used the Earnings-21 dataset by Del Rio et al. (2021), which provides 49 hours of English speech sourced from corporate earnings calls.
Earnings-21: A Practical Benchmark for ASR in the Wild (2021) https://arxiv.org/abs/2104.11348
Here is an audio example:
https://github.com/user-attachments/assets/b09512da-556e-4d5b-8495-fc9d7ac20e20
I think there are two benefits in using Earnings-21:
- It makes the benchmark result comparable. OpenAI used this dataset in their paper, so we can compare our WER score against OpenAI's official number.
- It is easy to access. The full dataset is distributed as a Git repo, and the total file size is relatively small (just 49 files in mp3 format).
Benchmark Result
I ran the benchmark test using two models: tiny and base. I also tested the
VAD support (introduced by #3065) to see if it improves the general accuracy.
The following table summarizes the benchmark result:
| Speech Recognition | WER (Tiny) | WER (Base) |
|---|---|---|
| Whisper.cpp | 17.37% | 12.53% |
| Whisper.cpp (w. VAD) | 18.91% | 15.70% |
| OpenAI Whisper | 18.7% | 13.5% |
| OpenAI Whisper (.en model) | 17.0% | 12.5% |
Some notes on this table:
- The version of whisper.cpp I used was 2c4b904596, and I enabled the VAD support by adding the following inference parameter: `WHISPER_FLAGS = --no-prints --language en --output-txt --vad --vad-model ../../models/silero-v5.1.2-ggml.bin`
- OpenAI's scores are retrieved from their original paper (see Appendix D.4).
Some Analysis and Insights
Wondering why VAD did not necessarily improve the recognition accuracy, I looked a bit deeper at the benchmark results.
First, the following graph shows the detailed WER score for each audio recording:
As you can see, the effect of enabling VAD is hit-and-miss. It improves the performance on some audio files, but degrades the accuracy on others.
Looking at the transcriptions produced by whisper.cpp, it seems that the VAD support does prevent hallucinations in some cases, but introduces new hallucinations on other audio data.
So its effectiveness in improving recognition accuracy was limited (I attached some hallucination examples below).
Appendix: Hallucination Examples
4359732.mp3 (Kuehne Nagel International)
Whisper.cpp (tiny)
we have a second we have a second really state portfolio is I at for sale I do not expect material impact on the PNL.
That is going to be the last of 10 million.
Thank you very much.
Thank you.
Thank you very much.
I want to thank you very much.
Thank you very much.
I want to thank you very much.
I want to thank you very much.
I want to thank you very much.
I want to thank you very much.
I want to thank you very much.
I want to thank you very much.
I want to thank you very much.
I want to thank you very much.
I want to thank you very much.
...
With VAD enabled
We have a second. We have a second real estate portfolio is for sale,
I do not expect material impact on the P&L other state. So it's going to be less than 10 million.
Thank you very much. Thank you.
Thank you for coming from Manivakaya, and from Bank of America, please go ahead.
- VAD prevents the hallucination successfully.
4320211.mp3 (Monroe Inc)
Whisper.cpp (tiny)
At the mid-point of our guidance range, we expect an operating margin of approximately $10.2% interest expense to be approximately $29 million,
depreciation and amortization to be approximately $65 million, and even to be approximately $196 million.
We expect capital expenditures to be approximately $60 million this year.
This guidance reflects an effective tax rate of approximately 23.5% and is based on $34 million diluted weighted average shares outstanding.
With VAD enabled
The appreciation and amortization to be approximately $65 million, and even to be approximately $196 million. We expect capital expenditures to be approximately $60 million this year.
An adjusted basis, which is an adjusted basis, which is an adjusted basis, which is an adjusted basis, which is an adjusted basis.
An adjusted basis, which is an adjusted basis, which is an adjusted basis, which is an adjusted basis, which is an adjusted basis.
An adjusted basis, which is an adjusted basis, which is an adjusted basis, which is an adjusted basis.
An adjusted basis, which is an adjusted basis, which is an adjusted basis, which is an adjusted basis, which is an adjusted basis.
An adjusted basis, which is an adjusted basis, which is an adjusted basis, which is an adjusted basis, which is an adjusted basis.
An adjusted basis, which is an adjusted basis, which is an adjusted basis, which is an adjusted basis, which is an adjusted basis.
- VAD introduces hallucination where it did not occur originally.
I'd be able to do a more detailed analysis if I had access to more computing power (currently I run all the benchmark tests on my personal AWS account), but anyway, this is the current state of the analysis so far.
@ggerganov @danbev If anything is not clear, please just ask me! Hope it's interesting to you.
@fujimotos Thank you - very interesting and useful! Will be taking a look in the next days - curious to understand why the VAD can lead to degraded quality. Somehow my expectation is that it should either improve or keep the quality the same in all cases. Maybe some parameters need adjustment. Also, would be interesting to run VAD/no-VAD with the Large V3 and see how it compares there.
Very interesting indeed! I saw something similar this week when using the Large V3 model, where without VAD it would hallucinate, but when VAD was enabled it did not and seemed to produce valid output.
I've tried running 4320211.mp3 (Monroe Inc) with Large V3 and it does not show these hallucinations:
[00:25:12.680 --> 00:25:29.990] At the midpoint of our guidance range, we expect an operating margin of approximately 10.2% interest expense to be approximately $29 million depreciation and amortization to be approximately $65 million in EBITDA to be approximately $196 million.
[00:25:29.990 --> 00:25:43.880] We expect capital expenditures to be approximately $60 million this year. This guidance reflects an effective tax rate of approximately 23.5% and is based on 34 million diluted weighted average shares outstanding.
[00:25:43.880 --> 00:25:49.160] As always, our guidance does not assume any future acquisitions or greenfield store opening.
[00:25:49.160 --> 00:25:54.690] I'll now turn the call over to brought, provide some closing remarks before we move to Q&A.
[00:25:54.690 --> 00:26:03.580] Thanks, Brian. We are making solid strides in the execution of our Monroe forward strategy, in particular, our store rebrand and reimage initiative.
And I also tried with the tiny model:
[00:25:13.160 --> 00:25:19.050] At the midpoint of our guidance range, we expect an operating margin of approximately $10.2%
[00:25:19.050 --> 00:25:25.480] interest expense to be approximately $29 million, depreciation and amortization to be approximately
[00:25:25.480 --> 00:25:33.720] $65 million and EBITDA to be approximately $196 million. We expect capital expenditures to be approximately
[00:25:33.720 --> 00:25:40.180] $60 million this year. This guidance reflects an effective tax rate of approximately 23.5%
[00:25:40.180 --> 00:25:45.890] and is based on $34 million diluted weighted average shares outstanding. As always, our guidance
[00:25:45.890 --> 00:25:51.440] does not assume any future acquisitions or greenfield store opening. I'll now turn the call over
[00:25:51.440 --> 00:25:57.560] to brought some closing remarks before we move to Q&A. Thanks Brian. We are making solid
[00:25:57.560 --> 00:26:03.960] strides in the execution of our Monroe Forward Strategy in particular our store rebrand and remaged initiative.
One thing to note is that I'm using the version of whisper.cpp from #3173, which I've been working on this week. The changes were mostly related to how VAD timestamps are aligned to the original audio timestamps, but I also changed the timestamps from floats/doubles to int64_t, and perhaps this has an impact on the audio samples that are passed to whisper_full. I need to look into this a bit further, but it would be interesting to run the benchmarks using #3173 to see if this has an impact (and also to check that I'm not missing something, it being the end of the week).
I've run the benchmarks on my mac (macOS 15.4.1 on Mac15,3 with Apple M3 and 24GB RAM) using the tiny model and also applied #3173 with the following result with VAD enabled:
(venv) $ cat tiny.txt
WER: 16.78%
And without VAD:
(venv) $ cat tiny.txt
WER: 18.70%
I'm not seeing the repeats in speech-datasets/earnings21/media/4320211.mp3.txt:
At the midpoint of our guidance range, we expect an operating
margin of approximately 10.2% interest expense to be approximately $29 million dollars depreciation
and amortization to be approximately $65 million dollars and EBITDA to be approximately $196 million.
We expect capital expenditures to be approximately $60 million this year.
This guidance reflects an effective tax rate of approximately 23.5% and is based on 34 million diluted
weighted average shares outstanding.
I'll try using the base model with VAD:
(venv) $ cat base.txt
WER: 13.40%
And without VAD:
$ cat base.txt
WER: 12.57%
Thanks @danbev. Indeed, it would be better to (re)test with Whisper Large V3.