
Inference results differ from original Whisper with GPU, even when using the same model

Open ssteo opened this issue 3 years ago • 4 comments

Is there any parameter that needs to be added into the implementation like in https://github.com/openai/whisper/tree/main/whisper/assets/multilingual ?

I've tested all models and found the inference results differ from those of the original Whisper running on GPU. I'm wondering whether something is missing in my setup, or whether there is some difference in this project's implementation?

ssteo avatar Dec 10 '22 19:12 ssteo

I don't know this in detail, but it's a different implementation. I found this bit in the original announcement:

Just a note that the whisper.cpp implementation currently only supports the greedy sampling strategy, so to make a fair comparison with PyTorch, you would need to disable the beam search when running it.

(That's from October, though, so I'm not sure if it still applies; things move fast.) The original Whisper itself gives you different results depending on options (beam size etc.), and apparently there is also a possibility of nondeterminism in play.

misutoneko avatar Dec 11 '22 00:12 misutoneko

I also found differences in WER between the large models of the PyTorch implementation and whisper.cpp. whisper.cpp got a worse WER score in my tests on the large model (e.g. 12% vs 18% WER). Is there any way to bring whisper.cpp to the same level of accuracy via settings? Naive question, but I've only recently started learning.

RYucel avatar Dec 11 '22 17:12 RYucel

The decoding strategy in whisper.cpp is not exactly the same as the one in the original OpenAI repo. Differences are to be expected, and whisper.cpp is likely inferior atm. In any case, if you want to make a fair comparison between the two, make sure to run the PyTorch version with the Greedy decoder, as explained in the README.

@RYucel Can you give a tutorial for computing WER? Are you running the PyTorch implementation with the Greedy decoder?

ggerganov avatar Dec 11 '22 18:12 ggerganov
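For reference on the WER question above: WER is the word-level edit distance (substitutions, insertions, deletions) between the reference and hypothesis transcripts, divided by the number of reference words. A minimal pure-Python sketch (tools like `jiwer` additionally normalize case and punctuation before scoring):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Single-row dynamic-programming edit distance over words.
    d = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        prev, d[0] = d[0], i  # prev holds the diagonal cell d[i-1][j-1]
        for j in range(1, len(hyp) + 1):
            tmp = d[j]
            d[j] = min(
                d[j] + 1,                                # deletion
                d[j - 1] + 1,                            # insertion
                prev + (ref[i - 1] != hyp[j - 1]),       # substitution / match
            )
            prev = tmp
    return d[len(hyp)] / len(ref)
```

To compare the two implementations on the same audio, run both, then score each transcript against the same reference text.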

I've encountered this as well with the whisper commandline vs. using whisper from a python script (both have different defaults), see here:

https://github.com/openai/whisper/discussions/591

The default parameters that the python whisper command line tool uses are:

```python
result = model.transcribe(
    "audio.mp3",
    language=language,
    task="transcribe",
    temperature=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
    best_of=5,
    beam_size=5,
    suppress_tokens="-1",
    condition_on_previous_text=True,
    fp16=True,
    compression_ratio_threshold=2.4,
    logprob_threshold=-1.0,
    no_speech_threshold=0.6,
)
```

The biggest differences are that the Python Whisper decoder does beam search, conditions each segment on the preceding ones, and backs off to a higher temperature when the compression ratio signals likely faulty output (see the example in the whisper discussion link). whisper.cpp already mentions it doesn't do beam search; my guess is it doesn't do any of the other stuff either.
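The temperature back-off mentioned above can be sketched roughly like this: decode at increasing temperatures and accept the first result whose zlib compression ratio does not signal degenerate repetition. This is a simplified sketch; the real OpenAI decoder also applies `logprob_threshold` and `no_speech_threshold` checks before accepting a result.

```python
import zlib


def compression_ratio(text: str) -> float:
    """Highly repetitive (likely faulty) output compresses very well."""
    data = text.encode("utf-8")
    return len(data) / len(zlib.compress(data))


def transcribe_with_fallback(decode,
                             temperatures=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
                             compression_ratio_threshold=2.4):
    """decode(t) is a stand-in for one decoding pass at temperature t."""
    text = ""
    for t in temperatures:
        text = decode(t)
        if compression_ratio(text) <= compression_ratio_threshold:
            return text, t  # accept: output doesn't look degenerate
    return text, temperatures[-1]  # all temperatures flagged; return last attempt
```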

You can also check whether the outputs become more similar if you set best_of=1, beam_size=1 (or best_of=None, beam_size=None), essentially making Python Whisper do greedy decoding too.
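To see why greedy and beam-search decoding can produce different transcripts, here is a toy sketch over hypothetical next-token probabilities (not Whisper's actual model): greedy commits to the locally best first token, while beam search can keep a weaker prefix alive and find a higher-probability sequence overall.

```python
# Hypothetical next-token probabilities, keyed by the decoded prefix.
STEP_PROBS = {
    (): {"a": 0.5, "b": 0.4, "c": 0.1},
    ("a",): {"x": 0.3, "y": 0.3, "z": 0.4},
    ("b",): {"x": 0.9, "y": 0.05, "z": 0.05},
    ("c",): {"x": 0.4, "y": 0.3, "z": 0.3},
}


def greedy(steps=2):
    """Pick the single most probable token at each step."""
    seq, p = (), 1.0
    for _ in range(steps):
        tok, tp = max(STEP_PROBS[seq].items(), key=lambda kv: kv[1])
        seq, p = seq + (tok,), p * tp
    return seq, p


def beam_search(width=2, steps=2):
    """Keep the `width` most probable prefixes at each step."""
    beams = [((), 1.0)]
    for _ in range(steps):
        candidates = [(s + (t,), p * tp)
                      for s, p in beams
                      for t, tp in STEP_PROBS[s].items()]
        beams = sorted(candidates, key=lambda kv: kv[1], reverse=True)[:width]
    return beams[0]
```

Here greedy chooses prefix "a" (0.5) and ends up with a lower-probability sequence than beam search, which keeps "b" alive and finds "b x" (0.4 × 0.9 = 0.36 vs. 0.5 × 0.4 = 0.2). This is the structural reason the two decoders can disagree even with identical model weights.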

bmilde avatar Jan 03 '23 16:01 bmilde

With the latest version the whisper.cpp results should be better and hopefully closer to the Python implementation.

By default, the main example corresponds to:

  • temperature=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0)
  • best_of=5
  • beam_size=None
  • suppress_tokens="-1"
  • condition_on_previous_text=True
  • fp16=True
  • compression_ratio_threshold=2.4
  • logprob_threshold=-1.

You can enable beam search via `--beam_size 5`; it is disabled by default.

ggerganov avatar Jan 15 '23 14:01 ggerganov

Hi @ggerganov You have done something phenomenal with this work! Sorry to comment on a closed issue but I was wondering if there is any switch to set --condition_on_previous_text to False?

crisdosaygo avatar Feb 10 '23 03:02 crisdosaygo

@crisdosaygo Passing `--max-context 0` to main should be equivalent to `--condition_on_previous_text False`

ggerganov avatar Feb 14 '23 17:02 ggerganov

Thank you, sir!

crisdosaygo avatar Feb 15 '23 01:02 crisdosaygo