whisper-v3 is worse than whisper-v2 when using faster-whisper

Open zyb8543d opened this issue 2 years ago • 22 comments

whisper-v3 is worse than whisper-v2 when using faster-whisper and loading the model from a local path. whisper-v3 result: [0.0 - 10.28]: 请不吝点赞 订阅 转发 打赏支持明镜与点点栏目 ("Please like, subscribe, share, and donate to support the Mingjing and Diandian programs" - completely irrelevant to the audio content), while the whisper-v2 result is normal. Tested models: https://huggingface.co/Systran/faster-whisper-large-v3; https://huggingface.co/Systran/faster-whisper-large-v2

zyb8543d avatar Nov 28 '23 06:11 zyb8543d
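The reported setup can be sketched roughly as follows. This is a minimal illustration, not code from the thread: the model directory, audio path, device, and compute type are placeholder assumptions, and the faster-whisper import is deferred so the helper can be defined without the package loaded.

```python
import os

def transcribe_local(model_dir: str, audio_path: str):
    """Load a CTranslate2 Whisper model from a local directory and transcribe one file."""
    from faster_whisper import WhisperModel  # lazy import; requires faster-whisper installed

    model = WhisperModel(model_dir, device="cpu", compute_type="int8")
    segments, _info = model.transcribe(audio_path, language="zh")
    return [(seg.start, seg.end, seg.text) for seg in segments]

if __name__ == "__main__":
    # e.g. a local clone of https://huggingface.co/Systran/faster-whisper-large-v3
    model_dir = "./faster-whisper-large-v3"
    if os.path.isdir(model_dir):
        for start, end, text in transcribe_local(model_dir, "test.wav"):
            print(f"[{start:.2f} - {end:.2f}]: {text}")
    else:
        print("model directory not found; download it first")
```

Swapping the directory for a local faster-whisper-large-v2 clone with everything else identical is the comparison the issue describes.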

Is it consistently worse across different sources? Or is it only worse on certain parts of certain audio?

hoonlight avatar Nov 28 '23 09:11 hoonlight

Is the inference different with the openai or hf implementation?

funboarder13920 avatar Nov 28 '23 09:11 funboarder13920

Is it consistently worse across different sources? Or is it only worse on certain parts of certain audio?

@archive-r @funboarder13920 With the same audio, using the interfaces officially provided by whisper and transformers, everything is normal for both whisper-v3 and whisper-v2, and whisper-v3 is better than whisper-v2.

zyb8543d avatar Nov 28 '23 11:11 zyb8543d

Do you use the large-v3 tokenizer as described by https://huggingface.co/turicas/faster-whisper-large-v3 ? Perhaps run your test using this model as well.

martinkallstrom avatar Nov 28 '23 12:11 martinkallstrom

Do you use the large-v3 tokenizer as described by https://huggingface.co/turicas/faster-whisper-large-v3 ? Perhaps run your test using this model as well.

OK, I will try it.

zyb8543d avatar Nov 28 '23 12:11 zyb8543d

whisper-v3 is worse than whisper-v2 when using faster-whisper

  1. There is no such thing as "whisper-v3" nor "whisper-v2".
  2. This doesn't make sense -> "whisper-v3 is worse than whisper-v2" then "whisper-v3 is better than whisper-v2".
  3. Model large-v3 is made by OpenAI, you should/can complain there: large-v3 release
  4. You can skip what martinkallstrom posted.
  5. Make sure that you are using faster-whisper 0.10.0 version.

Purfview avatar Nov 28 '23 12:11 Purfview
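Point 5 above can be checked programmatically. A small helper (not from the thread) using the standard library's importlib.metadata:

```python
from importlib.metadata import version, PackageNotFoundError

def installed_version(package: str):
    """Return the installed version string for a package, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None

# Confirm the faster-whisper version before testing large-v3.
v = installed_version("faster-whisper")
print(v if v is not None else "faster-whisper is not installed")
```

If this prints something older than 0.10.0, upgrade with pip before re-running the comparison.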

Sorry, I meant that the result of decoding with whisper-v3 is worse than with whisper-v2 when using the faster-whisper interface, but the result of decoding with whisper-v3 is better than with whisper-v2 when using the openai whisper and huggingface interfaces. So it seems unrelated to the model itself.

zyb8543d avatar Nov 28 '23 13:11 zyb8543d

Oh, that. Make sure that you are using the same settings. Post the parameters you run with.

Purfview avatar Nov 28 '23 13:11 Purfview

Just three parameters: model.transcribe(audiopath, beam_size=10, language="zh", vad_filter=False)

zyb8543d avatar Nov 28 '23 13:11 zyb8543d

Did you run the same for Whisper?

Purfview avatar Nov 28 '23 13:11 Purfview

Did you run the same for Whisper?

Yes

zyb8543d avatar Nov 28 '23 14:11 zyb8543d

Is it consistently worse across different sources? Or is it only worse on certain parts of certain audio?

I tested two different audio files; both whisper-v3 decoding results were the same: "请不吝点赞 订阅 转发 打赏支持明镜与点点栏目" ("Please like, subscribe, share, and donate to support the Mingjing and Diandian programs").

zyb8543d avatar Nov 28 '23 14:11 zyb8543d

And what results do you get from Whisper? Can you share the audio sample with the issue?

Purfview avatar Nov 28 '23 14:11 Purfview

Just use large-v2. zh and some other languages are heavily polluted in whisper's training data. From large-v2 to large-v3, zh got almost no improvement. Many people have found that large-v3 is worse than large-v2: https://github.com/openai/whisper/discussions/1762#discussioncomment-7532295

cxumol avatar Nov 29 '23 00:11 cxumol

And what results do you get from Whisper? Can you share the audio sample with the issue?

Whisper gets an absolutely correct result.

This kind of problem appears very randomly. Today I tested again with whisper-v3 via the faster-whisper interface: the same audio was recognized ten times and the correct result was obtained every time, with no recurrence of the situation above where the recognition result was completely wrong.

zyb8543d avatar Nov 29 '23 01:11 zyb8543d

Today I tested again with whisper-v3 via the faster-whisper interface: the same audio was recognized ten times and the correct result was obtained every time.

Then-v3 maybe you mixed-v3 up results-v3 and actually hallucination-v3 is with reference Whisper.

EDIT:

whisper-v3

There is no such thing. Stop it! 😆

Purfview avatar Nov 29 '23 01:11 Purfview

whisper-v3

Unfortunately, even OpenAI incorrectly calls it whisper v3 in their news reports. @Purfview

large-v3 does nothing good, just tons of hallucination. A third-party report can be found here.

Stay away from large-v3 or get unfortunate.

cxumol avatar Nov 29 '23 02:11 cxumol

Unfortunately, even OpenAI incorrectly call it whisper v3 in their news reports.

Ha, so that's where it's coming from; I was wondering what was going on... I guess it's some kind of new marketing trick.

large-v3 do nothing good but just tons of hallucination.

Agree, it's pretty bad. I actually added a warning to my tool if a user runs large-v3. 😆

Purfview avatar Nov 29 '23 02:11 Purfview

I mean, if OP doesn't know how to write English, please at least append the original Chinese sentences you were thinking of... That way others can paste them into ChatGPT or DeepL to understand them, and fellow Chinese users can step up and help you communicate... You are just confusing Purfview now...

escape0707 avatar Dec 03 '23 06:12 escape0707

Is it consistently worse across different sources? Or is it only worse on certain parts of certain audio?

I tested two different audio files; both whisper-v3 decoding results were the same: "请不吝点赞 订阅 转发 打赏支持明镜与点点栏目" ("Please like, subscribe, share, and donate to support the Mingjing and Diandian programs").

Excuse me! I have the same problem as you. Do you know why it occurs, or do you have any way to solve it? Thanks for your reply!

LLLYF avatar Feb 28 '24 09:02 LLLYF

I have the same issue.

And here is what I later learned from others:

  1. A highly reliable way to reproduce this issue is to feed in a segment containing no speech whatsoever while setting language=zh.
  2. The erroneous transcription segments usually occur in parts without any actual speech.
  3. Some people guess that whisper-v3 was trained on a large amount of subtitle data from YouTube videos. Many Chinese content creators add advertisements or acknowledgments as subtitles during silent moments, so in segments without speech the model randomly inserts these subtitle phrases, essentially turning every silent section into a slot for random advertisements.

testmana2 avatar Apr 07 '24 14:04 testmana2
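Step 1 above can be sketched by generating a purely silent clip in the float32 format faster-whisper accepts; the actual transcription call is left commented out since it needs the model weights (the model name and parameters below mirror the thread, but the repro itself is an assumption, not verified code):

```python
import numpy as np

SAMPLE_RATE = 16000  # Whisper models expect 16 kHz mono audio

def make_silence(seconds: float) -> np.ndarray:
    """Return a float32 buffer of pure silence."""
    return np.zeros(int(seconds * SAMPLE_RATE), dtype=np.float32)

silent_clip = make_silence(30.0)

# With a loaded model, the hallucination reportedly shows up like this:
# from faster_whisper import WhisperModel
# model = WhisperModel("large-v3")
# segments, _ = model.transcribe(silent_clip, language="zh", vad_filter=False)
# for seg in segments:
#     print(seg.text)  # may print unrelated subtitle-style phrases
```

Enabling vad_filter=True, which drops non-speech segments before decoding, is one commonly suggested mitigation for exactly this silent-segment case.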

请不吝点赞 订阅 转发 打赏支持明镜与点点栏目 ("Please like, subscribe, share, and donate to support the Mingjing and Diandian programs")
also occurs in my v3 decoding output

yanjian1978 avatar Jul 11 '24 07:07 yanjian1978