whisper-v3 is worse than whisper-v2 when using faster-whisper

Open zyb8543d opened this issue 2 years ago • 22 comments

whisper-v3 is worse than whisper-v2 when using faster-whisper and loading the model from a local path. whisper-v3 result: [0.0 - 10.28]: 请不吝点赞 订阅 转发 打赏支持明镜与点点栏目 ("Please like, subscribe, share, and donate to support the Mingjing and Diandian programs" - completely irrelevant to the audio content), while the whisper-v2 result is normal. Tested models: https://huggingface.co/Systran/faster-whisper-large-v3; https://huggingface.co/Systran/faster-whisper-large-v2

zyb8543d avatar Nov 28 '23 06:11 zyb8543d
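The reported setup can be sketched roughly as follows. This is a minimal illustration, not code from the thread: the model directory, audio path, device, and compute type are placeholder assumptions, and the faster-whisper import is deferred so the helper can be defined without the package loaded.

```python
import os

def transcribe_local(model_dir: str, audio_path: str):
    """Load a CTranslate2 Whisper model from a local directory and transcribe one file."""
    from faster_whisper import WhisperModel  # lazy import; requires faster-whisper installed

    model = WhisperModel(model_dir, device="cpu", compute_type="int8")
    segments, _info = model.transcribe(audio_path, language="zh")
    return [(seg.start, seg.end, seg.text) for seg in segments]

if __name__ == "__main__":
    # e.g. a local clone of https://huggingface.co/Systran/faster-whisper-large-v3
    model_dir = "./faster-whisper-large-v3"
    if os.path.isdir(model_dir):
        for start, end, text in transcribe_local(model_dir, "test.wav"):
            print(f"[{start:.2f} - {end:.2f}]: {text}")
    else:
        print("model directory not found; download it first")
```

Swapping the directory for a local faster-whisper-large-v2 clone with everything else identical is the comparison the issue describes.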

Is it consistently worse across different sources? Or is it only worse on certain parts of certain audio?

hoonlight avatar Nov 28 '23 09:11 hoonlight

Is the inference different with the openai or hf implementation?

funboarder13920 avatar Nov 28 '23 09:11 funboarder13920

Is it consistently worse across different sources? Or is it only worse on certain parts of certain audio?

@archive-r @funboarder13920 With the same audio, using the interfaces officially provided by whisper and transformers, everything is normal for both whisper-v3 and whisper-v2, and whisper-v3 is better than whisper-v2.

zyb8543d avatar Nov 28 '23 11:11 zyb8543d

Do you use the large-v3 tokenizer as described by https://huggingface.co/turicas/faster-whisper-large-v3 ? Perhaps run your test using this model as well.

martinkallstrom avatar Nov 28 '23 12:11 martinkallstrom

Do you use the large-v3 tokenizer as described by https://huggingface.co/turicas/faster-whisper-large-v3 ? Perhaps run your test using this model as well.

OK, I will try it.

zyb8543d avatar Nov 28 '23 12:11 zyb8543d

whisper-v3 is worse than whisper-v2 when using faster-whisper

  1. There is no such thing as "whisper-v3" nor "whisper-v2".
  2. This doesn't make sense -> "whisper-v3 is worse than whisper-v2" then "whisper-v3 is better than whisper-v2".
  3. Model large-v3 is made by OpenAI, you should/can complain there: large-v3 release
  4. You can skip what martinkallstrom posted.
  5. Make sure that you are using faster-whisper 0.10.0 version.

Purfview avatar Nov 28 '23 12:11 Purfview
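Point 5 above can be checked programmatically. A small helper (not from the thread) using the standard library's importlib.metadata:

```python
from importlib.metadata import version, PackageNotFoundError

def installed_version(package: str):
    """Return the installed version string for a package, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None

# Confirm the faster-whisper version before testing large-v3.
v = installed_version("faster-whisper")
print(v if v is not None else "faster-whisper is not installed")
```

If this prints something older than 0.10.0, upgrade with pip before re-running the comparison.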

Sorry, I meant that the result of decoding with whisper-v3 is worse than with whisper-v2 when using the faster-whisper interface, but the result of decoding with whisper-v3 is better than with whisper-v2 when using the openai whisper and huggingface interfaces. So it seems unrelated to the model itself.

zyb8543d avatar Nov 28 '23 13:11 zyb8543d

Oh, that. Make sure that you are using the same settings. Post the parameters you run with.

Purfview avatar Nov 28 '23 13:11 Purfview

Just three parameters: model.transcribe(audiopath, beam_size=10, language="zh", vad_filter=False)

zyb8543d avatar Nov 28 '23 13:11 zyb8543d

Did you run the same for Whisper?

Purfview avatar Nov 28 '23 13:11 Purfview

Did you run the same for Whisper?

Yes

zyb8543d avatar Nov 28 '23 14:11 zyb8543d

Is it consistently worse across different sources? Or is it only worse on certain parts of certain audio?

I tested two different audio files; both whisper-v3 decoding results were the same: "请不吝点赞 订阅 转发 打赏支持明镜与点点栏目" ("Please like, subscribe, share, and donate to support the Mingjing and Diandian programs").

zyb8543d avatar Nov 28 '23 14:11 zyb8543d

And what results do you get from Whisper? Can you share the audio sample with the issue?

Purfview avatar Nov 28 '23 14:11 Purfview

Just use large-v2. zh and some other languages are heavily polluted in whisper's training data. From large-v2 to large-v3, zh got almost no improvement. Many people have found that large-v3 is worse than large-v2: https://github.com/openai/whisper/discussions/1762#discussioncomment-7532295

cxumol avatar Nov 29 '23 00:11 cxumol

And what results do you get from Whisper? Can you share the audio sample with the issue?

Whisper gets an absolutely correct result.

This kind of problem appears very randomly. Today I tested again with whisper-v3 via the faster-whisper interface: the same audio was recognized ten times and the correct result was obtained every time, with no recurrence of the situation above where the recognition result was completely wrong.

zyb8543d avatar Nov 29 '23 01:11 zyb8543d

Today I tested again with whisper-v3 via the faster-whisper interface: the same audio was recognized ten times and the correct result was obtained every time.

Then-v3 maybe you mixed-v3 up results-v3 and actually hallucination-v3 is with reference Whisper.

EDIT:

whisper-v3

There is no such thing. Stop it! 😆

Purfview avatar Nov 29 '23 01:11 Purfview

whisper-v3

Unfortunately, even OpenAI incorrectly calls it whisper v3 in their news reports. @Purfview

large-v3 does nothing good, just tons of hallucination. A third-party report can be found here.

Stay away from large-v3 or get unfortunate.

cxumol avatar Nov 29 '23 02:11 cxumol

Unfortunately, even OpenAI incorrectly call it whisper v3 in their news reports.

Ha, so that's where it's coming from; I was wondering what was going on... I guess it's some kind of new marketing trick.

large-v3 do nothing good but just tons of hallucination.

Agree, it's pretty bad. I actually added a warning to my tool if a user runs large-v3. 😆

Purfview avatar Nov 29 '23 02:11 Purfview

I mean, if OP doesn't know how to write English, please at least append the original Chinese sentences you were thinking of... That way others can paste them into ChatGPT or DeepL to understand them, and fellow Chinese users can step up and help you communicate... You are just confusing Purfview now...

escape0707 avatar Dec 03 '23 06:12 escape0707

Is it consistently worse across different sources? Or is it only worse on certain parts of certain audio?

I tested two different audio files; both whisper-v3 decoding results were the same: "请不吝点赞 订阅 转发 打赏支持明镜与点点栏目" ("Please like, subscribe, share, and donate to support the Mingjing and Diandian programs").

Excuse me! I have the same problem as you. Do you know why it occurs, or do you have any way to solve it? Thanks for your reply!

LLLYF avatar Feb 28 '24 09:02 LLLYF

I have the same issue.

And here is what I later learned from others:

  1. A highly reliable way to reproduce this issue is to feed in a segment containing no speech whatsoever while setting language=zh.
  2. The erroneous transcription segments usually occur in parts without any actual speech.
  3. Some people guess that whisper-v3 was trained on a large amount of subtitle data from YouTube videos. Many Chinese content creators add advertisements or acknowledgments as subtitles during silent moments, so in segments without speech the model randomly inserts these subtitle phrases, essentially turning every silent section into a slot for random advertisements.

testmana2 avatar Apr 07 '24 14:04 testmana2
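Step 1 above can be sketched by generating a purely silent clip in the float32 format faster-whisper accepts; the actual transcription call is left commented out since it needs the model weights (the model name and parameters below mirror the thread, but the repro itself is an assumption, not verified code):

```python
import numpy as np

SAMPLE_RATE = 16000  # Whisper models expect 16 kHz mono audio

def make_silence(seconds: float) -> np.ndarray:
    """Return a float32 buffer of pure silence."""
    return np.zeros(int(seconds * SAMPLE_RATE), dtype=np.float32)

silent_clip = make_silence(30.0)

# With a loaded model, the hallucination reportedly shows up like this:
# from faster_whisper import WhisperModel
# model = WhisperModel("large-v3")
# segments, _ = model.transcribe(silent_clip, language="zh", vad_filter=False)
# for seg in segments:
#     print(seg.text)  # may print unrelated subtitle-style phrases
```

Enabling vad_filter=True, which drops non-speech segments before decoding, is one commonly suggested mitigation for exactly this silent-segment case.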

请不吝点赞 订阅 转发 打赏支持明镜与点点栏目 ("Please like, subscribe, share, and donate to support the Mingjing and Diandian programs")
also occurs in my v3 decoding output

yanjian1978 avatar Jul 11 '24 07:07 yanjian1978