whisper-v3 is worse than whisper-v2 when using faster-whisper
whisper-v3 is worse than whisper-v2 when using faster-whisper with the model loaded from a local path. whisper-v3 result: `[0.0 - 10.28]: 请不吝点赞 订阅 转发 打赏支持明镜与点点栏目` (roughly "please like, subscribe, share, and donate to support the Mingjing and Dian Dian programs"; completely irrelevant to the audio content), while the whisper-v2 result is normal. Tested models: https://huggingface.co/Systran/faster-whisper-large-v3 and https://huggingface.co/Systran/faster-whisper-large-v2
Is it consistently worse across different sources? Or is it only worse on certain parts of certain audio?
Is the inference different with the openai or hf implementation?
> Is it consistently worse across different sources? Or is it only worse on certain parts of certain audio?
@archive-r @funboarder13920 For the same audio, using the interfaces officially provided by whisper and transformers, everything is normal for both whisper-v3 and whisper-v2, and whisper-v3 is better than whisper-v2.
Do you use the large-v3 tokenizer as described at https://huggingface.co/turicas/faster-whisper-large-v3? Perhaps run your test using this model as well.
> Do you use the large-v3 tokenizer as described by https://huggingface.co/turicas/faster-whisper-large-v3 ? Perhaps run your test using this model as well.
Ok, I will try it.
> whisper-v3 is worse than whisper-v2 when using faster-whisper
- There is no such thing as "whisper-v3" nor "whisper-v2".
- This doesn't make sense -> "whisper-v3 is worse than whisper-v2" then "whisper-v3 is better than whisper-v2".
- Model large-v3 is made by OpenAI; you should/can complain there: large-v3 release. You can skip what martinkallstrom posted.
- Make sure that you are using faster-whisper 0.10.0 version.
Sorry, I meant that the decoding result of whisper-v3 is worse than that of whisper-v2 when using the faster-whisper interface. But the decoding result of whisper-v3 is better than that of whisper-v2 with the openai whisper interface and the huggingface interface. It seems unrelated to the model.
Oh that. Make sure that you are using the same settings. Post what parameters you run.
Just 3 parameters: `model.transcribe(audiopath, beam_size=10, language="zh", vad_filter=False)`
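For reference, a minimal sketch of running both converted models with identical decoding settings via faster-whisper, so the outputs are directly comparable (assumes faster-whisper >= 0.10.0; model identifiers, device, and audio path are placeholders):

```python
# Same three decoding parameters as in this thread, shared by both runs.
PARAMS = dict(beam_size=10, language="zh", vad_filter=False)

def transcribe(model_path: str, audio_path: str):
    # Lazy import so the helper can be defined without faster-whisper present.
    from faster_whisper import WhisperModel

    model = WhisperModel(model_path, device="cuda", compute_type="float16")
    segments, info = model.transcribe(audio_path, **PARAMS)
    # Materialize the lazy segment generator into (start, end, text) tuples.
    return [(s.start, s.end, s.text) for s in segments]

# Example usage (placeholder paths):
# v2 = transcribe("Systran/faster-whisper-large-v2", "sample.wav")
# v3 = transcribe("Systran/faster-whisper-large-v3", "sample.wav")
```

Keeping the parameters in one shared dict rules out an accidental settings mismatch between the v2 and v3 runs.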
Did you run same for Whisper?
> Did you run same for Whisper?
Yes
> Is it consistently worse across different sources? Or is it only worse on certain parts of certain audio?
I tested two different audio files, and the whisper-v3 decoding results are the same for both: "请不吝点赞 订阅 转发 打赏支持明镜与点点栏目".
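Since the bogus text is identical across unrelated files, a quick stdlib heuristic (my own, not part of faster-whisper) can flag transcripts that contain this known boilerplate string or repeat a segment verbatim:

```python
import collections

# Substrings of the boilerplate string reported in this thread.
BOILERPLATE = ("请不吝点赞", "打赏支持")

def flag_suspects(segment_texts):
    """Return segment texts that contain known boilerplate or that repeat
    verbatim across the transcript (both are common hallucination patterns)."""
    counts = collections.Counter(t.strip() for t in segment_texts)
    return [
        t.strip()
        for t in segment_texts
        if counts[t.strip()] > 1 or any(b in t for b in BOILERPLATE)
    ]
```

A check like this makes it easy to count how often the hallucination shows up across repeated runs, which matters given how randomly the issue appears.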
And what results do you get from Whisper? Can you share the audio sample with the issue?
Just use large-v2.
zh and some other languages are heavily polluted in whisper's training data.
From large-v2 to large-v3, zh got almost no improvement.
Many people have found that large-v3 is worse than large-v2: https://github.com/openai/whisper/discussions/1762#discussioncomment-7532295
> And what results do you get from Whisper? Can you share the audio sample with the issue?

Whisper gets an absolutely correct result.
This kind of problem appears very randomly. Today I tested whisper-v3 again with the faster-whisper interface: the same audio was recognized ten times, and the correct result was obtained every time. The situation above, where the recognition result was completely wrong, did not occur.
> Today, I did test again for whisper-v3 with the fast-whisper interface. For the same audio, it was recognized ten times and the correct result was obtained every time.
Then maybe you mixed up the results, and the hallucination is actually with the reference Whisper.
EDIT:
> whisper-v3
There is no such thing. Stop it! 😆
> whisper-v3
Unfortunately, even OpenAI incorrectly calls it whisper v3 in their news reports. @Purfview
large-v3 does nothing good, just tons of hallucination. A third-party report can be found here.
Stay away from large-v3 or get unfortunate.
> Unfortunately, even OpenAI incorrectly call it whisper v3 in their news reports.
Ha, so that's where it's coming from; I was wondering what was going on... I guess it's some kind of new marketing trick.
> large-v3 do nothing good but just tons of hallucination.
Agree, it's pretty bad. I actually added a warning to my tool for when a user runs large-v3. 😆
I mean, if the OP doesn't know how to write English, please at least append the original Chinese sentences you were thinking of... At least others could copy those into ChatGPT or DeepL to understand them, and fellow Chinese users could step up and help you communicate... You are just confusing Purfview now...
> Is it consistently worse across different sources? Or is it only worse on certain parts of certain audio?
>
> i test two different audio, both decoding result for whisper-v3 are same: "请不吝点赞 订阅 转发 打赏支持明镜与点点栏目".
Excuse me! I have the same problem as you. Do you know why this problem occurs, or do you have any way to solve it? Thanks for your reply!
I have the same question.
And later I learned from others:
- A highly probable way to reproduce this issue is to use a segment with no speech whatsoever, with the setting `language=zh`.
- The erroneous transcription segments often occur in parts without any actual speech.
- Some people guess that whisper-v3's training data must have included a large amount of audio subtitles from YouTube videos. Many Chinese content creators add advertisements or acknowledgments as subtitles during silent moments. Thus, in segments without sound, the model randomly inserts these subtitle contents, essentially turning silent sections into a platform for random advertisements.
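Following the first point above, one way to attempt a reproduction is to feed the model pure silence. The snippet below (stdlib only; the file name is a placeholder) writes a few seconds of 16 kHz mono silence that can then be transcribed with `language="zh"`:

```python
import wave

def write_silence(path: str, seconds: float = 5.0, rate: int = 16000) -> str:
    """Write `seconds` of 16-bit mono PCM silence at `rate` Hz (Whisper's
    expected sample rate) and return the path."""
    with wave.open(path, "wb") as w:
        w.setnchannels(1)   # mono
        w.setsampwidth(2)   # 16-bit samples
        w.setframerate(rate)
        w.writeframes(b"\x00\x00" * int(seconds * rate))
    return path

# write_silence("silence.wav")  # then transcribe with language="zh"
```

If the boilerplate string shows up on pure silence, that would support the silent-segment explanation given above.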
> 请不吝点赞 订阅 转发 打赏支持明镜与点点栏目

also occurs in my decoding for v3.