Whisper
Accuracy is worse in the large model compared to the medium model
I understand that you mostly tuned this for the medium model, but at times the large model is needed, and I think you'd be interested in the results.
I've attached the 25-second audio and the transcribed files in this zip file: President.zip
When I run the model locally on my PC, medium.en is better than large-v1, which in turn is better than large-v2 (now just named "large").
However, when I ran the file through the demo on Hugging Face, the result from large-v2 is fantastic:
So I think there's a bug in the large model? I'm running the latest version (1.10.1). I hope you can add support for the large model too. Thank you!
I think the converter used to convert the model to ggml added some inaccuracies; you could manually convert the models yourself with another pt-to-ggml converter.
Thank you. I have no idea how to do that, though.
You could convert a fine-tuned model instead; it might help, as it has more training data. (Same link as in the previous message.)
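For anyone wanting to try, the conversion itself is only a few commands. Here's a rough sketch using whisper.cpp's models/convert-pt-to-ggml.py converter; the paths and the large-v2 checkpoint name are assumptions, so adjust them to your setup:
# download the original PyTorch checkpoint (openai-whisper caches it under ~/.cache/whisper)
pip install -U openai-whisper
python -c "import whisper; whisper.load_model('large-v2')"
# the converter needs a checkout of the openai/whisper repo for its tokenizer/mel-filter assets
git clone https://github.com/openai/whisper
git clone https://github.com/ggerganov/whisper.cpp
# convert the .pt checkpoint to ggml (writes ggml-model.bin into the output directory)
mkdir -p output
python whisper.cpp/models/convert-pt-to-ggml.py ~/.cache/whisper/large-v2.pt ./whisper ./output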
For me, the large model is significantly worse than the medium and even tiny models. I never knew what the reason was, so maybe a bad conversion explains it.
I agree; it seems worse for me too. When I try stream.cpp with the large model, it can barely recognize some words.
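For context, this is roughly how the stream example gets invoked; a sketch assuming a built whisper.cpp checkout, with the step/length values just examples:
# real-time transcription with the large model: 500 ms audio steps, 5 s context window
./stream -m ./models/ggml-large.bin -t 8 --step 500 --length 5000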
I think a possible reason here is that my graphics card has 8 GB of VRAM, which is not enough to run the large model; it requires about 10 GB of VRAM. For those who also don't get good results with the large model, could you comment with how much VRAM you have?
I'm using an RTX 4080 with 16 GB of VRAM. I'm not sure of any way in Windows to isolate how much VRAM a specific application is using, but I'm definitely not running out, as even the total VRAM usage is significantly below 16 GB. Also, I would say that the large model is worse even with short audio clips, which you wouldn't think would use much VRAM, but I've never really tested that.
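If it helps, nvidia-smi ships with the NVIDIA driver on Windows too, and it can list per-process VRAM usage for compute apps; a quick sketch (note: under Windows' WDDM driver mode the per-process figure can show N/A):
# per-process VRAM usage of compute apps (run from Command Prompt or PowerShell)
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv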
Edit: Using the latest version (1.11), these are my results:
Tiny, which just says [Inaudible] over and over:
Medium, which looks pretty accurate:
And large, which, like yours, is useless and just says "[The President of the United States gives a speech.]"
Thanks for reporting that the issue still happens with 16 GB of VRAM. I have a 3070 Ti with 8 GB of VRAM.
I followed the instructions in this video to install all the individual components (Python, PyTorch, etc.) and ran Whisper via Command Prompt. I should be talking directly to the model now, and I get this memory error when running the large (but not the medium) model:
When I ran the model via the Google Colab notebook using the large (large-v2) model, the result was great (the same as in the original post above, which was produced with the Hugging Face demo).
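One way to rule out VRAM on the local run: the stock openai-whisper CLI can be forced onto the CPU. It's slow, but it sidesteps the ~10 GB requirement entirely; a sketch:
# run the large model on the CPU to take VRAM out of the equation
whisper "President.mp3" --model large --device cpu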
Hmm, interesting. It might just be a separate problem rather than the root cause of this issue, but who knows.
Maybe this bug would be better reported over on whisper.cpp so that @ggerganov can rebuild the large model?
The large model seems to have some weird properties, and someone opened an issue on ggerganov's repository:
So until I read that post @vricosti linked to, I thought ggml-large.bin and ggml-large-v1.bin were the same thing, just renamed for simplicity, as they're the exact same file size. I just re-ran the President tests, and ggml-large-v1.bin is significantly better than ggml-large.bin:
Here's ggml-large.bin, which I now understand to actually be v2, and it still just says [The President of the United States gives a speech.]:
And here's ggml-large-v1.bin, which is obviously much better than v2, and has some better and some worse transcription than the medium model:
President-ggml-large-v1-bin.txt
So, yeah, it seems like the large v2 model is fundamentally broken.
@albino1
ggml-large.bin, which I now understand to actually be v2
Yep, that's correct. large-v2 is just large now, and the old large model is now large-v1.
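For anyone comparing the two side by side, whisper.cpp's download script takes the model name directly, so you can fetch both variants explicitly (a sketch assuming a whisper.cpp checkout):
# fetch both large variants; plain "large" resolves to v2 now
bash ./models/download-ggml-model.sh large-v1
bash ./models/download-ggml-model.sh large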
So I just ran the same audio again via Google Colab and Hugging Face (large-v1 is here and large-v2 is here). The code for the Colab notebook is just 3 lines:
!pip install git+https://github.com/openai/whisper.git
!sudo apt update && sudo apt install ffmpeg
!whisper "President.mp3" --model large # (or large-v1)
The results are consistent: large-v2 is better than large-v1, and the transcripts are the same whether running the model via Google Colab or on Hugging Face. The difference between the two models is italicized.
Large-v2: As President of the United States, I rejoice at the first Pan-American Congress in Washington, D.C. I believe that with God's help, our two countries shall continue to live side by side in peace and prosperity.
Large-v1: As President of the United States, I would like to assert an American Congress in Washington, D.C. I believe that with God's help, our two countries shall continue to live side by side in peace and prosperity.
It looks to me like the online versions are correct and there's something wrong with the offline models that we downloaded. Even comparing the result of the online large-v1 above with the result of the offline large-v1 ("President-ggml-large-v1-bin.txt", posted by albino1), the transcripts aren't the same (while the transcripts are the same whether using Google Colab or Hugging Face).
I'm tempted to post in the thread by ggerganov, but I can't run the offline large model properly because I don't have the hardware. It'd be great if someone could post a reply presenting the difference in transcripts between the online and offline versions for this audio sample, which really shows a stark difference between the large model and the rest, and between large-v1 and large-v2. Thanks a lot!
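One quick way to rule out a corrupted download (as opposed to a bad conversion): whisper.cpp's models/README.md lists a SHA for each ggml model, so you could checksum the local files against it; a sketch, assuming that table is current:
# compare against the SHA values listed in whisper.cpp's models/README.md
sha1sum ggml-large.bin ggml-large-v1.bin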
The large model (large-v2) works great with a 3090 using @Const-me's implementation - it's actually even more accurate than OpenAI's Whisper API!
I also ran some tests using local Whisper with CUDA and PyTorch, and @Const-me's implementation is about 40% faster - again, on the large model.