whisper.cpp
Output gets corrupted when a quantized finetuned model is used with CUDA
I was testing a quantized Whisper Medium model fine-tuned for Portuguese when I noticed the results were odd.
!!Estamos aqui para pedir emprestada!!
output_txt: saving output to 'medium_q8_0/common_voice_pt_19273358.wav.txt'
!! Graças a Deus você está aqui!
output_txt: saving output to 'medium_q8_0/common_voice_pt_19273359.wav.txt'
!P!recisamos nos apressar!
output_txt: saving output to 'medium_q8_0/common_voice_pt_19273360.wav.txt'
!A necessidade! é pai! na inovação!
output_txt: saving output to 'medium_q8_0/common_voice_pt_19273362.wav.txt'
!Você poderia ter mor!! depois! que a paz! fosse declarada
output_txt: saving output to 'medium_q8_0/common_voice_pt_19275111.wav.txt'
The transcription seems to get corrupted for some reason. With the CPU the output is normal, but with the GPU it is corrupted. Using Q4_0 or Q5_0 results in corruption as well.
I also tried another model, a quantized Whisper Small also fine-tuned for Portuguese, and its output was corrupted too.
The original model doesn't produce any corruption, and quantized versions of the standard Whisper models don't produce corruption either.
I quantized these models myself, so I know they are up to date with the current version of whisper.cpp.
In summary:
- CPU is normal for any version of the model;
- GPU is normal for the original models;
- GPU is normal for the standard models, even when quantized;
- GPU output is corrupted when using quantized fine-tuned models.
I'm using an RTX 3060 Mobile (6 GB VRAM) with CUDA 11.5 on Ubuntu 22.04.4.
I ran into the same kind of issue with fine-tuned French models, but there it also occurred with the non-quantized models, and with both GPU and CPU inference. With long audio files, the first chunks (between 3 and 5 minutes) are transcribed correctly, but at some point the output switches to English (the transcription is somehow still correct, just not in the right language) and sometimes degenerates into nonsense, repeating special tokens, etc. It may also produce a single French chunk before generating garbage again.
I observed this with all three fine-tuned models that I converted.
I haven't found the cause yet, but (at least in my case) it must come from the convert-h5-to-ggml.py script, which I haven't looked into yet.
When I tried using a pre-converted finetuned model, it worked without any issue.
Do your models have the same number of tokens as the default Whisper models? I faced a similar issue, and it was solved by changing the token counts hardcoded in the source code. That worked at release 1.5.4; with the latest release, however, it no longer seems to work.
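One quick way to check for a token-count mismatch is to read `n_vocab` directly from the converted model file. This is only a sketch: it assumes (based on convert-h5-to-ggml.py) that the file begins with the GGML magic `0x67676d6c` followed immediately by `n_vocab` as a little-endian int32, and the 51864/51865 reference values are the English-only/multilingual sizes that whisper.cpp expects by default.

```python
import struct

GGML_MAGIC = 0x67676D6C  # b"lmgg" read as a little-endian int32

def read_n_vocab(path):
    """Read n_vocab from a converted whisper.cpp GGML model file.

    Assumes the layout written by convert-h5-to-ggml.py: a 4-byte magic
    followed by the hparams, of which n_vocab is the first int32.
    """
    with open(path, "rb") as f:
        magic, n_vocab = struct.unpack("<ii", f.read(8))
    if magic != GGML_MAGIC:
        raise ValueError(f"not a GGML whisper model (magic={magic:#x})")
    return n_vocab

# Vocab sizes of the stock Whisper models; a fine-tune with added tokens
# will report a larger value here and may trip hardcoded checks.
EXPECTED = {51864: "english-only", 51865: "multilingual"}
```

Running this on both the base model and the fine-tuned conversion and comparing the two numbers should show whether the fine-tune added tokens beyond what the hardcoded counts assume.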