whisper.cpp
Regression in accuracy
I am opening a separate issue for the problems I described in #354. Since commit 385236d1d3d7a0228f5279657938ae5f1313ca94, I have seen a severe regression in WER on noisy audio, of around 10-20%. I am attaching a noisy German audio clip with which I can reproduce this.
https://user-images.githubusercontent.com/23424198/212912717-27d4a6fa-3f34-4113-9877-dd355555fefe.mp4
The command I use, both on the master branch and on the commit referenced above, is:
./bin/stream -l de -m ./gglm-small.bin -kc -ac 512 -t 4 --step 1500 --length 10000
The expected transcription is:
Wir wollen mehr Demokratie wagen. Wir werden unsere Arbeitsweise öffnen und dem kritischen Bedürfnis nach Information Genüge tun.
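For anyone who wants to quantify the regression, here is a minimal sketch of how WER against the expected transcription could be computed. It is not part of whisper.cpp: the function name and the normalization (lowercasing, stripping ASCII punctuation) are assumptions, and other WER tools may normalize differently and so give slightly different numbers.

```python
import string

# Reference text taken from the expected transcription above.
REFERENCE = (
    "Wir wollen mehr Demokratie wagen. Wir werden unsere Arbeitsweise "
    "öffnen und dem kritischen Bedürfnis nach Information Genüge tun."
)


def normalize(text: str) -> list[str]:
    """Lowercase, drop ASCII punctuation, split on whitespace (assumed normalization)."""
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = normalize(reference)
    hyp = normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the ref words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = cur[j - 1] + 1  # extra word in hypothesis
            cur.append(min(substitution, deletion, insertion))
        prev = cur
    return prev[-1] / len(ref)
```

Calling `word_error_rate(REFERENCE, output)` on the text produced by each build gives two numbers that can be compared directly. With streaming output, the overlapping chunks would have to be merged into a single transcript first.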
As noted in the previous issue, I suspect that the main problem is neither the temperature nor keep-context. My bet would be on either the loss of precision from the 32-bit to 16-bit float conversions or a bug related to them, since either could directly hurt noise robustness (and possibly the overall quality of the tiny models) without causing problems for high-SNR data and bigger models.
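To give a sense of the magnitudes involved, here is a small sketch of how much a single value changes on a 32-bit to 16-bit round trip. It uses Python's `struct` half-precision format and is not whisper.cpp's actual conversion code; the helper names are hypothetical. It only illustrates the rounding behaviour of IEEE 754 half precision and does not show that this is the cause of the regression.

```python
import struct


def fp16_round_trip(x: float) -> float:
    """Round x to IEEE 754 half precision and back (illustration only)."""
    # The "e" format is binary16; packing raises OverflowError above ~65504.
    return struct.unpack("<e", struct.pack("<e", x))[0]


def relative_error(x: float) -> float:
    """Relative error introduced by one half-precision round trip of x."""
    if x == 0.0:
        return 0.0
    return abs(fp16_round_trip(x) - x) / abs(x)


if __name__ == "__main__":
    # Arbitrary sample values chosen for illustration, not taken from any model.
    for value in (1.0, 0.1, 3.14159, 1e-3, 1e-8):
        print(f"{value!r}: round trip {fp16_round_trip(value)!r}, "
              f"relative error {relative_error(value):.3e}")
```

For values in the normal half-precision range the relative error of a single rounding is bounded by 2^-11, while magnitudes far below the smallest subnormal (about 6e-8) round to zero. Whether errors of this size accumulate enough to affect decoding on noisy input is exactly the open question here.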