Add ability to limit auto-detection to a subset of languages

Open sindresorhus opened this issue 10 months ago • 3 comments

From what I can tell, auto-detection simply picks the language with the highest probability score. Sometimes, I know the language can only be one of, say, 5 possible languages. In such cases, it would be useful to be able to specify the possible languages, to improve the likelihood of the auto-detection picking the correct language.

sindresorhus avatar Sep 02 '23 19:09 sindresorhus

You can use the whisper_pcm_to_mel() + whisper_lang_auto_detect() API. You will get the probs for all languages in the lang_probs array:

https://github.com/ggerganov/whisper.cpp/blob/59a3d0cb576db605f76f82f07350647837e15c7a/whisper.h#L244-L255
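
A minimal sketch of how the restriction could be applied once lang_probs has been filled. The helper name pick_allowed_language is hypothetical (it is not part of whisper.cpp); it only assumes that lang_probs is indexed by language id, as described in the linked header.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper, not part of whisper.cpp.
// Returns the id from `allowed_ids` that has the highest probability in
// `lang_probs`, or -1 if no allowed id is a valid index into `lang_probs`.
int pick_allowed_language(const std::vector<float> & lang_probs,
                          const std::vector<int>   & allowed_ids) {
    int   best_id   = -1;
    float best_prob = -1.0f;
    for (int id : allowed_ids) {
        // Skip ids that are out of range (e.g. an unknown language code).
        if (id < 0 || static_cast<std::size_t>(id) >= lang_probs.size()) {
            continue;
        }
        if (lang_probs[id] > best_prob) {
            best_prob = lang_probs[id];
            best_id   = id;
        }
    }
    return best_id;
}

// Assumed usage with the whisper.h API (ctx, pcm and n_threads come from the caller):
//
//   std::vector<float> probs(whisper_lang_max_id() + 1, 0.0f);
//   whisper_pcm_to_mel(ctx, pcm.data(), (int) pcm.size(), n_threads);
//   whisper_lang_auto_detect(ctx, 0, n_threads, probs.data());
//   int id = pick_allowed_language(probs, { whisper_lang_id("pt"), whisper_lang_id("es") });
//   if (id >= 0) { params.language = whisper_lang_str(id); }
```

The return values of whisper_pcm_to_mel() and whisper_lang_auto_detect() should also be checked in real code; that is left out here to keep the sketch short.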

ggerganov avatar Sep 05 '23 12:09 ggerganov

Thanks for the hint. That definitely works. It would still be nice to have a single param for this though.

sindresorhus avatar Sep 05 '23 16:09 sindresorhus

whisper.cpp_limit_language_autodetection_patch.diff.gz

Here's a little patch you can try. This will extend the "auto" parameter in the main example so that you can give it a list of allowed languages. So instead of -l auto you would use something like -l auto:pt,es,sv,en

Please note that although this seems to be working, I won't be making a PR out of it. But feel free to use the code as you wish.
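
For readers who do not want to unpack the attachment, here is a rough sketch of how a value such as auto:pt,es,sv,en could be split into language codes. This is not taken from the patch; the function name parse_auto_language_list is hypothetical, and the only assumption is the "auto:" prefix followed by a comma-separated list, as in the example above.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical parser, not the code from the attached patch.
// "auto:pt,es,sv,en" yields {"pt", "es", "sv", "en"}.
// Any value without the "auto:" prefix (including plain "auto") yields an empty list.
std::vector<std::string> parse_auto_language_list(const std::string & arg) {
    const std::string prefix = "auto:";
    std::vector<std::string> langs;
    if (arg.compare(0, prefix.size(), prefix) != 0) {
        return langs;
    }
    std::stringstream ss(arg.substr(prefix.size()));
    std::string item;
    while (std::getline(ss, item, ',')) {
        if (!item.empty()) {   // ignore empty entries such as "auto:pt,,es"
            langs.push_back(item);
        }
    }
    return langs;
}
```

Each returned code could then be mapped to an id with whisper_lang_id() and compared against the detected probabilities.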

misutoneko avatar Sep 06 '23 15:09 misutoneko