whisper.cpp language/translate doesn't work for mixed-language audio


When I give an audio file with mixed-language content (e.g. English and Japanese) as an input, I can't seem to get the transcript in both languages as they were spoken.

-l en (no --translate) transcribes English in English, and translates Japanese into English
-l en --translate transcribes English in English, and translates Japanese into English
-l ja (no --translate) translates English into Japanese and transcribes Japanese in Japanese.
-l ja --translate transcribes English in English and translates Japanese into English

It's counterintuitive to me that the --translate flag doesn't seem to do anything, and that even without the flag the model tries to translate anyway.

sample audio file: https://cache.rebuild.fm/podcast-ep334.mp3

Is there an option I'm missing that would give me the transcript as it was spoken, without any translation?

Nov 17 '22 07:11 miyagawa

Someone correct me if I'm wrong, but I think the model only handles X -> X Transcription and X -> English translation, where X is an arbitrary language. From my own experience, I have seen people pointing out on the Python discussions that they are able to translate from English -> X, but I believe this translation is really bad, and the model is not designed for this type of translation.

From what you wrote I assume you are looking for a way to translate the English parts into English text, and the Japanese parts into Japanese text. If so, this would require a check to identify the parts of the audio that contain the different languages. E.g. splitting up the audio file into multiple parts, and running each part through the model with the correct language tag before it is glued back together. I don't think there is any support for this yet.
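A rough sketch of that splitting idea, assuming the language boundaries are already known (here they are hand-labelled, illustrative values, since nothing in whisper.cpp detects them for you). It only builds the command lines and does not run them. The `build_commands` helper and the segment list are hypothetical; `-ot` (offset in ms) and `-d` (duration in ms) are, as far as I know, options of the `./main` example, but check `./main --help` on your build.

```python
from typing import List, Tuple

# Hypothetical, hand-labelled segment: (start_ms, end_ms, language code).
Segment = Tuple[int, int, str]


def build_commands(audio: str, model: str, segments: List[Segment]) -> List[List[str]]:
    """Build one whisper.cpp invocation per language-labelled segment."""
    commands = []
    for start_ms, end_ms, lang in segments:
        if end_ms <= start_ms:
            raise ValueError(f"empty segment: {start_ms}-{end_ms}")
        commands.append([
            "./main",
            "-m", model,
            "-f", audio,
            "-l", lang,                    # language tag for this part only
            "-ot", str(start_ms),          # where to start, in milliseconds
            "-d", str(end_ms - start_ms),  # how much to process, in milliseconds
        ])
    return commands


if __name__ == "__main__":
    # Illustrative boundaries only; real ones would have to come from a
    # separate language-identification step.
    segments = [(0, 22080, "ja"), (22080, 28440, "en")]
    for cmd in build_commands("./mixed.wav", "./models/ggml-small.bin", segments):
        print(" ".join(cmd))
```

Each command could then be run with `subprocess.run` and the outputs concatenated in segment order. The hard part, finding the boundaries, is not addressed here.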

Nov 17 '22 15:11 haakonjacobsen

From what you wrote I assume you are looking for a way to translate the English parts into English text, and the Japanese parts into Japanese text.

Not translate but transcribe, but yeah - I just want transcripts out of bilingual conversations, without any translations.

Looking at the current outputs, the model seems to be able to figure out what's being spoken, but with additional translations that I don't need.

this would require a check to identify the parts of the audio that contain the different languages. E.g. splitting up the audio file into multiple parts

Yeah, but that's the exact thing I would like to avoid :)

Nov 17 '22 18:11 miyagawa

I also think that @haakonjacobsen is correct and the model probably does not support bilingual transcription.

However, I just tried the following command and it seemed like the initial transcription is exactly what you want:

./main -m ./models/ggml-small.bin -f ./mixed.wav -l ja


whisper_model_load: loading model from './models/ggml-small.bin'
whisper_model_load: n_vocab       = 51865
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 768
whisper_model_load: n_audio_head  = 12
whisper_model_load: n_audio_layer = 12
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 768
whisper_model_load: n_text_head   = 12
whisper_model_load: n_text_layer  = 12
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 3
whisper_model_load: mem_required  = 1044.00 MB
whisper_model_load: adding 1608 extra tokens
whisper_model_load: ggml ctx size = 464.56 MB
whisper_model_load: memory size =    68.48 MB
whisper_model_load: model size  =   464.44 MB

system_info: n_threads = 4 / 10 | AVX2 = 0 | AVX512 = 0 | NEON = 1 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | 

main: processing './mixed.wav' (76542537 samples, 4783.9 sec), 4 threads, 1 processors, lang = ja, task = transcribe, timestamps = 1 ...


[00:00:00.000 --> 00:00:03.480]  じゃあ今日はあの バイリングアルニュースの2人
[00:00:03.480 --> 00:00:05.480]  またね、この間もやりましたけど
[00:00:05.480 --> 00:00:07.480]  僕がバイリングアルニュースに出て
[00:00:07.480 --> 00:00:11.820]  そういうのアフトショーをこちらのリビルドでやるという
[00:00:11.820 --> 00:00:14.380]  今の3時間
[00:00:14.380 --> 00:00:15.680]  撮ったアカナラ
[00:00:15.680 --> 00:00:17.680]  ちょっと喉も痛い
[00:00:17.680 --> 00:00:19.180]  キロコンパイ
[00:00:19.180 --> 00:00:20.080]  あるのかっていうね
[00:00:20.080 --> 00:00:22.080]  まあ、ちょろっとでいいんですけどね
[00:00:22.080 --> 00:00:27.500]  So you said you made some music
[00:00:27.500 --> 00:00:28.440]  when you were in
[00:00:28.440 --> 00:00:29.780]  So 今なんか
[00:00:29.780 --> 00:00:31.780]  Soft なんに使ってるって話をして
[00:00:31.780 --> 00:00:34.280]  Michaelはガレージバンドを使ってるんですよね
[00:00:34.280 --> 00:00:35.280]  僕は今
[00:00:35.280 --> 00:00:37.780]  Logic Proという
[00:00:37.780 --> 00:00:39.780]  AppleのSoftを使ってます
[00:00:39.780 --> 00:00:41.280]  高いですね
[00:00:41.280 --> 00:00:41.780]  高いね
[00:00:41.780 --> 00:00:43.780]  まあでもOne Time Feeだから
[00:00:43.780 --> 00:00:45.780]  そのAdobeのやつとかって
[00:00:45.780 --> 00:00:46.780]  毎月10ドルとか
[00:00:46.780 --> 00:00:48.780]  20ドルとかはらなきゃいけないやつけど
[00:00:48.780 --> 00:00:50.780]  それがないんでまあ1回切り出しいいかなと思って
[00:00:50.780 --> 00:00:54.280]  4、5年前に買ってそれからずっと使ってますね
[00:00:54.280 --> 00:00:57.280]  How much would it have cost you if you were paying
[00:00:57.280 --> 00:00:58.280]  Adobe
[00:00:58.280 --> 00:01:00.280]  Like 7
[00:01:00.280 --> 00:01:02.280]  Yeah, if you were paying subscription would you have paid more than
[00:01:02.280 --> 00:01:04.280]  Yeah, a lot more than that
[00:01:04.280 --> 00:01:06.280]  The price by now, yeah
[00:01:06.280 --> 00:01:08.280]  If you go for a Adobe audition for example
[00:01:08.280 --> 00:01:10.280]  I think it costs like 10 dollars a month
[00:01:10.280 --> 00:01:12.280]  So if you use it for 4 years
[00:01:12.280 --> 00:01:14.280]  It's gonna be 480 dollars
[00:01:14.280 --> 00:01:16.280]  Right
[00:01:16.280 --> 00:01:18.280]  And Apple, it's One Time Purchase
[00:01:18.280 --> 00:01:20.280]  And there's no upgrade fee
[00:01:20.280 --> 00:01:22.280]  So you can just upgrade for free
[00:01:22.280 --> 00:01:24.280]  How much is it?
[00:01:24.280 --> 00:01:26.280]  多分200ドルだったと思います
[00:01:26.280 --> 00:01:30.280]  Less than less software is even available for One Time
[00:01:30.280 --> 00:01:32.280]  Purchase now
[00:01:32.280 --> 00:01:34.280]  I feel like
[00:01:34.280 --> 00:01:38.280]  A lot of software has turned into subscription only
[00:01:38.280 --> 00:01:42.280]  You prefer which or depends on the price I guess
[00:01:42.280 --> 00:01:44.280]  そうですね
[00:01:44.280 --> 00:01:46.280]  Userとしては1回切りの方が嬉しいんですけど
[00:01:46.280 --> 00:01:50.280]  Software開発している側からすると多分
[00:01:50.280 --> 00:01:53.280]  メジャーバージョンアップしないと
[00:01:53.280 --> 00:01:55.280]  お金が入ってこない
[00:01:55.280 --> 00:01:57.280]  レベルユーが入ってこないので
[00:01:57.280 --> 00:02:00.280]  前月3ドルとか5ドルとかで
[00:02:00.280 --> 00:02:05.280]  やった方が開発が進むってのがありますよね
[00:02:05.280 --> 00:02:07.280]  Yeah
[00:02:07.280 --> 00:02:10.280]  So, yeah
[00:02:10.280 --> 00:02:12.280]  I think that's much much better
[00:02:12.280 --> 00:02:14.280]  from the developer standpoint
[00:02:14.280 --> 00:02:15.280]  Yeah, absolutely
[00:02:15.280 --> 00:02:17.280]  business standpoint, yeah
[00:02:17.280 --> 00:02:19.280]  Even if it's
[00:02:19.280 --> 00:02:21.280]  You probably make more money too
[00:02:21.280 --> 00:02:23.280]  As long as you keep updating it
[00:02:23.280 --> 00:02:25.280]  Exactly, yeah
[00:02:25.280 --> 00:02:31.280]  Also, I think it helps for a new user to try it out
[00:02:31.280 --> 00:02:36.280]  Most of the expensive software like that has a free trial
[00:02:36.280 --> 00:02:38.280]  But for subscription model
[00:02:38.280 --> 00:02:41.280]  You can maybe use it for free for one month
[00:02:41.280 --> 00:02:45.280]  And then use it for $5 a month for example
[00:02:45.280 --> 00:02:47.280]  As compared to
[00:02:47.280 --> 00:02:49.280]  You have to pay 200 bucks
[00:02:49.280 --> 00:02:51.280]  in upfront
[00:02:51.280 --> 00:02:53.280]  That's a lot of money
[00:02:53.280 --> 00:02:55.280]  If you're not sure if you want to commit
[00:02:55.280 --> 00:02:59.280]  to using that software for a long time
[00:02:59.280 --> 00:03:02.280]  The trial thing doesn't mix as well
[00:03:02.280 --> 00:03:04.280]  with one time purchase I think
[00:03:04.280 --> 00:03:07.280]  Right, but I think Apple Logic Pro
[00:03:07.280 --> 00:03:09.280]  or Final Cut Pro
[00:03:09.280 --> 00:03:13.280]  provides 90 days free trial
[00:03:13.280 --> 00:03:16.280]  3か月とか無料で使えるんで
[00:03:16.280 --> 00:03:19.280]  あまりといいですね
[00:03:19.280 --> 00:03:23.280]  Is it a lot better than GarageBand?
[00:03:23.280 --> 00:03:30.280]  Yeah, for me, it's a lot more useful
[00:03:30.280 --> 00:03:32.280]  One thing I miss in Logic
[00:03:32.280 --> 00:03:34.280]  When I have to use GarageBand
[00:03:34.280 --> 00:03:36.280]  is that a lot of keyboard shortcuts
[00:03:36.280 --> 00:03:38.280]  are available and customizable
[00:03:38.280 --> 00:03:40.280]  free customizable
[00:03:40.280 --> 00:03:41.280]  から
[00:03:41.280 --> 00:03:44.280]  右手僕はトラックパット使ってるんですけど
[00:03:44.280 --> 00:03:45.280]  トラックパット
[00:03:45.280 --> 00:03:47.280]  Magicトラックパットで
[00:03:47.280 --> 00:03:51.280]  カットしたいところをクリックして選んで
[00:03:51.280 --> 00:03:53.280]  左手でキーボードで
[00:03:53.280 --> 00:03:56.280]  XとかGとかが出たかな
[00:03:56.280 --> 00:03:58.280]  それも全部自分でカスタマイドしてるんですけど
[00:03:58.280 --> 00:04:00.280]  そうするとカットして
[00:04:00.280 --> 00:04:02.280]  とかできるんですよ
[00:04:02.280 --> 00:04:04.280]  GarageBandは少しはできるけど
[00:04:04.280 --> 00:04:06.280]  基本的には全部手でクリックして
[00:04:06.280 --> 00:04:09.280]  右クリックして消すとか
[00:04:09.280 --> 00:04:10.280]  なるんで
[00:04:10.280 --> 00:04:13.280]  I think there's like
[00:04:13.280 --> 00:04:16.280]  Command B is like
[00:04:16.280 --> 00:04:18.280]  to use the blade
[00:04:18.280 --> 00:04:20.280]  Command T is to split
[00:04:20.280 --> 00:04:23.280]  and Command B, I don't know what it was
[00:04:23.280 --> 00:04:25.280]  is the splitter that's attached
[00:04:25.280 --> 00:04:27.280]  or maybe I'm mixing it up with Final Cut Pro
[00:04:27.280 --> 00:04:28.280]  I don't know
[00:04:28.280 --> 00:04:32.280]  I get confused between these
[00:04:32.280 --> 00:04:35.280]  and you did it, software
[00:04:35.280 --> 00:04:38.280]  So what kind of music were you making?
[00:04:38.280 --> 00:04:39.280]  I'm just curious
[00:04:39.280 --> 00:04:43.280]  昔は小学校
[00:04:43.280 --> 00:04:47.280]  小学校の時にエレクトーンを鳴らせたんですよ
[00:04:47.280 --> 00:04:49.280]  エレクトーンって分かります?
[00:04:49.280 --> 00:04:50.280]  うん
[00:04:50.280 --> 00:04:51.280]  どうした?
[00:04:51.280 --> 00:04:55.280]  エレクトーンってオルガみたいなやつなんですけど
[00:04:55.280 --> 00:04:57.280]  音がいろいろ出せて
[00:04:57.280 --> 00:04:59.280]  プログラムしてあって変えられて
[00:04:59.280 --> 00:05:02.280]  ピアノの音とか電子音とか
[00:05:02.280 --> 00:05:04.280]  上下に鍵盤があるんで
[00:05:04.280 --> 00:05:06.280]  右手でメロディ、左手でコード弾いて
[00:05:06.280 --> 00:05:08.280]  足もあるんですよ
[00:05:08.280 --> 00:05:11.280]  フットペダルでベース弾くみたいなやつで
[00:05:11.280 --> 00:05:14.280]  それを5年間鳴らっていて
[00:05:14.280 --> 00:05:17.280]  で、中学校になったら
[00:05:17.280 --> 00:05:20.280]  ベース買ってバンドにやってたんですけど
[00:05:20.280 --> 00:05:24.280]  同時にシンスサイザーも買ってもらって
[00:05:24.280 --> 00:05:28.280]  それでちょっと打ち込みっていうか
[00:05:28.280 --> 00:05:30.280]  エレクトリックミュージックみたいなの
[00:05:30.280 --> 00:05:31.280]  当時のね
[00:05:31.280 --> 00:05:35.280]  どんなジョンラのエレクトリックミュージック?
[00:05:35.280 --> 00:05:36.280]  そう思う
[00:05:36.280 --> 00:05:37.280]  アブストラック?
[00:05:37.280 --> 00:05:39.280]  いや、アブストラックはない
[00:05:39.280 --> 00:05:41.280]  ポップみたいなの
[00:05:41.280 --> 00:05:43.280]  当時はEDMの音楽がないけど
[00:05:43.280 --> 00:05:45.280]  でも、ハイスナイトの音楽に
[00:05:45.280 --> 00:05:47.280]  カタグローしてたら
[00:05:47.280 --> 00:05:49.280]  それに似てるけど
[00:05:49.280 --> 00:05:51.280]  でも、私は
[00:05:51.280 --> 00:05:53.280]  私の音楽を作っていなかった
[00:05:53.280 --> 00:05:55.280]  私は、
[00:05:55.280 --> 00:05:58.280]  コピー版の歌を作っていた
[00:05:58.280 --> 00:06:00.280]  コピー版のみたいな?
[00:06:00.280 --> 00:06:06.280]  そう、あまりオリジナルの音楽だった
[00:06:06.280 --> 00:06:09.280]  あまりオリジナルの音楽を作っていなかった
[00:06:09.280 --> 00:06:12.280]  コンピューターの音楽を作っていた
[00:06:12.280 --> 00:06:15.280]  あまりオリジナルの音楽を作っていた
[00:06:15.280 --> 00:06:18.280]  でも、私はキーボードがある
[00:06:18.280 --> 00:06:21.280]  コンピューターのタイプするキーボードじゃなくて
[00:06:21.280 --> 00:06:23.280]  音楽のキーボードも
[00:06:23.280 --> 00:06:25.280]  ちょっと前に買って
[00:06:25.280 --> 00:06:27.280]  パンデミックの時に買ったのかな

I tried only the small model - maybe it depends on which size you use.

Edit: I just tried the same command with the medium model and it outputs only Japanese - strange .. 🤔

Nov 17 '22 18:11 ggerganov

Ah, this is interesting. I was using the medium model, because medium usually outputs much more accurate (usable) transcripts for Japanese audio.

The funny thing is, both the medium and small models work really well for the first few minutes at isolating Japanese and English audio, but if you keep that command (with the small model) running, you get output like this:

[00:18:39.200 --> 00:19:03.200]  「よー」は、皆さんが20年前に言ったことです。
[00:19:03.200 --> 00:19:11.200]  「よー」は、LFGとLFGと「よー」と言うことができる。
[00:19:11.200 --> 00:19:13.200]  私もそう思います。
[00:19:13.200 --> 00:19:15.200]  私もそう思います。
[00:19:15.200 --> 00:19:17.200]  本当に?
[00:19:17.200 --> 00:19:19.200]  私もそう思います。
[00:19:19.200 --> 00:19:21.200]  「よー」は、何かを使うことができる。
[00:19:21.200 --> 00:19:23.200]  私たちは、何かを使うことができる。
[00:19:23.200 --> 00:19:25.200]  私たちは、何かを使うことができる。
[00:19:25.200 --> 00:19:27.200]  私たちは、何かを使うことができる。
[00:19:27.200 --> 00:19:29.200]  私たちは、何かを使うことができる。
[00:19:29.200 --> 00:19:31.200]  私たちは、何かを使うことができる。

The original audio around here is all in English, but the output is a translated version in Japanese, sometimes repeating the same text.
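To at least spot where this happens in a long transcript, a small post-processing sketch along these lines might help. It assumes the `[start --> end]  text` line format shown above and only collapses consecutive identical segments; it does nothing about the unwanted translation. The `collapse_repeats` helper is hypothetical and not part of whisper.cpp.

```python
import re
from typing import List, Tuple

# Matches lines such as "[00:19:13.200 --> 00:19:15.200]  text".
LINE_RE = re.compile(
    r"^\[(\d{2}:\d{2}:\d{2}\.\d{3}) --> (\d{2}:\d{2}:\d{2}\.\d{3})\]\s*(.*)$"
)


def collapse_repeats(lines: List[str]) -> List[Tuple[str, str, str, int]]:
    """Merge consecutive segments with identical text.

    Returns (start, end, text, count) tuples; count > 1 marks a repeated run.
    Lines that do not look like transcript segments are skipped.
    """
    merged: List[Tuple[str, str, str, int]] = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue
        start, end, text = m.group(1), m.group(2), m.group(3).strip()
        if merged and merged[-1][2] == text:
            prev_start, _, _, count = merged[-1]
            # Extend the previous segment instead of adding a duplicate.
            merged[-1] = (prev_start, end, text, count + 1)
        else:
            merged.append((start, end, text, 1))
    return merged
```

Runs with a high count are a reasonable hint that the decoder got stuck in a loop around that timestamp.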

Nov 17 '22 18:11 miyagawa

It seems we can't enforce transcribe-only mode no matter what. Only --model small --language fa --task transcribe seems to sometimes produce a mixed-language transcription.

Dec 07 '22 10:12 pvonmoradi

sometimes repeating the same text.

In the upstream repo, there are a couple of switches that can be tweaked to fix this, but AFAIK they are not implemented here.
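For reference, a sketch of what that could look like with the upstream Python package (openai/whisper). Which switches are meant is my assumption; these are the `transcribe()` options usually mentioned for repetition loops, and apart from `condition_on_previous_text` the values shown are simply the upstream defaults. The `transcribe` wrapper and the `ANTI_REPETITION_OPTIONS` name are hypothetical.

```python
# Options of upstream whisper's `transcribe()` that are often adjusted when
# the output gets stuck repeating itself. Treat the selection as an assumption.
ANTI_REPETITION_OPTIONS = {
    # Do not feed the previous window's text back in as a prompt.
    "condition_on_previous_text": False,
    # Segments that compress too well (i.e. are very repetitive) are retried.
    "compression_ratio_threshold": 2.4,
    # Segments with a low average log-probability are retried.
    "logprob_threshold": -1.0,
    # Used together with logprob_threshold to skip silent segments.
    "no_speech_threshold": 0.6,
}


def transcribe(path: str, model_name: str = "small", language: str = "ja") -> dict:
    """Run upstream whisper with the options above (needs `pip install openai-whisper`)."""
    import whisper  # imported lazily so the options can be inspected without it

    model = whisper.load_model(model_name)
    return model.transcribe(
        path, language=language, task="transcribe", **ANTI_REPETITION_OPTIONS
    )
```

Whether any of this helps with mixed-language audio specifically is untested here; it targets the repetition only.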

Dec 07 '22 11:12 pvonmoradi

I believe this functionality is not really supported by the model

Apr 14 '23 16:04 ggerganov
