whisper.cpp `--diarize` flag is unreliable


Open: savchenko opened this issue 2 years ago • 1 comment


Windows binary is from https://github.com/ggerganov/whisper.cpp/actions/runs/3596200207 ( https://github.com/ggerganov/whisper.cpp/commit/061fc81bd699cc3f7a66ecb5377cc4cfa24898f2 )

I have an audio recording of a conversation between two speakers, split between the left and right channels. There is no echo or audio bleed.

In the example below, the second line contains sentences spoken by two different speakers, yet the whole line is labelled "speaker 1". In reality, Speaker 1 finishes with "...what the website is", and the next sentence, starting with "Because there's like...", belongs to Speaker 0.

[00:24:18.160 --> 00:24:24.400]  (speaker 1) XXXXXXXXX can do XXXXXXXXX these things. And then also once they do machine learning stuff,
--[ this line ]--> [00:24:24.400 --> 00:24:30.800]  (speaker 1) it's basically what the website is. Because there's like our capabilities include XXXXXXXXX,
[00:24:30.800 --> 00:24:40.720]  (speaker 0) site analysis, and then installing PyTorch. Well, I do remember one thing that he has

Is there any other information you might need to localise the bug?

Dec 02 '22 06:12 savchenko

Waveform screenshot to check the separation:

Dec 02 '22 06:12 savchenko

Yes, this is expected. The implemented strategy is very basic and cannot be expected to always work reliably. In this case it fails because a single text segment contains speech by both speakers, while the strategy assumes that only one speaker talks within each segment (https://github.com/ggerganov/whisper.cpp/issues/64#issuecomment-1304639213).

Dec 02 '22 18:12 ggerganov
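
To illustrate why a one-speaker-per-segment assumption produces the mislabelled line above, here is a minimal Python sketch of a per-segment channel comparison. It is not the whisper.cpp code: the function names, the 1.1 margin, the mapping of the left channel to speaker 0, and the synthetic samples are all assumptions made for illustration.

```python
def channel_energy(samples, start, end):
    """Sum of absolute amplitudes over samples[start:end]."""
    return sum(abs(s) for s in samples[start:end])


def label_segment(left, right, t0, t1, sample_rate=16000, margin=1.1):
    """Assign ONE speaker label to the whole segment [t0, t1] (seconds).

    Hypothetical helper: left channel is assumed to be speaker 0,
    right channel speaker 1. Returns "?" when neither channel is
    louder than the other by the (assumed) margin.
    """
    start = max(0, int(t0 * sample_rate))
    end = min(len(left), len(right), int(t1 * sample_rate))
    e_left = channel_energy(left, start, end)
    e_right = channel_energy(right, start, end)
    if e_left > margin * e_right:
        return "0"
    if e_right > margin * e_left:
        return "1"
    return "?"


# Synthetic 6.4 s segment at a toy rate of 100 samples/s:
# speaker 1 (right) talks for the first 4 s, speaker 0 (left) for the last 2.4 s.
RATE = 100
left = [0.0] * 400 + [0.5] * 240
right = [0.5] * 400 + [0.0] * 240

whole = label_segment(left, right, 0.0, 6.4, RATE)   # one label for both speakers
first = label_segment(left, right, 0.0, 4.0, RATE)   # only speaker 1 is talking
second = label_segment(left, right, 4.0, 6.4, RATE)  # only speaker 0 is talking
print(whole, first, second)
```

In this sketch the whole segment comes out as speaker 1, because that speaker dominates the total energy, even though the last 2.4 seconds belong to speaker 0. Labelling the two halves separately recovers both speakers, which suggests that shorter segments, or segments split at the speaker change, would be needed for this kind of strategy to label such a line correctly.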
