whisper.cpp

`--diarize` flag is unreliable

Open savchenko opened this issue 2 years ago • 1 comments

Windows binary is from https://github.com/ggerganov/whisper.cpp/actions/runs/3596200207 ( https://github.com/ggerganov/whisper.cpp/commit/061fc81bd699cc3f7a66ecb5377cc4cfa24898f2 )

I have an audio recording of two speakers having a conversation, split between the left and right channels. There is no echo or audio bleed between the channels.

In the example below, the 2nd line contains sentences said by two different speakers, both labelled as "speaker 1". In reality, Speaker 1 finished with "...what the website is", and the next sentence, starting with "Because there's like...", belongs to Speaker 0.

[00:24:18.160 --> 00:24:24.400]  (speaker 1) XXXXXXXXX can do XXXXXXXXX these things. And then also once they do machine learning stuff,
--[ this line ]--> [00:24:24.400 --> 00:24:30.800]  (speaker 1) it's basically what the website is. Because there's like our capabilities include XXXXXXXXX,
[00:24:30.800 --> 00:24:40.720]  (speaker 0) site analysis, and then installing PyTorch. Well, I do remember one thing that he has

Is there any other information you might need to localise the bug?

savchenko avatar Dec 02 '22 06:12 savchenko

Waveform screenshot to check the separation:

image

savchenko avatar Dec 02 '22 06:12 savchenko

Yes, this is expected. The implemented strategy is super basic and cannot be expected to always work reliably. In this case it fails because a single text segment contains speech by both speakers, while the strategy assumes that only one speaker talks within each segment (https://github.com/ggerganov/whisper.cpp/issues/64#issuecomment-1304639213).

ggerganov avatar Dec 02 '22 18:12 ggerganov
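
The thread does not show the implementation, so the following is only a rough sketch of how a per-segment strategy of this kind could behave. It assumes that each text segment gets a single label based on which stereo channel carries more energy over the segment's time range. The function name `label_segment`, its parameters, and the `1.1` margin are hypothetical choices for illustration and are not taken from the thread.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: assign ONE speaker label to a whole text segment.
// left/right : samples of the two stereo channels (one speaker per channel)
// begin/end  : sample range [begin, end) covered by the text segment
// margin     : how much louder a channel must be to win (assumed value)
std::string label_segment(const std::vector<float>& left,
                          const std::vector<float>& right,
                          std::size_t begin, std::size_t end,
                          double margin = 1.1) {
    // Clamp the range so we never read past either channel.
    end = std::min({end, left.size(), right.size()});

    // Accumulate the absolute amplitude of each channel over the segment.
    double energy0 = 0.0;
    double energy1 = 0.0;
    for (std::size_t i = begin; i < end; ++i) {
        energy0 += std::fabs(left[i]);
        energy1 += std::fabs(right[i]);
    }

    // The louder channel wins; near-ties are left undecided.
    if (energy0 > margin * energy1) return "(speaker 0)";
    if (energy1 > margin * energy0) return "(speaker 1)";
    return "(speaker ?)";
}
```

Under these assumptions, a segment in which the right-channel speaker talks for most of the time and the left-channel speaker only starts near the end is labelled "(speaker 1)" as a whole, which matches the behaviour reported above. Getting the tail attributed correctly would require splitting the segment at the speaker change, which a single per-segment label cannot express.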