whisper.cpp

[Feature] mark speakers/voices (diarization)

Open abelbabel opened this issue 1 year ago • 17 comments

Hi,

I'm not very familiar with the details of whisper or whisper.cpp, and I don't know whether this is even possible with the current foundation, but it would be nice if speakers, or at least speaker/voice changes, could be marked.

This would be very handy when processing interviews, radio/tv shows, films, etc.

Kind regards, abelbabel

abelbabel, Oct 18 '22 09:10

I think this is not an easy task when it comes to quality. I recommend using a separate model for it. From my research in this field, there is currently no very good open-source solution, but you can check out pyannote. Someone has already implemented it together with Whisper: https://github.com/Majdoddin/nlp

ArtyomZemlyak, Oct 19 '22 11:10

Yeah, I also saw this:

https://github.com/openai/whisper/discussions/264

It seems they do it in two runs: one for the spoken text and one for the speakers, and then they merge the results.

abelbabel, Oct 19 '22 12:10
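For illustration, the merge step of such a two-run approach could look roughly like the sketch below. It is not taken from the linked discussion. It assumes the transcription run yields timed text segments and the diarization run yields timed speaker turns, and it labels each segment with the speaker whose turn overlaps it the most. The names `Segment`, `Turn`, and `assign_speakers` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One transcribed piece of text (hypothetical shape of the ASR output)."""
    start: float  # seconds
    end: float    # seconds
    text: str


@dataclass
class Turn:
    """One speaker turn (hypothetical shape of the diarization output)."""
    start: float  # seconds
    end: float    # seconds
    speaker: str


def assign_speakers(segments, turns, unknown="UNKNOWN"):
    """Label each segment with the speaker whose turn overlaps it the most."""
    labeled = []
    for seg in segments:
        best_speaker, best_overlap = unknown, 0.0
        for turn in turns:
            # Length of the shared time interval; negative means no overlap.
            overlap = min(seg.end, turn.end) - max(seg.start, turn.start)
            if overlap > best_overlap:
                best_speaker, best_overlap = turn.speaker, overlap
        labeled.append((best_speaker, seg.text))
    return labeled


if __name__ == "__main__":
    segments = [Segment(0.0, 2.0, "Hello."), Segment(2.0, 5.0, "Hi there.")]
    turns = [Turn(0.0, 2.5, "A"), Turn(2.5, 6.0, "B")]
    for speaker, text in assign_speakers(segments, turns):
        print(f"[{speaker}] {text}")
```

Segments that overlap no turn at all keep the `unknown` label. A segment that spans a speaker change still gets a single label, so finer results would need word-level timestamps.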

Personally, I'd be more than happy for whisper to just do speaker detection based on the left and right channels of a stereo audio file. But I can already achieve this by running it twice.

jaybinks, Nov 05 '22 11:11

@jaybinks This can be added very easily as a built-in option. A naive algorithm would be, for each transcribed segment, to measure the signal energy in each of the 2 channels during that segment's time interval and predict the speaker based on which energy is larger.

ggerganov, Nov 05 '22 20:11
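A minimal sketch of the naive energy comparison described above, written in Python for brevity. It is not whisper.cpp code. It assumes the stereo audio has already been split into two equal-rate sample sequences and that segments arrive as `(start_seconds, end_seconds, text)` tuples. The function names are hypothetical.

```python
def channel_energy(samples, sample_rate, start, end):
    """Sum of squared samples between start and end (both in seconds)."""
    first = max(0, int(start * sample_rate))
    last = min(len(samples), int(end * sample_rate))
    return sum(s * s for s in samples[first:last])


def predict_speakers(left, right, sample_rate, segments):
    """Tag each (start, end, text) segment with 0 (left) or 1 (right).

    The louder channel during the segment wins; ties go to the left channel.
    """
    result = []
    for start, end, text in segments:
        e_left = channel_energy(left, sample_rate, start, end)
        e_right = channel_energy(right, sample_rate, start, end)
        result.append((0 if e_left >= e_right else 1, text))
    return result


if __name__ == "__main__":
    rate = 10  # tiny sample rate, just for the demo
    left = [0.5] * rate + [0.0] * rate   # left channel active in second 0..1
    right = [0.0] * rate + [0.5] * rate  # right channel active in second 1..2
    segments = [(0.0, 1.0, "first"), (1.0, 2.0, "second")]
    for speaker, text in predict_speakers(left, right, rate, segments):
        print(f"(speaker {speaker}) {text}")
```

This only works when each speaker is recorded mostly on their own channel. With heavy crosstalk or both people talking at once, the comparison becomes unreliable.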