whisper.cpp
[Feature] mark speakers/voices (diarization)
Hi,
I'm not very familiar with the details of whisper or whisper.cpp, and I don't know whether this is even possible with the current foundation, but it would be nice if speakers, or at least speaker/voice changes, could be marked.
This would be very handy when processing interviews, radio/tv shows, films, etc.
Kind regards, abelbabel
I think this is not an easy task if you want good quality. I recommend using a separate model for it. From my research in this field, there is no very good open-source solution for this yet, but you can check out pyannote. Someone has already implemented it together with whisper: https://github.com/Majdoddin/nlp
yeah, also saw this
https://github.com/openai/whisper/discussions/264
Seems as if they do it with two runs: one for the spoken text, one for the speakers and then merging the results.
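To make the merging step concrete, here is a rough sketch of how the two results could be combined. It is only an assumption about one reasonable way to do it, not necessarily what the linked discussion does: each transcribed segment gets the speaker whose turn overlaps it the most in time. The function name `assign_speakers` and the tuple layouts are made up for illustration.

```python
from typing import List, Optional, Tuple

# Hypothetical layouts, chosen only for this sketch:
#   text segment: (start_seconds, end_seconds, text)
#   speaker turn: (start_seconds, end_seconds, speaker_label)
TextSegment = Tuple[float, float, str]
SpeakerTurn = Tuple[float, float, str]


def assign_speakers(
    text_segments: List[TextSegment],
    speaker_turns: List[SpeakerTurn],
) -> List[Tuple[Optional[str], str]]:
    """Label each text segment with the speaker whose turn overlaps it most."""
    merged = []
    for start, end, text in text_segments:
        best_speaker = None
        best_overlap = 0.0
        for turn_start, turn_end, speaker in speaker_turns:
            # Length of the time interval shared by the segment and the turn;
            # zero or negative when they do not overlap.
            overlap = min(end, turn_end) - max(start, turn_start)
            if overlap > best_overlap:
                best_overlap = overlap
                best_speaker = speaker
        # best_speaker stays None if no turn overlaps the segment at all.
        merged.append((best_speaker, text))
    return merged
```

For example, with text segments `[(0.0, 2.0, "hello"), (2.0, 5.0, "hi there")]` and speaker turns `[(0.0, 2.5, "A"), (2.5, 6.0, "B")]`, the first segment is assigned to "A" and the second to "B".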
Personally, I'd be more than happy for whisper to just do speaker detection based on the left and right channels of a stereo audio file. But I can achieve this by just running it twice.
@jaybinks This can be added very easily as a built-in option. A naive algorithm would be for each transcribed segment to measure the signal energy during the time interval for that segment in the 2 channels and predict the speaker based on which one is bigger.
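A minimal sketch of that naive algorithm, written in Python for brevity rather than as whisper.cpp code. It assumes the two channels are already available as separate lists of float samples and that segments are given as (start, end) times in seconds; the names `channel_energy` and `label_segments` are hypothetical and not part of any existing API.

```python
from typing import List, Sequence, Tuple


def channel_energy(samples: Sequence[float], start: int, end: int) -> float:
    """Signal energy (sum of squared samples) over samples[start:end]."""
    return sum(s * s for s in samples[start:end])


def label_segments(
    left: Sequence[float],
    right: Sequence[float],
    segments: List[Tuple[float, float]],
    sample_rate: int,
) -> List[int]:
    """Return 0 (left channel) or 1 (right channel) for each segment.

    A segment is attributed to whichever channel has more energy during
    its time interval; ties go to the left channel.
    """
    n = min(len(left), len(right))
    labels = []
    for t0, t1 in segments:
        # Convert segment times in seconds to clamped sample indices.
        i0 = max(0, int(t0 * sample_rate))
        i1 = min(n, int(t1 * sample_rate))
        e_left = channel_energy(left, i0, i1)
        e_right = channel_energy(right, i0, i1)
        labels.append(0 if e_left >= e_right else 1)
    return labels
```

With a toy signal where only the left channel is active in the first second and only the right channel in the second, `label_segments(left, right, [(0.0, 1.0), (1.0, 2.0)], sample_rate)` yields `[0, 1]`. This only works when each speaker is recorded mostly on their own channel; crosstalk or a mono mix would defeat it.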