dsnote
speaker diarization
Dsnote is great for STT using Whisper. For audio samples with several people speaking, e.g. podcasts, movies …, one ends up with a messy text, because Whisper doesn't do what's called "speaker diarization", that is, identifying which voice is which. It seems there is a solution to this. Maybe you want to check the following:
https://ultracrepidarian.phfactor.net/
Definitely, looks very interesting.
Processing pipeline seems to be as follows:
- Audio transcription => "words" + timestamps
- Audio segmentation => "speaker-id" + timestamps
- Matching "words" to "speaker-id" based on timestamps
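The last step of the pipeline above can be sketched in plain Python. This is only an illustration of the matching idea, not Speech Note's actual implementation: the `words` and `segments` data shapes are hypothetical stand-ins for what an STT model and a segmentation model would emit, and each word is simply assigned to the speaker segment it overlaps the most.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length of the overlap between two time intervals, in seconds."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def assign_speakers(words, segments):
    """Label each (word, start, end) tuple with the speaker whose
    diarization segment overlaps it the most; "unknown" if none do."""
    labeled = []
    for word, w_start, w_end in words:
        best, best_ov = "unknown", 0.0
        for speaker, s_start, s_end in segments:
            ov = overlap(w_start, w_end, s_start, s_end)
            if ov > best_ov:
                best, best_ov = speaker, ov
        labeled.append((word, best))
    return labeled

# Hypothetical example: two speakers taking turns.
words = [("hello", 0.0, 0.4), ("there", 0.5, 0.9), ("hi", 1.2, 1.5)]
segments = [("SPEAKER_00", 0.0, 1.0), ("SPEAKER_01", 1.0, 2.0)]
print(assign_speakers(words, segments))
# → [('hello', 'SPEAKER_00'), ('there', 'SPEAKER_00'), ('hi', 'SPEAKER_01')]
```

A real implementation would also need to handle overlapping speech and words that straddle a segment boundary, which is where word-level (rather than sentence-level) timestamps pay off.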
Currently Speech Note does not handle timestamps, so this would have to be implemented, but I already need timestamps for subtitles support anyway.
"Audio segmentation" is done with extra model pyannote/segmentation-3.0. To download this model you need to have Hugging Face account which might be problematic.
I will investigate what can be done.
I am not really in need of this, but can imagine that other users might want this. I guess eventually there will be an easily accessible open-source audio segmentation model. Actually, it is quite surprising that there is so much available for users free of charge (including giving some personal information like an e-mail-address).
I'm actually hoping for diarization. my use case (discourse analytics) benefits from a stable differentiation.
I did some research to find out what is possible. It looks as follows:
- Almost everyone uses pyannote segmentation models for diarization. The models work well... but not perfectly. The main problem is that, although the models are made available under the MIT license, downloading them from Hugging Face requires an account and agreeing to the following conditions:
I must say, I don't like it. Especially the "You need to agree to share your contact" part. Speech Note is a privacy-focused application, and "sharing your contact information" doesn't fit well.
- There is also "experimental" support for diarization in whisper.cpp. I love whisper.cpp and it's already integrated into Speech Note. The problem is that to use diarization you need to download a special model that combines STT and diarization. Currently this single model is only for English :/
I'll keep looking for a better solution...
You are right. Privacy is an important asset. Regarding the “experimental” support for diarization in whisper.cpp, I think you should give it a go, if it is not too difficult to implement, and even if it is only for English at the moment.
I'd need it for German … I checked the thread above, from what I understand it's still rather experimental. As a user I would be OK to accept the terms for being able to diarize – as long as there's no viable alternative.
I found this repo, maybe it could be leveraged somehow: https://github.com/m-bain/whisperX. But it seems to rely on pyannote as well.
Yes, WhisperX uses the same pyannote models, so you have to pass an HF token to use diarization :(
Just learned about this
https://joss.theoj.org/papers/10.21105/joss.05266
Diart: A Python Library for Real-Time Speaker Diarization
@devSJR unfortunately the same pyannote models are needed to make it work :(
https://github.com/juanmc2005/diart?tab=readme-ov-file#get-access-to--pyannote-models
OK, will keep looking
Possibly relevant sources (some are more directly usable than others):