dsnote
speaker diarization
Dsnote is great for STT using Whisper. For audio samples with several people speaking, e.g. podcasts, movies …, one ends up with a messy text, because Whisper doesn't do what's called "speaker diarization", that is, identifying which voice is which. It seems there is a solution to this. Maybe you want to check the following:
https://ultracrepidarian.phfactor.net/
Definitely, looks very interesting.
Processing pipeline seems to be as follows:
- Audio transcription => "words" + timestamps
- Audio segmentation => "speaker-id" + timestamps
- Matching "words" to "speaker-id" based on timestamps
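The last step of the pipeline above can be sketched in plain Python. This is only an illustration of the matching idea, not Speech Note's actual implementation: the `words` and `segments` data shapes are hypothetical stand-ins for what an STT model and a segmentation model would emit, and each word is simply assigned to the speaker segment it overlaps the most.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length of the overlap between two time intervals, in seconds."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def assign_speakers(words, segments):
    """Label each (word, start, end) tuple with the speaker whose
    diarization segment overlaps it the most; "unknown" if none do."""
    labeled = []
    for word, w_start, w_end in words:
        best, best_ov = "unknown", 0.0
        for speaker, s_start, s_end in segments:
            ov = overlap(w_start, w_end, s_start, s_end)
            if ov > best_ov:
                best, best_ov = speaker, ov
        labeled.append((word, best))
    return labeled

# Hypothetical example: two speakers taking turns.
words = [("hello", 0.0, 0.4), ("there", 0.5, 0.9), ("hi", 1.2, 1.5)]
segments = [("SPEAKER_00", 0.0, 1.0), ("SPEAKER_01", 1.0, 2.0)]
print(assign_speakers(words, segments))
# → [('hello', 'SPEAKER_00'), ('there', 'SPEAKER_00'), ('hi', 'SPEAKER_01')]
```

A real implementation would also need to handle overlapping speech and words that straddle a segment boundary, which is where word-level (rather than sentence-level) timestamps pay off.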
Currently Speech Note does not handle timestamps, so this would have to be implemented, but I already need timestamps for subtitles support anyway.
"Audio segmentation" is done with extra model pyannote/segmentation-3.0. To download this model you need to have Hugging Face account which might be problematic.
I will investigate what can be done.
I am not really in need of this, but can imagine that other users might want this. I guess eventually there will be an easily accessible open-source audio segmentation model. Actually, it is quite surprising that there is so much available for users free of charge (including giving some personal information like an e-mail-address).
I'm actually hoping for diarization. my use case (discourse analytics) benefits from a stable differentiation.
I did some research to find out what is possible. It looks as follows:
- Almost everyone uses pyannote segmentation models for diarization. The models work well... but not perfectly. The main problem is that, although the models are made available under the MIT license, downloading them from Hugging Face requires an account and agreeing to the following conditions:
I must say, I don't like it. Especially the "You need to agree to share your contact" part. Speech Note is a privacy-focused application, and "sharing your contact information" doesn't fit well.
- There is also "experimental" support for diarization in whisper.cpp. I love whisper.cpp and it's already integrated into Speech Note. The problem is that to use diarization you need to download a special model that combines STT and diarization. Currently this single model is only for English :/
I'll keep looking for a better solution...
You are right. Privacy is an important asset. Regarding the “experimental” support for diarization in whisper.cpp, I think you should give it a go, if it is not too difficult to implement, and even if it is only for English at the moment.
I'd need it for German … I checked the thread above, from what I understand it's still rather experimental. As a user I would be OK to accept the terms for being able to diarize – as long as there's no viable alternative.
I found this repo, maybe it could be leveraged somehow: https://github.com/m-bain/whisperX. But it seems to rely on pyannote as well.
Yes, WhisperX uses the same pyannote models, so you have to pass an HF token to use diarization :(
Just learned about this
https://joss.theoj.org/papers/10.21105/joss.05266
Diart: A Python Library for Real-Time Speaker Diarization
@devSJR unfortunately the same pyannote models are needed to make it work :(
https://github.com/juanmc2005/diart?tab=readme-ov-file#get-access-to--pyannote-models
OK, will keep looking
Possibly relevant sources (some are more directly usable than others):