Speaker diarization
Specific Demand
Kalosm currently has support for audio transcription with the Whisper model, but the results can be difficult to use without knowing which speaker each part of the transcript belongs to. Speech diarization segments audio into spans tagged with speaker labels.
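To make the demand concrete, here is a minimal sketch of what diarized transcription output could look like on the consumer side. `DiarizedSegment` and its fields are placeholders I made up for illustration, not an existing Kalosm type:

```rust
/// Hypothetical shape of a transcript segment tagged with its speaker.
#[derive(Debug, Clone)]
struct DiarizedSegment {
    /// Cluster label assigned by the diarization pipeline, e.g. "SPEAKER_00".
    speaker: String,
    /// Start/end of the segment in seconds.
    start: f32,
    end: f32,
    /// Text produced by Whisper for this span.
    text: String,
}

fn main() {
    // The shape consumers would iterate over once diarization is wired in.
    let segments = vec![
        DiarizedSegment { speaker: "SPEAKER_00".into(), start: 0.0, end: 2.4, text: "Hi there.".into() },
        DiarizedSegment { speaker: "SPEAKER_01".into(), start: 2.4, end: 5.1, text: "Hello!".into() },
    ];
    for s in &segments {
        println!("[{}] {:.1}-{:.1}s: {}", s.speaker, s.start, s.end, s.text);
    }
}
```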
Implementation Suggestion
It looks like there are two main approaches for speech diarization in Python:
- TitaNet: NeMo speaker diarization uses this model
- voxceleb-resnet: pyannote uses this model
It looks like there is an existing library that implements the pyannote pipeline here; we could try to integrate that model with Kalosm streams, or create an implementation of the same pipeline in Candle.
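For reference, here is a rough Rust sketch of the clustering stage such a pipeline needs, assuming per-window speaker embeddings are already available (e.g. from a voxceleb-resnet model ported to Candle). Greedy threshold clustering stands in here for the agglomerative clustering pyannote actually uses, so treat it as a simplification, not the real algorithm:

```rust
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (na * nb + 1e-8)
}

/// Assign each embedding to the existing cluster whose centroid is most
/// similar (if similar enough), otherwise start a new cluster. Returns one
/// speaker id per input window.
fn cluster_embeddings(embeddings: &[Vec<f32>], threshold: f32) -> Vec<usize> {
    let mut centroids: Vec<Vec<f32>> = Vec::new();
    let mut labels = Vec::with_capacity(embeddings.len());
    for emb in embeddings {
        let best = centroids
            .iter()
            .enumerate()
            .map(|(i, c)| (i, cosine_similarity(emb, c)))
            .max_by(|a, b| a.1.total_cmp(&b.1));
        match best {
            Some((i, sim)) if sim >= threshold => labels.push(i),
            _ => {
                centroids.push(emb.clone());
                labels.push(centroids.len() - 1);
            }
        }
    }
    labels
}
```

The hard part is producing those embeddings in the first place, which is exactly the model-porting work described above.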
I’d also love this. Maybe the pipeline of https://github.com/m-bain/whisperX could be a good reference?
This probably isn’t the best first-contribution issue, I think, especially since I’m not the best Rust dev. Is there any way I can support the development of this feature?
Porting the pyannote model (voxceleb-resnet) and the clustering pipeline to Candle seems like the most difficult piece here; integrating that into Whisper is relatively easy.
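As a rough illustration of that "relatively easy" integration step, the sketch below tags each Whisper segment with the diarization turn that overlaps it the most. All of these types are hypothetical stand-ins, not Kalosm's real API:

```rust
// Illustrative types: diarization turns (speaker + time range) and Whisper
// transcript segments (time range + text).
struct Turn { speaker: usize, start: f32, end: f32 }
struct WhisperSegment { start: f32, end: f32, text: String }

/// Length of the overlap between two time ranges, in seconds (0 if disjoint).
fn overlap(a_start: f32, a_end: f32, b_start: f32, b_end: f32) -> f32 {
    (a_end.min(b_end) - a_start.max(b_start)).max(0.0)
}

/// Pair each transcript segment with the speaker whose turn overlaps it most.
fn assign_speakers(turns: &[Turn], segments: &[WhisperSegment]) -> Vec<(usize, String)> {
    segments
        .iter()
        .map(|seg| {
            let speaker = turns
                .iter()
                .max_by(|a, b| {
                    overlap(seg.start, seg.end, a.start, a.end)
                        .total_cmp(&overlap(seg.start, seg.end, b.start, b.end))
                })
                .map(|t| t.speaker)
                .unwrap_or(0);
            (speaker, seg.text.clone())
        })
        .collect()
}
```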