Speaker diarization
Specific Demand
Kalosm currently has support for audio transcription with the Whisper model, but the results can be difficult to use without knowing which speaker each part of the transcript belongs to. Speech diarization segments audio into spans tagged with speaker labels.
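To make the demand concrete, here is a minimal sketch of what diarized transcription output could look like on the consumer side. `DiarizedSegment` and its fields are placeholders I made up for illustration, not an existing Kalosm type:

```rust
/// Hypothetical shape of a transcript segment tagged with its speaker.
#[derive(Debug, Clone)]
struct DiarizedSegment {
    /// Cluster label assigned by the diarization pipeline, e.g. "SPEAKER_00".
    speaker: String,
    /// Start/end of the segment in seconds.
    start: f32,
    end: f32,
    /// Text produced by Whisper for this span.
    text: String,
}

fn main() {
    // The shape consumers would iterate over once diarization is wired in.
    let segments = vec![
        DiarizedSegment { speaker: "SPEAKER_00".into(), start: 0.0, end: 2.4, text: "Hi there.".into() },
        DiarizedSegment { speaker: "SPEAKER_01".into(), start: 2.4, end: 5.1, text: "Hello!".into() },
    ];
    for s in &segments {
        println!("[{}] {:.1}-{:.1}s: {}", s.speaker, s.start, s.end, s.text);
    }
}
```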
Implementation Suggestion
It looks like there are two main approaches for speech diarization in Python:
- TitaNet: NeMo speaker diarization uses this model
- voxceleb-resnet: pyannote uses this model
It looks like there is an existing library that implements the pyannote pipeline here; we could try to integrate that model with Kalosm streams, or create an implementation of the same pipeline in Candle.
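For reference, here is a rough Rust sketch of the clustering stage such a pipeline needs, assuming per-window speaker embeddings are already available (e.g. from a voxceleb-resnet model ported to Candle). Greedy threshold clustering stands in here for the agglomerative clustering pyannote actually uses, so treat it as a simplification, not the real algorithm:

```rust
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (na * nb + 1e-8)
}

/// Assign each embedding to the existing cluster whose centroid is most
/// similar (if similar enough), otherwise start a new cluster. Returns one
/// speaker id per input window.
fn cluster_embeddings(embeddings: &[Vec<f32>], threshold: f32) -> Vec<usize> {
    let mut centroids: Vec<Vec<f32>> = Vec::new();
    let mut labels = Vec::with_capacity(embeddings.len());
    for emb in embeddings {
        let best = centroids
            .iter()
            .enumerate()
            .map(|(i, c)| (i, cosine_similarity(emb, c)))
            .max_by(|a, b| a.1.total_cmp(&b.1));
        match best {
            Some((i, sim)) if sim >= threshold => labels.push(i),
            _ => {
                centroids.push(emb.clone());
                labels.push(centroids.len() - 1);
            }
        }
    }
    labels
}
```

The hard part is producing those embeddings in the first place, which is exactly the model-porting work described above.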
I’d also love this. Maybe the pipeline of https://github.com/m-bain/whisperX could be a good reference?
This probably isn’t the best first-contribution issue, I think, especially since I’m not the best Rust dev. Is there any way I can support the development of this feature?
Porting the pyannote model (voxceleb-resnet) and the clustering pipeline to Candle seems like the most difficult piece here; integrating that into Whisper is relatively easy.
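As a rough illustration of that "relatively easy" integration step, the sketch below tags each Whisper segment with the diarization turn that overlaps it the most. All of these types are hypothetical stand-ins, not Kalosm's real API:

```rust
// Illustrative types: diarization turns (speaker + time range) and Whisper
// transcript segments (time range + text).
struct Turn { speaker: usize, start: f32, end: f32 }
struct WhisperSegment { start: f32, end: f32, text: String }

/// Length of the overlap between two time ranges, in seconds (0 if disjoint).
fn overlap(a_start: f32, a_end: f32, b_start: f32, b_end: f32) -> f32 {
    (a_end.min(b_end) - a_start.max(b_start)).max(0.0)
}

/// Pair each transcript segment with the speaker whose turn overlaps it most.
fn assign_speakers(turns: &[Turn], segments: &[WhisperSegment]) -> Vec<(usize, String)> {
    segments
        .iter()
        .map(|seg| {
            let speaker = turns
                .iter()
                .max_by(|a, b| {
                    overlap(seg.start, seg.end, a.start, a.end)
                        .total_cmp(&overlap(seg.start, seg.end, b.start, b.end))
                })
                .map(|t| t.speaker)
                .unwrap_or(0);
            (speaker, seg.text.clone())
        })
        .collect()
}
```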