Speaker diarization

Open ealmloff opened this issue 1 year ago • 3 comments

Specific Demand

Kalosm currently supports audio transcription with the Whisper model, but the results can be difficult to use without knowing which parts of the transcript belong to which speaker. Speech diarization splits audio into segments and tags each segment with a speaker.

Implement Suggestion

It looks like there are two main approaches for speech diarization in Python:

  1. TitaNet: NeMo Speaker Diarization uses this model
  2. voxceleb-resnet: pyannote uses this model

It looks like there is an existing library that implements the pyannote pipeline here. We could either try to integrate that model with Kalosm streams, or create an implementation of the same pipeline in candle.

ealmloff avatar Sep 07 '24 17:09 ealmloff

I’d also love this. Maybe the pipeline of https://github.com/m-bain/whisperX can be a good reference?

ericwaetke avatar Jun 02 '25 15:06 ericwaetke

This probably isn't the best issue for a first contribution, I think, especially since I'm not the best Rust dev. Is there any way I can support the development of this feature?

ericwaetke avatar Jun 02 '25 16:06 ericwaetke

Porting the pyannote model (voxceleb-resnet) and clustering pipeline to candle seems like the most difficult piece here. Integrating that into Whisper is relatively easy.
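
To give a rough idea of what the integration step could look like, here is a minimal sketch that labels each transcript segment with the speaker whose diarization turns overlap it the most. This is only an illustration under assumptions: `TranscriptSegment`, `SpeakerTurn`, and `assign_speakers` are hypothetical names and not part of Kalosm's API, and the speaker turns are assumed to already come from some diarization model.

```rust
/// A transcribed chunk of audio with start/end times in seconds.
/// Hypothetical type for illustration; not Kalosm's actual segment type.
#[derive(Debug, Clone, PartialEq)]
pub struct TranscriptSegment {
    pub start: f64,
    pub end: f64,
    pub text: String,
}

/// A span of audio attributed to one speaker by a diarization model.
/// Hypothetical type; the speaker id is just an index assigned by clustering.
#[derive(Debug, Clone, PartialEq)]
pub struct SpeakerTurn {
    pub start: f64,
    pub end: f64,
    pub speaker: usize,
}

/// Length in seconds of the overlap between two intervals (0.0 if disjoint).
fn overlap(a_start: f64, a_end: f64, b_start: f64, b_end: f64) -> f64 {
    (a_end.min(b_end) - a_start.max(b_start)).max(0.0)
}

/// Label each transcript segment with the speaker whose turns overlap it
/// the most. A segment that no turn overlaps is labeled `None`.
pub fn assign_speakers(
    segments: &[TranscriptSegment],
    turns: &[SpeakerTurn],
) -> Vec<(Option<usize>, TranscriptSegment)> {
    segments
        .iter()
        .map(|segment| {
            // Sum the overlap per speaker, since one speaker may have
            // several turns inside a single transcript segment.
            let mut totals: Vec<(usize, f64)> = Vec::new();
            for turn in turns {
                let shared = overlap(segment.start, segment.end, turn.start, turn.end);
                if shared <= 0.0 {
                    continue;
                }
                match totals.iter_mut().find(|entry| entry.0 == turn.speaker) {
                    Some(entry) => entry.1 += shared,
                    None => totals.push((turn.speaker, shared)),
                }
            }
            // Pick the speaker with the largest total overlap, if any.
            let best = totals
                .into_iter()
                .max_by(|a, b| a.1.total_cmp(&b.1))
                .map(|(speaker, _)| speaker);
            (best, segment.clone())
        })
        .collect()
}
```

The sketch works on whole segments for simplicity. A streaming version would need to buffer speaker turns until the matching transcript segment arrives, and word-level timestamps would allow splitting a segment where the speaker changes.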

ealmloff avatar Jun 03 '25 02:06 ealmloff