WhisperLiveKit
Diarization quality
I tested the service now and I noticed that the diarization is pretty bad. It only works when I use `--backend whisper_timestamped`, and it often splits one speaker into multiple. Is this the current state of the art / expected from the model, or am I doing something wrong?
Hi! Yes, open-source live diarization solutions often struggle, especially at the start of a conversation. You can try using Diart directly:
```python
from diart import SpeakerDiarization
from diart.sources import MicrophoneAudioSource
from diart.inference import StreamingInference

# Default streaming diarization pipeline, fed from the microphone
pipeline = SpeakerDiarization()
mic = MicrophoneAudioSource()

# do_plot=True displays the predicted speakers live
inference = StreamingInference(pipeline, mic, do_plot=True)
prediction = inference()
```
This lets you watch live speaker identification and test the different models listed here: https://github.com/juanmc2005/diart?tab=readme-ov-file#-models
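On the "splits one speaker into multiple" issue: as a workaround, a simple post-processing pass can merge consecutive segments attributed to the same speaker. This is only a sketch, not a WhisperLiveKit or Diart API; it assumes segments are represented as plain `(start, end, speaker)` tuples:

```python
# Hypothetical post-processing: merge consecutive segments that share a
# speaker label, smoothing over spurious speaker switches.
# Segments are assumed to be (start, end, speaker) tuples.

def merge_segments(segments, max_gap=0.5):
    """Merge adjacent same-speaker segments when the silence between
    them is shorter than max_gap seconds."""
    merged = []
    for start, end, speaker in segments:
        if merged and merged[-1][2] == speaker and start - merged[-1][1] <= max_gap:
            prev_start, _, _ = merged[-1]
            merged[-1] = (prev_start, end, speaker)  # extend previous segment
        else:
            merged.append((start, end, speaker))
    return merged

segments = [(0.0, 1.2, "A"), (1.3, 2.0, "A"), (2.1, 3.0, "B"), (3.8, 4.5, "B")]
print(merge_segments(segments))
# [(0.0, 2.0, 'A'), (2.1, 3.0, 'B'), (3.8, 4.5, 'B')]
```

This won't fix wrong speaker labels, but it reduces the visual fragmentation when the model briefly flip-flops on a single speaker.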
If you notice any difference in results between using Diart directly and through WhisperLiveKit, please let me know!