whisper-diarization
Diarization issue: all dialogues are assigned to Speaker 0 only.
Below is an audio file to reproduce the issue. Audio.
Actual output.
Speaker Name,in,out,Text
Speaker 0,00:04:41.4,00:07:32.15,You You You You You You You You You You You You You You You You You You You You You You You You You You You You You You You You You You You You You You You You You You You I will give you my feedback.
Speaker 0,00:07:36.0,00:07:36.7,Okay.
Speaker 0,00:07:36.8,00:07:36.22,"All right, dear."
Speaker 0,00:07:37.4,00:07:38.21,So let's start with today's class.
Speaker 0,00:07:39.7,00:07:54.15,"And we are going to do C six today and C five you did with some other teacher, right?"
Speaker 0,00:07:55.14,00:07:55.19,Yeah.
Speaker 0,00:07:56.13,00:07:56.22,Okay.
Speaker 0,00:07:57.10,00:08:00.22,"Yeah, because I was on well, so I canceled the class."
Speaker 0,00:08:00.23,00:08:02.7,So you did it with the other teacher.
Speaker 0,00:08:02.18,00:08:02.24,Yes.
Speaker 0,00:08:04.10,00:08:05.1,"Okay, great."
Speaker 0,00:08:05.7,00:08:06.9,So you understood that?
Speaker 0,00:08:08.12,00:08:10.11,Can you tell me you understood that?
Speaker 0,00:08:10.13,00:08:13.19,Can you tell me what concept did you learn in the last class?
Speaker 0,00:08:14.10,00:08:17.5,"Yeah, I didn't understand it."
Speaker 0,00:08:22.15,00:08:23.19,You didn't understand that?
Speaker 0,00:08:24.16,00:08:25.19,I understand it.
Speaker 0,00:08:26.7,00:08:27.21,"Okay, so what was it?"
Speaker 0,00:08:28.2,00:08:31.4,Can you tell me which game you created in that class?
Speaker 0,00:08:32.11,00:08:34.11,Chasing the mouse.
Speaker 0,00:08:34.16,00:08:38.7,"Oh, that's an interesting game."
Speaker 0,00:08:38.8,00:08:38.11,Yes.
Speaker 0,00:08:49.1,00:08:49.6,Good.
Speaker 0,00:08:49.7,00:08:49.23,Fantastic.
Yep, I got the same error. Have you found the cause?
No solution yet?
I have the same issue and investigated it. It appears that "Speaker 0" on every line is the direct output of the underlying diarization model from the NeMo toolkit: nemo.collections.asr.models.msdd_models.NeuralDiarizer. So the bug appears to be in the NeMo toolkit, not in this library. We might all be better off trying pyannote for the diarization.
I found that the problem comes from the quality of the NeMo model: https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/diar_msdd_telephonic
The model card says: "Since this model is only trained on telephonic speech, diarization performance on other acoustic conditions might show a degraded performance compared to telephonic speech."
Had the same with demucs. Disabling it (--no-stem) helped.
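For anyone comparing runs (for example with and without --no-stem), here is a rough sketch for checking whether an output file has collapsed to a single speaker label. It assumes the CSV layout shown in the original report, with a "Speaker Name" column. The function names speaker_counts and is_collapsed are made up for this example and are not part of whisper-diarization or NeMo.

```python
import csv
import io
from collections import Counter


def speaker_counts(csv_text: str) -> Counter:
    """Count transcript rows per speaker label in CSV text.

    Assumes a header row containing a "Speaker Name" column,
    as in the output pasted in the original report.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row["Speaker Name"] for row in reader)


def is_collapsed(csv_text: str) -> bool:
    """Return True when there are rows but only one distinct speaker label."""
    return len(speaker_counts(csv_text)) == 1


# A few rows copied from the output in the report above.
sample = '''Speaker Name,in,out,Text
Speaker 0,00:07:36.0,00:07:36.7,Okay.
Speaker 0,00:07:36.8,00:07:36.22,"All right, dear."
Speaker 0,00:07:55.14,00:07:55.19,Yeah.
'''

print(speaker_counts(sample))
print(is_collapsed(sample))
```

A True result only means that a single label was used. A recording that really has one speaker would look the same, so this is a quick sanity check rather than a measure of diarization quality.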