whisper-diarization

Diarization issue: all dialogues are assigned to Speaker 0 only.

Open manjunath7472 opened this issue 1 year ago • 5 comments

Below is an audio file to reproduce the issue: Audio.

Actual output:

Speaker Name,in,out,Text

Speaker 0,00:04:41.4,00:07:32.15,You You You You You You You You You You You You You You You You You You You You You You You You You You You You You You You You You You You You You You You You You You You I will give you my feedback.

Speaker 0,00:07:36.0,00:07:36.7,Okay.

Speaker 0,00:07:36.8,00:07:36.22,"All right, dear."

Speaker 0,00:07:37.4,00:07:38.21,So let's start with today's class.

Speaker 0,00:07:39.7,00:07:54.15,"And we are going to do C six today and C five you did with some other teacher, right?"

Speaker 0,00:07:55.14,00:07:55.19,Yeah.

Speaker 0,00:07:56.13,00:07:56.22,Okay.

Speaker 0,00:07:57.10,00:08:00.22,"Yeah, because I was on well, so I canceled the class."

Speaker 0,00:08:00.23,00:08:02.7,So you did it with the other teacher.

Speaker 0,00:08:02.18,00:08:02.24,Yes.

Speaker 0,00:08:04.10,00:08:05.1,"Okay, great."

Speaker 0,00:08:05.7,00:08:06.9,So you understood that?

Speaker 0,00:08:08.12,00:08:10.11,Can you tell me you understood that?

Speaker 0,00:08:10.13,00:08:13.19,Can you tell me what concept did you learn in the last class?

Speaker 0,00:08:14.10,00:08:17.5,"Yeah, I didn't understand it."

Speaker 0,00:08:22.15,00:08:23.19,You didn't understand that?

Speaker 0,00:08:24.16,00:08:25.19,I understand it.

Speaker 0,00:08:26.7,00:08:27.21,"Okay, so what was it?"

Speaker 0,00:08:28.2,00:08:31.4,Can you tell me which game you created in that class?

Speaker 0,00:08:32.11,00:08:34.11,Chasing the mouse.

Speaker 0,00:08:34.16,00:08:38.7,"Oh, that's an interesting game."

Speaker 0,00:08:38.8,00:08:38.11,Yes.

Speaker 0,00:08:49.1,00:08:49.6,Good.

Speaker 0,00:08:49.7,00:08:49.23,Fantastic.
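A quick way to confirm this symptom on an exported file is to count the rows per speaker label. The following is a minimal sketch, assuming the export is a CSV with the `Speaker Name,in,out,Text` header shown above; the helper names `speaker_counts` and `looks_collapsed` are hypothetical and not part of whisper-diarization.

```python
import csv
import io
from collections import Counter


def speaker_counts(csv_text):
    """Count rows per speaker label in a 'Speaker Name,in,out,Text' CSV export."""
    # Drop blank lines, which appear between rows in the pasted output.
    lines = [line for line in csv_text.splitlines() if line.strip()]
    reader = csv.DictReader(io.StringIO("\n".join(lines)))
    return Counter(row["Speaker Name"] for row in reader)


def looks_collapsed(csv_text):
    """Return True when there are several rows and all share one speaker label."""
    counts = speaker_counts(csv_text)
    return len(counts) == 1 and sum(counts.values()) > 1


# A few rows copied from the output above.
sample = '''Speaker Name,in,out,Text

Speaker 0,00:07:36.0,00:07:36.7,Okay.

Speaker 0,00:07:36.8,00:07:36.22,"All right, dear."

Speaker 0,00:07:55.14,00:07:55.19,Yeah.
'''

print(speaker_counts(sample))
print(looks_collapsed(sample))
```

A check like this only flags the collapse; it cannot tell whether the recording really has a single speaker, so it is best used on audio known to contain more than one voice.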

manjunath7472 • Oct 19 '23 11:10

Yep, I got the same error. Have you found the cause?

v-nhandt21 • Nov 01 '23 09:11

No solution yet?

solucionesuno • Nov 23 '23 16:11

I have the same issue and investigated it. It appears that the "Speaker 0" label on every line is the direct output of the underlying diarization model from the NeMo toolkit: nemo.collections.asr.models.msdd_models.NeuralDiarizer. So there is a bug in the NeMo toolkit, not in this library. We might all be better off trying pyannote for the diarization.
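For anyone who wants to try the pyannote route, here is a minimal sketch, not a tested replacement for the NeMo stage. It assumes pyannote.audio 3.x (where `Pipeline.from_pretrained` takes `use_auth_token`), a Hugging Face token with access to the `pyannote/speaker-diarization-3.1` pipeline, and transcript segments already available as `(start, end, text)` tuples in seconds. The helpers `pyannote_turns` and `assign_speakers` are hypothetical names, and the maximum-overlap rule is one simple way to attach labels, not necessarily what whisper-diarization does.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(segments, turns):
    """Label each transcript segment with the speaker that overlaps it most.

    segments: list of (start, end, text) tuples, times in seconds.
    turns:    list of (start, end, speaker) tuples from any diarizer.
    Returns a list of (speaker, start, end, text); speaker is None when
    no turn overlaps the segment.
    """
    labeled = []
    for seg_start, seg_end, text in segments:
        totals = {}
        for turn_start, turn_end, speaker in turns:
            shared = overlap(seg_start, seg_end, turn_start, turn_end)
            if shared > 0:
                totals[speaker] = totals.get(speaker, 0.0) + shared
        best = max(totals, key=totals.get) if totals else None
        labeled.append((best, seg_start, seg_end, text))
    return labeled


def pyannote_turns(audio_path, hf_token):
    """Run a pretrained pyannote pipeline and return (start, end, speaker) turns.

    Assumption: pyannote.audio 3.x is installed and hf_token grants access
    to the gated pipeline. The import is local so the helpers above can be
    used without pyannote installed.
    """
    from pyannote.audio import Pipeline

    pipeline = Pipeline.from_pretrained(
        "pyannote/speaker-diarization-3.1", use_auth_token=hf_token
    )
    diarization = pipeline(audio_path)
    return [
        (turn.start, turn.end, speaker)
        for turn, _, speaker in diarization.itertracks(yield_label=True)
    ]


# Usage with hand-written turns, so the example runs without pyannote:
segments = [(0.0, 1.0, "Okay."), (1.0, 3.0, "Yes.")]
turns = [(0.0, 1.2, "SPEAKER_00"), (1.2, 3.0, "SPEAKER_01")]
for row in assign_speakers(segments, turns):
    print(row)
```

With real audio, `turns = pyannote_turns("audio.wav", hf_token)` would replace the hand-written list; whether that avoids the single-speaker collapse on a given recording still has to be checked.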

rbracco • Nov 29 '23 22:11

I found that the problem comes from the quality of the NeMo model: https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/diar_msdd_telephonic

Since this model is only trained on telephonic speech, diarization performance on other acoustic conditions might show a degraded performance compared to telephonic speech.

v-nhandt21 • Dec 01 '23 03:12

Had the same issue with Demucs enabled. Disabling it (--no-stem) helped.

kalisgd0 • Dec 15 '23 13:12