whisper-diarization
The Diarization does not work
I fed it sample audio files of 3 minutes and 30 minutes. For both inputs (.mp3), I get the same diarization result, where all the text is attributed to a single speaker.
1
00:00:04,540 --> 30:03:25,791
Speaker 0:
I saw this issue when the input audio was cut abruptly. I tried extending it and it worked, though this process has lots of glitches. What is your audio length? Did you try diarizing a longer audio file?
In the transcribe cell, we need to remove the lines below to avoid a kernel crash while transcribing. This also solves the problem where all dialogue is attributed to one speaker.
del whisper_model
torch.cuda.empty_cache()
@manjunath7472 Yes, I have done that. Another question I have: according to the configurations provided, there can be 3 types (meeting, telephonic, general), but the MSDD model for meeting and general is actually set to None. So we can't use those configurations, which means we can effectively predict only 2 speakers. The only MSDD model with a valid path is the one for the telephonic type: "diar_msdd_telephonic".
Can you add more info on this ?
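For reference, the situation described above can be sketched as a small lookup (a hypothetical illustration; the names below are my own assumptions, not the repo's actual code):

```python
# Hypothetical sketch of the domain-to-MSDD-model mapping described above.
# Only "telephonic" has a pretrained checkpoint; the dict and function
# names are illustrative, not taken from the repository.
MSDD_MODELS = {
    "telephonic": "diar_msdd_telephonic",  # valid NGC checkpoint
    "meeting": None,                       # no pretrained model
    "general": None,                       # no pretrained model
}

def resolve_msdd_model(domain: str) -> str:
    """Return the MSDD checkpoint name for a domain, or raise if none exists."""
    model = MSDD_MODELS.get(domain)
    if model is None:
        raise ValueError(
            f"No pretrained MSDD model for domain '{domain}'; "
            "only 'telephonic' ships with one."
        )
    return model
```

So picking "meeting" or "general" fails before diarization even starts, because there is no checkpoint to load.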
@projects-g , @manjunath7472 can you provide me the audio file to reproduce this issue?
The following lines just clear GPU memory for the subsequent steps; they have absolutely no effect on the results:
del whisper_model
torch.cuda.empty_cache()
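A hedged sketch of that cleanup, wrapped in a guard so it also runs on machines without PyTorch or CUDA (the helper name is my own, not from the repo):

```python
import gc

def free_cuda_cache():
    """Release Python garbage and any cached CUDA blocks.

    This only frees memory; it cannot change transcription or
    diarization results.
    """
    gc.collect()
    try:
        import torch
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
    except ImportError:
        pass  # torch not installed; nothing to clear

# usage after transcription: del whisper_model; free_cuda_cache()
```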
@MahmoudAshraf97 I understood that they have no effect other than on memory. I cannot share the file as it is huge. I will try to test with another ~10 min input file, as my initial one was only 2 minutes.
But could you add any info about the other part of my question, about the msdd_model being available only for the "telephonic" type and not for the other two (general, meeting)? If one were to use the "meeting" or "general" type for diarization, how would one go about it?
Below is the requested audio file with the same issue. Initially, with the default settings in transcribe(), it attributes all dialogue to a single speaker. Audio. Then I added the following to transcribe() and it transcribes fine:
vad_parameters=dict(threshold=0.4, max_speech_duration_s=15)
But diarization didn't cluster anything; it just labelled all dialogue as Speaker 0.
A short result is below:
Speaker Name,in,out,Text
Speaker 0,00:04:41.4,00:07:32.15,You You You You You You You You You You You You You You You You You You You You You You You You You You You You You You You You You You You You You You You You You You You I will give you my feedback.
Speaker 0,00:07:36.0,00:07:36.7,Okay.
Speaker 0,00:07:36.8,00:07:36.22,"All right, dear."
Speaker 0,00:07:37.4,00:07:38.21,So let's start with today's class.
Speaker 0,00:07:39.7,00:07:54.15,"And we are going to do C six today and C five you did with some other teacher, right?"
Speaker 0,00:07:55.14,00:07:55.19,Yeah.
Speaker 0,00:07:56.13,00:07:56.22,Okay.
Speaker 0,00:07:57.10,00:08:00.22,"Yeah, because I was on well, so I canceled the class."
Speaker 0,00:08:00.23,00:08:02.7,So you did it with the other teacher.
Speaker 0,00:08:02.18,00:08:02.24,Yes.
Speaker 0,00:08:04.10,00:08:05.1,"Okay, great."
Speaker 0,00:08:05.7,00:08:06.9,So you understood that?
Speaker 0,00:08:08.12,00:08:10.11,Can you tell me you understood that?
Speaker 0,00:08:10.13,00:08:13.19,Can you tell me what concept did you learn in the last class?
Speaker 0,00:08:14.10,00:08:17.5,"Yeah, I didn't understand it."
Speaker 0,00:08:22.15,00:08:23.19,You didn't understand that?
Speaker 0,00:08:24.16,00:08:25.19,I understand it.
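To illustrate what max_speech_duration_s=15 does at the VAD level, here is a rough sketch (my own simplification, not faster-whisper's actual implementation): any detected speech region longer than the cap is split into chunks, which prevents one multi-minute segment from swallowing the whole recording.

```python
def split_long_segments(segments, max_speech_duration_s=15.0):
    """Split (start, end) speech regions longer than the cap into chunks.

    Simplified illustration of a VAD max-duration limit; a real VAD
    splits at low-energy points rather than at fixed offsets.
    """
    out = []
    for start, end in segments:
        while end - start > max_speech_duration_s:
            out.append((start, start + max_speech_duration_s))
            start += max_speech_duration_s
        out.append((start, end))
    return out

# One 40 s region becomes three chunks of at most 15 s each.
print(split_long_segments([(0.0, 40.0)]))
# → [(0.0, 15.0), (15.0, 30.0), (30.0, 40.0)]
```

Note this only affects segmentation for transcription; it does not by itself fix the single-speaker clustering problem.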
Since this model is only trained on telephonic speech, diarization performance on other acoustic conditions may be degraded compared to telephonic speech.
I found the model from the NGC of Nvidia: https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/diar_msdd_telephonic
For the general and meeting types, there is no pretrained model support.
I tried cloning the NeMo package and printing the MSDD model's predictions, but it seems the misprediction comes from the Nvidia model, not from the repo author's implementation.
Anyway, I am trying to find out which kinds of audio are suitable for this model.
Training MSDD from scratch seems to be hardcore :))
I have the same issue with long files! Any ideas on how to solve the problem?