whisper-diarization

Diarization does not work well for all audio files

Open Asma-droid opened this issue 1 year ago • 11 comments

Diarization is not working well for me. For some audio files, all segments are identified as Speaker 0. Are there any ways to improve diarization quality?

Asma-droid avatar Dec 22 '23 09:12 Asma-droid

I had the same problem

Yjppj avatar Dec 28 '23 03:12 Yjppj

I just ran into this. Do you get the same results if you were to try the transcription on just a few minutes of your audio at a time?

I have an audio file that's 41 minutes long and has about 16 speakers (total), and every single speaker comes out as "Speaker 0" in the transcript. However, if I extract a 5-minute chunk of the audio file and transcribe just that chunk, I get pretty good results, with two similar-sounding speakers sometimes getting categorized together (understandable, though).

I have also noticed that I get different speaker identification from the parallel vs. non-parallel diarize.py with the five-minute clip. For my clip, the parallel version looked more correct (the transcripts were identical between the two; only the speaker identification differed, and it was less accurate in the non-parallel version).
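Since chunking the audio seems to work around the problem, here is a minimal sketch of that workaround using only Python's stdlib `wave` module. It assumes uncompressed PCM WAV input; the `split_wav` helper and its output filenames are hypothetical, not part of the repo. Each resulting chunk can then be passed to diarize.py separately:

```python
# Hypothetical helper: split a long PCM WAV file into fixed-length
# chunks so each chunk can be diarized on its own.
import wave

def split_wav(path, chunk_seconds=300, prefix="chunk"):
    """Split `path` into WAV files of at most `chunk_seconds` each.

    Returns the list of chunk filenames written.
    """
    out_files = []
    with wave.open(path, "rb") as src:
        params = src.getparams()
        frames_per_chunk = params.framerate * chunk_seconds
        index = 0
        while True:
            frames = src.readframes(frames_per_chunk)
            if not frames:
                break
            name = f"{prefix}_{index:03d}.wav"
            with wave.open(name, "wb") as dst:
                dst.setparams(params)
                dst.writeframes(frames)
            out_files.append(name)
            index += 1
    return out_files
```

Note this cuts at arbitrary frame boundaries, so a sentence spanning a cut may be transcribed poorly; splitting on silence (e.g. with ffmpeg's silencedetect filter) would be gentler.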

GuyPaddock avatar Jul 28 '24 14:07 GuyPaddock

@GuyPaddock , can you upload the audio file to reproduce?

MahmoudAshraf97 avatar Jul 28 '24 15:07 MahmoudAshraf97

Sadly, it's a student project with audio taken from some commercial IP I can't share publicly, but if there's some private way to share it I could.

GuyPaddock avatar Jul 28 '24 15:07 GuyPaddock

Also, thank you for the fast reply and for your work on this project! I am happy to help any way that I can.

GuyPaddock avatar Jul 28 '24 15:07 GuyPaddock

You can upload it to a Drive link and share it with me: [email protected]

MahmoudAshraf97 avatar Jul 28 '24 15:07 MahmoudAshraf97

Shared! Let me know if you don't receive an email from Drive.

GuyPaddock avatar Jul 28 '24 16:07 GuyPaddock

Got it. I'm currently working on pushing some updates to this repo; I'll debug it afterwards.

MahmoudAshraf97 avatar Jul 28 '24 16:07 MahmoudAshraf97

Thanks!

GuyPaddock avatar Jul 28 '24 16:07 GuyPaddock

BTW, I am not using stems because I got better transcription for this particular audio file without them. I'm also using Whisper large-v3. So my command line looks like this:

python diarize_parallel.py --audio ./input.wav --language en --whisper-model large-v3 --device 'cuda' --no-stem

GuyPaddock avatar Jul 28 '24 16:07 GuyPaddock

@MahmoudAshraf97 Today I tried actually listening to the audio at the timestamps indicated in the SRT for the file I shared with you, and I'm finding that the times don't line up with the lines being spoken. In many cases, the timestamps in the SRT are 5-15 seconds ahead of the text that was transcribed, and the span might not even contain the entire line of dialogue. I wonder if that's impacting speaker attribution?
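In case the offset turns out to be roughly constant across the file, a crude workaround while reviewing is to shift every SRT timestamp by a fixed amount. This is just a sketch (the helper names are hypothetical, and the regex naively rewrites anything that looks like an `HH:MM:SS,mmm` timestamp, including such strings inside subtitle text):

```python
# Hypothetical sketch: shift all SRT timestamps by a fixed offset
# to compensate for subtitles that consistently lead the audio.
import re

TS = re.compile(r"(\d{2}):(\d{2}):(\d{2}),(\d{3})")

def _shift_match(match, offset_ms):
    h, m, s, ms = (int(g) for g in match.groups())
    total = max(0, (h * 3600 + m * 60 + s) * 1000 + ms + offset_ms)
    h, rem = divmod(total, 3600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def shift_srt(text, offset_ms):
    """Shift every timestamp in an SRT document by offset_ms milliseconds.

    Negative results are clamped to 00:00:00,000.
    """
    return TS.sub(lambda m: _shift_match(m, offset_ms), text)
```

This only masks the symptom, of course; if the misalignment varies per segment, it points at the alignment step itself rather than a constant delay.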

GuyPaddock avatar Aug 18 '24 02:08 GuyPaddock