whisper-diarization

Diarization does not work well for all audio files

Open Asma-droid opened this issue 1 year ago • 11 comments

Diarization is not working well for me. For some audio files, all segments are identified as Speaker 0. Are there any ways to improve diarization quality?

Asma-droid avatar Dec 22 '23 09:12 Asma-droid

I had the same problem

Yjppj avatar Dec 28 '23 03:12 Yjppj

I just ran into this. Do you get the same results if you were to try the transcription on just a few minutes of your audio at a time?

I have an audio file that's 41 minutes long and has about 16 speakers (total), and every single speaker comes out as "Speaker 0" in the transcript. However, if I extract a 5-minute chunk of the audio file and transcribe just that chunk, I get pretty good results, with two similar-sounding speakers sometimes getting categorized together (understandable, though).

I have also noticed that I get different speaker identification from the parallel vs. non-parallel diarize.py with the five-minute clip. For my clip, the parallel version looked more correct (the transcripts were identical between the two; only the speaker identification differed, and it was less accurate in the non-parallel version).
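Since chunking the audio seems to work around the problem, here is a minimal sketch of that workaround using only Python's stdlib `wave` module. It assumes uncompressed PCM WAV input; the `split_wav` helper and its output filenames are hypothetical, not part of the repo. Each resulting chunk can then be passed to diarize.py separately:

```python
# Hypothetical helper: split a long PCM WAV file into fixed-length
# chunks so each chunk can be diarized on its own.
import wave

def split_wav(path, chunk_seconds=300, prefix="chunk"):
    """Split `path` into WAV files of at most `chunk_seconds` each.

    Returns the list of chunk filenames written.
    """
    out_files = []
    with wave.open(path, "rb") as src:
        params = src.getparams()
        frames_per_chunk = params.framerate * chunk_seconds
        index = 0
        while True:
            frames = src.readframes(frames_per_chunk)
            if not frames:
                break
            name = f"{prefix}_{index:03d}.wav"
            with wave.open(name, "wb") as dst:
                dst.setparams(params)
                dst.writeframes(frames)
            out_files.append(name)
            index += 1
    return out_files
```

Note this cuts at arbitrary frame boundaries, so a sentence spanning a cut may be transcribed poorly; splitting on silence (e.g. with ffmpeg's silencedetect filter) would be gentler.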

GuyPaddock avatar Jul 28 '24 14:07 GuyPaddock

@GuyPaddock , can you upload the audio file to reproduce?

MahmoudAshraf97 avatar Jul 28 '24 15:07 MahmoudAshraf97

Sadly, it's a student project with audio taken from some commercial IP I can't share publicly, but if there's some private way to share it I could.

GuyPaddock avatar Jul 28 '24 15:07 GuyPaddock

Also, thank you for the fast reply and for your work on this project! I am happy to help any way that I can.

GuyPaddock avatar Jul 28 '24 15:07 GuyPaddock

You can upload it to a Drive link and share it with me: [email protected]

MahmoudAshraf97 avatar Jul 28 '24 15:07 MahmoudAshraf97

Shared! Let me know if you don't receive an email from Drive.

GuyPaddock avatar Jul 28 '24 16:07 GuyPaddock

Got it. I'm currently working on pushing some updates to this repo; I'll debug it afterwards.

MahmoudAshraf97 avatar Jul 28 '24 16:07 MahmoudAshraf97

Thanks!

GuyPaddock avatar Jul 28 '24 16:07 GuyPaddock

BTW, I am not using stems because I got better transcription for this particular audio file without them. I'm also using Whisper large-v3. So my command line looks like this:

python diarize_parallel.py --audio ./input.wav --language en --whisper-model large-v3 --device 'cuda' --no-stem

GuyPaddock avatar Jul 28 '24 16:07 GuyPaddock

@MahmoudAshraf97 Today I tried actually listening to the audio at the timestamps indicated in the SRT for the file I shared with you, and I'm finding that the times don't line up with the lines being spoken. In many cases, the timestamps in the SRT are 5-15 seconds ahead of the text that was transcribed, and the span might not even contain the entire line of dialogue. I wonder if that's impacting speaker attribution?
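In case the offset turns out to be roughly constant across the file, a crude workaround while reviewing is to shift every SRT timestamp by a fixed amount. This is just a sketch (the helper names are hypothetical, and the regex naively rewrites anything that looks like an `HH:MM:SS,mmm` timestamp, including such strings inside subtitle text):

```python
# Hypothetical sketch: shift all SRT timestamps by a fixed offset
# to compensate for subtitles that consistently lead the audio.
import re

TS = re.compile(r"(\d{2}):(\d{2}):(\d{2}),(\d{3})")

def _shift_match(match, offset_ms):
    h, m, s, ms = (int(g) for g in match.groups())
    total = max(0, (h * 3600 + m * 60 + s) * 1000 + ms + offset_ms)
    h, rem = divmod(total, 3600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def shift_srt(text, offset_ms):
    """Shift every timestamp in an SRT document by offset_ms milliseconds.

    Negative results are clamped to 00:00:00,000.
    """
    return TS.sub(lambda m: _shift_match(m, offset_ms), text)
```

This only masks the symptom, of course; if the misalignment varies per segment, it points at the alignment step itself rather than a constant delay.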

GuyPaddock avatar Aug 18 '24 02:08 GuyPaddock