Improve Speaker Diarization Accuracy ($1500)
We need improved speaker diarization. Right now, we just rely on Deepgram for speaker diarization, which is not good enough.
We need to look into libraries like Pyannote or an equivalent solution to add another layer of diarization for much-improved results and higher accuracy.
The error rate in diarization should be reduced to at least half of what it currently is.
To claim bounty you should also:
- find our current diarization error rate
- compare your solution to our current error rate in speaker detection accuracy
- your error rate in speaker labelling must be <50% of our current system's error rate in speaker labelling
@Hormold
@aaravgarg If no one working, I can take this
Sure, feel free to try out. What's your action plan for this btw
Is this still being worked on?
Hi @aaravgarg Ankit here. Here is my PR: https://github.com/BasedHardware/omi/pull/3021 Branch Name: ThakurAnkitSingh:feat/fix-speaker-accuracy
Hi @aaravgarg & @mdmohsin7 , So I closed the #3021 PR because it included unintended file changes. I opened a clean one with only the relevant updates. New PR: https://github.com/BasedHardware/omi/pull/3032. Please review it and let me know your thoughts.
Thanks for your time!
Hi @aaravgarg, could you please tag @mdmohsin7 to review my PR when you have a chance? I've made the requested changes and would appreciate feedback whenever there's an opportunity.
App Audio → WebSocket → /v4/listen → _listen() → process_audio_dg() → Deepgram → on_message() → [OUR ENHANCEMENT] → stream_transcript() → WebSocket → App → UI Update
PR: https://github.com/BasedHardware/omi/pull/3032
Please share your accuracy results: a before-and-after comparison with our current diarization accuracy.
Don't forget to test this with data from the omi device and not the phone mic, since that more closely represents the real-world scenario.
Bounty is still open btw guys feel free to submit PRs, very important to solve
@Hormold @andresgomezsar you guys still interested in this one?
Increasing bounty to $1500 on this one btw @everyone
Yes still interested! But hasn't someone else implemented this? Don't want to work on this to be for nothing... Are the bounties locked to a single dev?
currently on this.
Nobody has shown a good plan for this, if someone does, will lock
The whole plan has three main steps: first, find out how good the current system is; second, build a better system; and third, prove that the new system is actually better.
Phase 1: Check the Current System (The Baseline)
Before I can improve anything, I need to know exactly what the current problem is.
- Find the Current Mistake Score: I will use a measurement called Diarization Error Rate (DER) to count how many mistakes the system makes when figuring out who is speaking.
- Create the Perfect Answer (Ground Truth): I will take a set of audio files and use a powerful tool called pyannote.audio to create a correct answer key for who spoke when. This is my standard.
- Test the Current System: I will run the existing Deepgram system on the same audio files.
- Get the Starting Score: I will compare the Deepgram results to the perfect answer key and write down the mistake score (DER). This is my baseline: the score I must beat.
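The DER scoring above can be sketched in plain Python. This is a simplified frame-based version (no forgiveness collar, single-speaker frames only), not pyannote's exact implementation; the segment tuples and speaker names are made-up examples:

```python
import itertools

def frame_der(ref, hyp, duration, step=0.01):
    """Frame-based Diarization Error Rate: (missed speech + false alarm +
    speaker confusion) divided by total reference speech time, using the
    best one-to-one mapping of hypothesis labels onto reference labels."""
    n = int(round(duration / step))

    def labels(segments):
        out = [None] * n
        for start, end, spk in segments:
            for i in range(int(round(start / step)), min(int(round(end / step)), n)):
                out[i] = spk
        return out

    r, h = labels(ref), labels(hyp)
    ref_speakers = sorted({s for s in r if s is not None})
    hyp_speakers = sorted({s for s in h if s is not None})
    best = None
    # try every one-to-one mapping of hypothesis labels onto reference labels
    for perm in itertools.permutations(ref_speakers, min(len(ref_speakers), len(hyp_speakers))):
        mapping = dict(zip(hyp_speakers, perm))
        errors = 0
        for ri, hi in zip(r, h):
            if ri is None and hi is None:
                continue                      # silence in both: no error
            if ri is None or hi is None:      # false alarm or missed speech
                errors += 1
            elif mapping.get(hi) != ri:       # speaker confusion
                errors += 1
        best = errors if best is None else min(best, errors)
    ref_frames = sum(1 for x in r if x is not None)
    return best / ref_frames

# toy example: hypothesis switches speakers 0.5 s too early
ref = [(0.0, 5.0, "A"), (5.0, 9.0, "B")]
hyp = [(0.0, 4.5, "s0"), (4.5, 9.0, "s1")]
print(f"DER: {frame_der(ref, hyp, 9.0):.3f}")  # → 0.056 (confusion on the 0.5 s boundary region)
```

In practice pyannote.metrics ships a proper DiarizationErrorRate class; this sketch just makes the metric concrete.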
Phase 2: Build the Improved Solution
- Use pyannote.audio for speaker identification.
- Combine Systems:
  - Step 1: Deepgram does its normal work (transcribing and initial speaker labeling).
  - Step 2: The pyannote tool steps in to fix Deepgram's mistakes, making the speaker labels much more precise.
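Step 2 could look something like the sketch below, which relabels each Deepgram word with the pyannote speaker turn that overlaps it most. The word/turn data shapes and names here are assumptions for illustration, not Omi's actual structures:

```python
def relabel_words(words, turns):
    """Overwrite Deepgram's per-word speaker labels with the pyannote
    speaker whose turn overlaps each word the most. Words with no
    overlapping turn keep their original Deepgram speaker."""
    out = []
    for w in words:
        best_spk, best_overlap = w.get("speaker"), 0.0
        for t_start, t_end, spk in turns:
            overlap = min(w["end"], t_end) - max(w["start"], t_start)
            if overlap > best_overlap:
                best_spk, best_overlap = spk, overlap
        out.append({**w, "speaker": best_spk})
    return out

# hypothetical Deepgram words (start/end in seconds) and pyannote turns
words = [
    {"word": "hello", "start": 0.2, "end": 0.6, "speaker": 0},
    {"word": "there", "start": 4.7, "end": 5.1, "speaker": 0},  # Deepgram mislabeled this one
]
turns = [(0.0, 4.5, "SPEAKER_00"), (4.5, 9.0, "SPEAKER_01")]
print(relabel_words(words, turns))
```

Maximum-overlap assignment is a common way to fuse word-level timestamps with segment-level diarization, since the two systems rarely agree exactly on boundaries.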
Phase 3: Prove the Improvement
Finally, I will test the new system the same way I tested the old one to make sure it's better.
- Test the New System: I will run the combined Deepgram + Pyannote system on the exact same audio files.
- Find the error rate
- Compare and Confirm: Then I will check the old score and the new score side by side. The goal is to cut the mistake rate by at least 50%, as the issue requires.
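The pass/fail check for the comparison is just a relative-reduction calculation; with hypothetical before/after DER numbers:

```python
baseline_der, new_der = 0.30, 0.12  # hypothetical numbers, not measured results
reduction = 1 - new_der / baseline_der
print(f"relative DER reduction: {reduction:.0%}")  # → 60%
assert reduction >= 0.5, "bounty requires cutting the error rate by at least half"
```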
@aaravgarg i briefly explained how i was going to go about this on discord, if you remember...
@daveads from my experience with one weekend project (diarizing youtube interviewers, for the purpose of speaker-specific playback rates), I doubt pyannote.audio is good enough for ground truth, at least not without a lot of effort and experience with it.
@aaravgarg this task description should clarify whether test-sets from freely-available corpuses (there are plenty out there), with the audio part played back through speakers for the omi device to listen to, would suffice, or whether that's acceptable as a part of the evaluation.
but also, seems like you might as well use a best-in-class commercial service for diarization on some tens or hundreds of hours of Omi-recorded audio from devs/insiders. of course, takes some time to discover what "best-in-class commercial service" is. i'd wager some google service is among them.
Basically, I'm just experimenting to see what works best for the current situation.
https://github.com/BasedHardware/omi/pull/3280
is this bounty still open?