
Improve Speaker Diarization Accuracy ($1500)

Open aaravgarg opened this issue 4 months ago • 19 comments

We need improved speaker diarization. Right now, we just rely on Deepgram for speaker diarization, which is not good enough.

We need to look into libraries like Pyannote or an equivalent solution to add another layer of diarization for much-improved results and higher accuracy.

The diarization error rate should be reduced to at most half of what it currently is.

To claim bounty you should also:

  • find our current diarization error rate
  • compare your solution to our current error rate in speaker detection accuracy
  • your error rate in speaker labelling must be <50% of our current system's

aaravgarg avatar Aug 15 '25 23:08 aaravgarg

@Hormold

aaravgarg avatar Aug 19 '25 05:08 aaravgarg

@aaravgarg If no one is working on this, I can take it

krushnarout avatar Aug 22 '25 03:08 krushnarout

@aaravgarg If no one is working on this, I can take it

Sure, feel free to try it out. What's your action plan for this, btw?

aaravgarg avatar Aug 22 '25 03:08 aaravgarg

Is this still being worked on?

andresgomezsar avatar Sep 04 '25 02:09 andresgomezsar

Hi @aaravgarg Ankit here. Here is my PR: https://github.com/BasedHardware/omi/pull/3021 Branch Name: ThakurAnkitSingh:feat/fix-speaker-accuracy

ThakurAnkitSingh avatar Sep 20 '25 20:09 ThakurAnkitSingh

Hi @aaravgarg & @mdmohsin7 , So I closed the #3021 PR because it included unintended file changes. I opened a clean one with only the relevant updates. New PR: https://github.com/BasedHardware/omi/pull/3032. Please review it and let me know your thoughts.

Thanks for your time!

ThakurAnkitSingh avatar Sep 21 '25 21:09 ThakurAnkitSingh

Hi @aaravgarg, could you please tag @mdmohsin7 to review my PR when you have a chance? I’ve made the requested changes and would appreciate feedback whenever there’s an opportunity. 🚀 App Audio → WebSocket → /v4/listen → _listen() → process_audio_dg() → Deepgram → on_message() → [OUR ENHANCEMENT] → stream_transcript() → WebSocket → App → UI Update

PR: https://github.com/BasedHardware/omi/pull/3032

ThakurAnkitSingh avatar Sep 22 '25 20:09 ThakurAnkitSingh

Hi @aaravgarg, could you please tag @mdmohsin7 to review my PR when you have a chance? I’ve made the requested changes and would appreciate feedback whenever there’s an opportunity. 🚀 App Audio → WebSocket → /v4/listen → _listen() → process_audio_dg() → Deepgram → on_message() → [OUR ENHANCEMENT] → stream_transcript() → WebSocket → App → UI Update

PR: #3032

Please share your accuracy results: a before-and-after comparison against our current diarization accuracy.

Don't forget to test with data recorded on the omi device, not the phone mic, since that more closely represents the real-world scenario.

aaravgarg avatar Sep 30 '25 22:09 aaravgarg

The bounty is still open btw, guys. Feel free to submit PRs; this is very important to solve.

aaravgarg avatar Sep 30 '25 22:09 aaravgarg

@Hormold @andresgomezsar you guys still interested in this one?

aaravgarg avatar Sep 30 '25 22:09 aaravgarg

Increasing bounty to $1500 on this one btw @everyone

aaravgarg avatar Sep 30 '25 22:09 aaravgarg

Yes, still interested! But hasn't someone else implemented this already? I don't want my work on this to be for nothing... Are the bounties locked to a single dev?

andresgomezsar avatar Sep 30 '25 23:09 andresgomezsar

Currently working on this.

daveads avatar Oct 01 '25 15:10 daveads

Yes, still interested! But hasn't someone else implemented this already? I don't want my work on this to be for nothing... Are the bounties locked to a single dev?

Nobody has shown a good plan for this; if someone does, I will lock it.

aaravgarg avatar Oct 01 '25 19:10 aaravgarg


The plan has three main steps: first, measure how good the current system is; second, build a better system; and third, prove that the new system is actually better.


Phase 1: Check the Current System (The Baseline)

Before I can improve anything, I need to know exactly what the current problem is.

  1. Find the Current Mistake Score: I will use a measurement called Diarization Error Rate (DER) to count how many mistakes the system makes when figuring out who is speaking.
  2. Create the Perfect Answer (Ground Truth): I will take a set of audio files and use a powerful tool called pyannote.audio to create a correct answer key for who spoke when. This is my standard.
  3. Test the Current System: I will run the existing Deepgram system on the same audio files.
  4. Get the Starting Score: I will compare the Deepgram results to the perfect answer key and write down the mistake score (DER). This is my baseline: the score I must beat.
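The DER measurement in step 1 can be sketched in plain Python. This is a minimal frame-based illustration of the metric, not the actual evaluation code (pyannote.metrics ships a production-grade `DiarizationErrorRate` class); all segment times and speaker names below are made up:

```python
from itertools import permutations

def frame_labels(segments, duration, step=0.01):
    """Convert (start, end, speaker) tuples into one label per 10 ms frame."""
    n = int(round(duration / step))
    labels = [None] * n
    for start, end, spk in segments:
        for i in range(int(round(start / step)), min(int(round(end / step)), n)):
            labels[i] = spk
    return labels

def der(reference, hypothesis, duration, step=0.01):
    """Frame-based Diarization Error Rate: missed speech, false alarms,
    and speaker confusion, under the best speaker-label mapping."""
    ref = frame_labels(reference, duration, step)
    hyp = frame_labels(hypothesis, duration, step)
    ref_spks = sorted({s for s in ref if s is not None})
    hyp_spks = sorted({s for s in hyp if s is not None})
    best = None
    # Try every assignment of hypothesis speakers to reference speakers
    for perm in permutations(hyp_spks):
        mapping = dict(zip(perm, ref_spks))
        errors = sum(
            1
            for r, h in zip(ref, hyp)
            if (r is None) != (h is None) or (r is not None and mapping.get(h) != r)
        )
        best = errors if best is None else min(best, errors)
    total = sum(1 for r in ref if r is not None)  # total speech frames
    return best / total

# Hypothetical 9-second clip: the hypothesis switches speakers 0.5 s early
reference = [(0.0, 5.0, "alice"), (5.0, 9.0, "bob")]
hypothesis = [(0.0, 4.5, "spk0"), (4.5, 9.0, "spk1")]
print(f"baseline DER = {der(reference, hypothesis, 9.0):.2%}")  # 5.56%
```

For the real comparison, the same `der` call would be run once on Deepgram's output and once on the improved system's output, against the same ground-truth segments.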

Phase 2: Build the Improved Solution

  1. I will use pyannote.audio for speaker identification.

  2. Combine Systems:

    • Step 1: Deepgram does its normal work (transcribing and initial speaker labeling).
    • Step 2: The pyannote tool steps in to fix Deepgram's mistakes, making the speaker labels much more precise.
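The "pyannote fixes Deepgram's labels" step above could be sketched as follows, assuming Deepgram returns per-word timestamps and pyannote returns speaker turns. The field names and the `relabel_words` helper are hypothetical illustrations, not the actual PR code:

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length of the intersection of two time intervals, in seconds."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def relabel_words(words, turns):
    """Give each transcribed word the pyannote speaker whose turn overlaps
    it the most; keep Deepgram's original label when nothing overlaps."""
    out = []
    for w in words:
        best = max(
            turns,
            key=lambda t: overlap(w["start"], w["end"], t["start"], t["end"]),
            default=None,
        )
        if best and overlap(w["start"], w["end"], best["start"], best["end"]) > 0:
            out.append({**w, "speaker": best["speaker"]})
        else:
            out.append(dict(w))
    return out

# Hypothetical Deepgram words and pyannote speaker turns
words = [
    {"word": "hello", "start": 0.1, "end": 0.4, "speaker": 0},
    {"word": "hi", "start": 5.2, "end": 5.5, "speaker": 0},  # Deepgram missed the speaker change
]
turns = [
    {"start": 0.0, "end": 5.0, "speaker": "SPEAKER_00"},
    {"start": 5.0, "end": 9.0, "speaker": "SPEAKER_01"},
]
for w in relabel_words(words, turns):
    print(w["word"], w["speaker"])
```

Maximum-overlap assignment is a common way to fuse a diarizer's speaker turns with an ASR system's word timings; the harder production questions (streaming latency, mapping pyannote's anonymous labels to stable user identities) are left out of this sketch.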

Phase 3: Prove the Improvement

Finally, I will test the new system the same way I tested the old one to make sure it's better.

  1. Test the New System: I will run the combined Deepgram + pyannote system on the exact same audio files.
  2. Find the New Error Rate: I will compute the DER of the new system's output against the same answer key.
  3. Compare and Confirm: Then I will check the old score and the new score side by side. The goal is to cut the error rate by at least 50%, as the issue requires.
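The acceptance check in step 3 is simple arithmetic; with made-up numbers (the real before/after DERs come from the measurements in phases 1 and 3):

```python
baseline_der = 0.30  # hypothetical DER of the current Deepgram-only system
improved_der = 0.12  # hypothetical DER of the Deepgram + pyannote system

reduction = 1 - improved_der / baseline_der
print(f"error rate cut by {reduction:.0%}")  # 60%

# Bounty requirement: new error rate must be < 50% of the current one
assert improved_der < 0.5 * baseline_der
```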

@aaravgarg I briefly explained how I was going to go about this on Discord, if you remember...

daveads avatar Oct 01 '25 20:10 daveads

@daveads from my experience with one weekend project (diarizing youtube interviewers, for the purpose of speaker-specific playback rates), I doubt pyannote.audio is good enough for ground truth, at least not without a lot of effort and experience with it.

@aaravgarg this task description should clarify whether test sets from freely-available corpora (there are plenty out there), with the audio played back through speakers for the omi device to listen to, would be acceptable as part of the evaluation.

but also, seems like you might as well use a best-in-class commercial service for diarization on some tens or hundreds of hours of Omi-recorded audio from devs/insiders. of course, takes some time to discover what "best-in-class commercial service" is. i'd wager some google service is among them.

DustinWehr avatar Oct 06 '25 22:10 DustinWehr

@daveads from my experience with one weekend project (diarizing youtube interviewers, for the purpose of speaker-specific playback rates), I doubt pyannote.audio is good enough for ground truth, at least not without a lot of effort and experience with it.

@aaravgarg this task description should clarify whether test sets from freely-available corpora (there are plenty out there), with the audio played back through speakers for the omi device to listen to, would be acceptable as part of the evaluation.

but also, seems like you might as well use a best-in-class commercial service for diarization on some tens or hundreds of hours of Omi-recorded audio from devs/insiders. of course, takes some time to discover what "best-in-class commercial service" is. i'd wager some google service is among them.

Basically, I'm just experimenting to see what works best for the current situation.

daveads avatar Oct 06 '25 22:10 daveads

https://github.com/BasedHardware/omi/pull/3280

neooriginal avatar Oct 25 '25 17:10 neooriginal

Is this bounty still open?

sivanimohan avatar Nov 27 '25 14:11 sivanimohan