unreliable recording and speaker diarization

Open kodjima33 opened this issue 1 month ago • 7 comments

I talked to 7 omi users (mobile), and the top-1 issue I heard was poor quality of diarization (speaker assignment) and reliability.

As I'm writing this, I just had a 20-minute conversation with someone, but only 2 (!!!) minutes were captured. I was the only one speaking in that conversation, yet our app didn't even assign my name.

  • [ ] Check why omi didn't capture 80% of this conversation - make sure I understand the problem and then fix it
  • [ ] suggest Nik a solution for speaker diarization to make it THE BEST IN THE DAMN WORLD, don't be lazy even if you need to invent smth new
  • [ ] make sure that ALMOST EVERY CONVERSATION TITLE CONTAINS NAMES and THEY ARE CORRECT - see granola reference below
Image

kodjima33 · Nov 27 '25 02:11

do this today pls @beastoin

kodjima33 · Nov 27 '25 02:11

> omi didn't capture 80% of this conversation

You used omi on macOS, right?

beastoin · Nov 27 '25 02:11

> 7 omi users (mobile) and top-1 issue heard was poor quality of diarization (speaker assignment) and reliability

I need all of their contacts so I can talk to them; above all, I need their UIDs to check the data.

beastoin · Nov 27 '25 02:11

@beastoin device

kodjima33 · Nov 27 '25 02:11

Give me the app version and the firmware version, please.

beastoin · Nov 27 '25 02:11

@beastoin 1.0.78 471

12 firmware

kodjima33 · Nov 27 '25 08:11

Hey, I've been digging into the "poor diarization" and "missing audio" issues (re: Issue #2806). I believe the current reliability issues stem from two linked problems in the pipeline, and I'd like to propose a fix I'm working on:

1. The Root Cause of "Missing Audio" (VAD Gating)

The current VAD/silence thresholds (likely the Silero default of 0.5) are too aggressive for wearable audio, where the distance to the mic varies a lot. As a result, the system treats quiet but valid speech (like a friend across the table) as silence, dropping ~80% of some conversations.

  • Fix: Decouple VAD from the recording trigger. Implement a "soft" VAD with a lower threshold (0.3) and increased speech_pad_ms (500ms+) to capture the full context, not just loud peaks.
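The effect of the lower threshold and the padding can be sketched without any model at all. The snippet below is a minimal illustration, not omi's pipeline: it assumes per-frame speech probabilities already exist (from Silero or any other VAD), and the function name `soft_vad_segments` and the `frame_ms` parameter are hypothetical. It keeps every frame at or above the threshold, pads it on both sides, and merges overlapping spans.

```python
from typing import List, Tuple


def soft_vad_segments(
    probs: List[float],
    frame_ms: int = 32,
    threshold: float = 0.3,
    speech_pad_ms: int = 500,
) -> List[Tuple[int, int]]:
    """Return merged (start_ms, end_ms) speech spans, padded on both sides.

    `probs` is assumed to hold one speech probability per fixed-size frame.
    """
    total_ms = len(probs) * frame_ms
    segments: List[Tuple[int, int]] = []
    for i, p in enumerate(probs):
        if p < threshold:
            continue  # frame treated as silence
        # Pad the frame so quiet onsets and tails are kept as context.
        start = max(0, i * frame_ms - speech_pad_ms)
        end = min(total_ms, (i + 1) * frame_ms + speech_pad_ms)
        if segments and start <= segments[-1][1]:
            # Overlaps or touches the previous padded span: extend it.
            segments[-1] = (segments[-1][0], max(segments[-1][1], end))
        else:
            segments.append((start, end))
    return segments


# Quiet speech (p = 0.35) followed later by loud speech (p = 0.9).
probs = [0.1] * 10 + [0.35] * 5 + [0.1] * 40 + [0.9] * 3 + [0.1] * 10
soft = soft_vad_segments(probs, frame_ms=100, threshold=0.3)
strict = soft_vad_segments(probs, frame_ms=100, threshold=0.5)
```

With the 0.5 threshold the quiet stretch disappears entirely, while the 0.3 threshold keeps both spans. This is only the segment-selection logic; decoupling it from the recording trigger would still have to happen in the capture path.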

2. State-of-the-Art Diarization (Pyannote)

Deepgram's streaming diarization struggles with single-channel wearable audio.

  • Fix: Switch to a Global Clustering approach using pyannote-audio. Instead of diarizing chunk-by-chunk (which forgets context), we generate embeddings for the whole session and cluster them. This fixes the "Speaker 0 / Speaker 1" flip-flopping.
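To make the "global clustering" idea concrete, here is a small sketch in plain Python. It does not call pyannote-audio; it assumes each speech segment of the session already has a speaker embedding from some model, and clusters all of them at once with average-linkage agglomerative clustering on cosine similarity. The function name `cluster_session` and the `min_similarity` value are hypothetical and would need tuning on real data.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def cluster_session(
    embeddings: Sequence[Sequence[float]], min_similarity: float = 0.7
) -> List[int]:
    """Assign one speaker label per segment, using the whole session at once."""
    clusters = [[i] for i in range(len(embeddings))]

    def linkage(c1: List[int], c2: List[int]) -> float:
        # Average pairwise similarity between two clusters.
        sims = [cosine(embeddings[i], embeddings[j]) for i in c1 for j in c2]
        return sum(sims) / len(sims)

    while len(clusters) > 1:
        best, pair = -2.0, (0, 1)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                s = linkage(clusters[a], clusters[b])
                if s > best:
                    best, pair = s, (a, b)
        if best < min_similarity:
            break  # remaining clusters are treated as different speakers
        a, b = pair
        clusters[a].extend(clusters[b])
        del clusters[b]

    # Number speakers by order of first appearance so labels stay stable.
    clusters.sort(key=min)
    labels = [0] * len(embeddings)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


# Toy 2-D "embeddings" for five segments from two alternating speakers.
session = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9], [1.0, 0.05]]
labels = cluster_session(session)
```

Because every segment is compared against the whole session, a speaker who returns after a long gap gets the same label as before, which is the property chunk-by-chunk diarization lacks. The quadratic search is fine for a sketch but would need a proper clustering library for long sessions.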

3. "Granola-Tier" Smart Titles

To solve the generic "Conversation #24" titles, I propose a post-processing step:

  • Pass the first 120s of the diarized transcript to the LLM with a specific prompt: "Extract distinct speaker names and rename this file 'Meeting with NAME'."
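A rough sketch of the prompt-building half of that step is below. The segment shape (`speaker`, `start`, `text`) and the function name `build_title_prompt` are assumptions for illustration, and the LLM call itself is left out since it depends on the provider.

```python
from typing import Dict, List, Union

Segment = Dict[str, Union[str, float]]

# Prompt wording taken from the proposal above.
PROMPT_TEMPLATE = (
    "Extract distinct speaker names and rename this file "
    "'Meeting with NAME'.\n\nTranscript:\n{transcript}"
)


def build_title_prompt(segments: List[Segment], window_s: float = 120.0) -> str:
    """Build the title prompt from segments that start within the window."""
    lines = [
        f"[{seg['speaker']}] {seg['text']}"
        for seg in segments
        if float(seg["start"]) < window_s
    ]
    return PROMPT_TEMPLATE.format(transcript="\n".join(lines))


# Hypothetical diarized transcript; start times are in seconds.
transcript = [
    {"speaker": "Speaker 0", "start": 0.0, "text": "Hi, I'm Nik."},
    {"speaker": "Speaker 1", "start": 60.0, "text": "Nice to meet you, Nik."},
    {"speaker": "Speaker 0", "start": 130.0, "text": "Let's wrap up."},
]
prompt = build_title_prompt(transcript)
```

Names often come up in introductions, which is presumably why the first 120s is proposed; if a name only appears later, this window would miss it, so the cutoff is worth treating as a tunable parameter.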

M-SRIKAR-VARDHAN · Nov 27 '25 08:11