unreliable recording and speaker diarization
I talked to 7 omi users (mobile), and the top issue I heard was poor diarization quality (speaker assignment) and reliability.
As I'm writing this, I just had a 20-minute conversation with someone, but only 2 minutes (!!!) were captured. I was the only one speaking in that conversation, and our app still didn't assign my name.
- [ ] Check why omi didn't capture 80% of this conversation - make sure I understand the problem and then fix it
- [ ] suggest a solution to Nik for speaker diarization to make it THE BEST IN THE DAMN WORLD, don't be lazy even if you need to invent something new
- [ ] make sure that ALMOST EVERY CONVERSATION TITLE CONTAINS NAMES and THEY ARE CORRECT - see granola reference below
do this today pls @beastoin
> 7 omi users (mobile) and top-1 issue heard was poor quality of diarization (speaker assignment) and reliability
I need all their contacts so I can talk to them; in particular, I need their uid to check the data.
@beastoin device
give me the app version and the firmware version, please
@beastoin 1.0.78 471
12 firmware
Hey, I've been digging into the "poor diarization" and "missing audio" issues (re: Issue #2806). I believe the current reliability issues stem from two linked problems in the pipeline, and I'd like to propose a fix I'm working on:
1. The Root Cause of "Missing Audio" (VAD Gating): The current VAD/silence thresholds (likely the Silero default of 0.5) are too aggressive for wearable audio, where the distance to the mic often varies. This causes the system to treat quiet but valid speech (like a friend across the table) as silence, dropping ~80% of some conversations.
- Fix: Decouple VAD from the recording trigger. Implement a "soft" VAD with a lower threshold (0.3) and an increased `speech_pad_ms` (500 ms+) to capture the full context, not just loud peaks.
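
A rough sketch of what the lower threshold plus padding does to the kept audio. This is not the omi pipeline code: `speech_segments` and `kept_ms` are hypothetical helpers, the per-frame speech probabilities are made up, and the 32 ms frame size is an assumption. It only illustrates the gating logic on a list of probabilities like the ones a VAD model would emit.

```python
from typing import List, Tuple


def speech_segments(
    probs: List[float],
    frame_ms: int = 32,        # assumed frame size, not taken from the omi pipeline
    threshold: float = 0.3,    # "soft" threshold proposed above
    speech_pad_ms: int = 500,  # padding added on both sides of each speech run
) -> List[Tuple[int, int]]:
    """Return merged (start_ms, end_ms) spans where probs >= threshold, padded."""
    total_ms = len(probs) * frame_ms
    raw: List[Tuple[int, int]] = []
    start = None
    for i, p in enumerate(probs):
        if p >= threshold and start is None:
            start = i                      # a speech run begins
        elif p < threshold and start is not None:
            raw.append((start * frame_ms, i * frame_ms))
            start = None                   # the speech run ends
    if start is not None:
        raw.append((start * frame_ms, total_ms))

    merged: List[Tuple[int, int]] = []
    for s, e in raw:
        s = max(0, s - speech_pad_ms)
        e = min(total_ms, e + speech_pad_ms)
        if merged and s <= merged[-1][1]:
            # Padded spans overlap, so keep them as one continuous segment.
            merged[-1] = (merged[-1][0], max(merged[-1][1], e))
        else:
            merged.append((s, e))
    return merged


def kept_ms(segments: List[Tuple[int, int]]) -> int:
    """Total duration kept, in milliseconds."""
    return sum(e - s for s, e in segments)


# Made-up probabilities: silence, a quiet speaker (0.4), a pause, a loud speaker (0.8).
probs = [0.1] * 20 + [0.4] * 30 + [0.1] * 10 + [0.8] * 10 + [0.1] * 30
strict = speech_segments(probs, threshold=0.5, speech_pad_ms=30)
soft = speech_segments(probs, threshold=0.3, speech_pad_ms=500)
print(kept_ms(strict), kept_ms(soft))
```

With the strict settings only the loud speaker survives; with the soft settings the quiet speaker and the short pause are kept as one continuous span.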
2. State-of-the-Art Diarization (Pyannote): Deepgram's streaming diarization struggles with single-channel wearable audio.
- Fix: Switch to a global clustering approach using pyannote-audio. Instead of diarizing chunk by chunk (which forgets context), we generate embeddings for the whole session and cluster them. This fixes the "Speaker 0 / Speaker 1" flip-flopping.
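
To make the "cluster the whole session" idea concrete, here is a minimal sketch in plain Python. It assumes each transcript segment already has a speaker embedding (for example from a pyannote-audio embedding model; that step is not shown), and it uses a simple average-linkage agglomerative clustering with a cosine-distance cutoff. `cluster_session` and the `max_distance` value of 0.5 are hypothetical, and pyannote's own pipeline may cluster differently.

```python
import math
from typing import List, Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0 or nb == 0:
        return 1.0
    return 1.0 - dot / (na * nb)


def cluster_session(
    embeddings: List[Sequence[float]],
    max_distance: float = 0.5,  # hypothetical cutoff, would need tuning on real data
) -> List[int]:
    """Assign one speaker label per segment using all segments of the session."""
    clusters = [[i] for i in range(len(embeddings))]

    def linkage(c1: List[int], c2: List[int]) -> float:
        # Average pairwise distance between two clusters.
        total = sum(
            cosine_distance(embeddings[i], embeddings[j]) for i in c1 for j in c2
        )
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage(clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        if best[0] > max_distance:
            break  # the closest remaining clusters are different speakers
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]

    # Number speakers by order of first appearance so labels stay stable.
    clusters.sort(key=min)
    labels = [0] * len(embeddings)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


# Toy 2-D embeddings for five segments from two alternating speakers.
session = [[1.0, 0.1], [0.1, 1.0], [0.9, 0.2], [0.2, 0.9], [1.0, 0.0]]
print(cluster_session(session))
```

Because the labels come from one pass over the whole session, a speaker keeps the same label even when they return after a long gap, which is where chunk-by-chunk diarization tends to swap labels.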
3. "Granola-Tier" Smart Titles: To solve the generic "Conversation #24" titles, I propose a post-processing step:
- Pass the first 120s of the diarized transcript to the LLM with a specific prompt: "Extract distinct speaker names and rename this file 'Meeting with NAME'."
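
A small sketch of how the title prompt could be assembled. The `Segment` shape and `build_title_prompt` are hypothetical, not the actual omi transcript model, and the LLM call itself is left out because it depends on whichever client the backend already uses.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    start: float   # segment start time in seconds
    speaker: str   # diarization label, e.g. "Speaker 0"
    text: str


# Prompt wording taken from the proposal above.
TITLE_PROMPT = (
    "Extract distinct speaker names and rename this file 'Meeting with NAME'."
)


def build_title_prompt(segments: List[Segment], window_s: float = 120.0) -> str:
    """Build the LLM prompt from segments that start within the first window_s seconds."""
    lines = [
        f"[{seg.speaker}] {seg.text}" for seg in segments if seg.start < window_s
    ]
    return TITLE_PROMPT + "\n\nTranscript:\n" + "\n".join(lines)


segments = [
    Segment(0.0, "Speaker 0", "Hi Nik, thanks for joining."),
    Segment(60.0, "Speaker 1", "No problem, happy to help."),
    Segment(130.0, "Speaker 0", "Let's wrap up."),
]
print(build_title_prompt(segments))
```

Only the first two segments end up in the prompt here, since the third starts after the 120 s window.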