
Local storage

Open kodjima33 opened this issue 1 year ago • 20 comments

We all know that we need local transcription.

Both Vitalik Buterin and George Hotz said it when trying out our tech.

Creating this issue to aggregate feedback and prepare for the switch gradually

We need to make omi FULLY LOCAL - fully local transcription, with all personal information stored locally on the phone

Might be in React Native (I don't care about the stack)

Bounty is $20k

I will lock it in for whoever shows the best MVP

kodjima33 avatar Nov 04 '24 09:11 kodjima33

I agree, and Whisper is the way to go. Its on-device performance (I only used it on an iPhone 12 mini, to develop something simple) is truly incredible. It should be downloaded on demand, since including it in the bundle would be a terrible idea.

Ronuhz avatar Nov 05 '24 16:11 Ronuhz

tell me more about your experience with Whisper + iPhone 12 mini pls @Ronuhz, such as transcript quality, speed, and battery drain.

beastoin avatar Nov 12 '24 04:11 beastoin

> tell me more about your experience with Whisper + iPhone 12 mini pls @Ronuhz, such as transcript quality, speed, and battery drain.

Here is a little demo running on an iPhone 12 mini, iOS 18.2 Beta 3, model: Whisper Tiny, using the Neural Engine for both decoding and encoding. The voice is streamed to the model in real time. Everything runs locally.

https://github.com/user-attachments/assets/8e7caf02-67c7-40d0-9217-2b8f7b2d090f

Ronuhz avatar Nov 12 '24 12:11 Ronuhz

In a native app using Swift and SwiftUI it takes about 10-20 minutes to get this implemented using WhisperKit. In Flutter I don't know.

Ronuhz avatar Nov 12 '24 12:11 Ronuhz

Found this with a quick search, but it does not support transcribing in real time:

https://pub.dev/packages/whisper_flutter_plus
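A common workaround for packages that only transcribe finished files is pseudo-streaming: buffer the mic into short overlapping windows and transcribe each window as it fills. A minimal sketch of just the windowing logic (the transcribe call itself is a hypothetical stand-in for whatever package is used):

```python
from typing import Iterator, List

def sliding_windows(samples: List[float], window: int, hop: int) -> Iterator[List[float]]:
    """Yield overlapping windows of audio samples for chunked transcription.

    window: samples per chunk (e.g. 5 s * 16000 Hz)
    hop: samples to advance between chunks; window - hop is the overlap
    that lets consecutive transcripts be stitched without dropping words.
    """
    start = 0
    while start + window <= len(samples):
        yield samples[start:start + window]
        start += hop
    # flush the trailing partial window, if any
    if start < len(samples):
        yield samples[start:]

# Example: 10 "seconds" of fake audio at 4 samples/sec, 2 s windows, 1 s hop
audio = list(range(40))
chunks = list(sliding_windows(audio, window=8, hop=4))
# each chunk would then go to e.g. whisper_transcribe(chunk)  (hypothetical)
```

Stitching the per-chunk transcripts back together (deduplicating the overlap) is the hard part, which is why first-class streaming support in the package itself is preferable.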

mdmohsin7 avatar Nov 15 '24 04:11 mdmohsin7

ok let's make this happen

We need to make omi FULLY LOCAL - fully local transcription

Might be in React Native (I don't care about the stack)

Bounty is $20k

I will lock it on whoever will show the best MVP /bounty $20000

kodjima33 avatar Feb 14 '25 09:02 kodjima33

💎 $20,000 bounty • omi

Steps to solve:

  1. Start working: Comment /attempt #1249 with your implementation plan
  2. Submit work: Create a pull request including /claim #1249 in the PR body to claim the bounty
  3. Receive payment: 100% of the bounty is received 2-5 days post-reward. Make sure you are eligible for payouts

Thank you for contributing to BasedHardware/omi!


| Attempt | Started (GMT+0) | Solution |
| --- | --- | --- |
| 🟢 @yuvrajjsingh0 | Feb 15, 2025, 10:06:15 PM | WIP |
| 🟢 @Ritesh2351235 | Feb 17, 2025, 4:53:22 AM | WIP |
| 🟢 @skywinder | Mar 12, 2025, 9:00:47 PM | WIP |

algora-pbc[bot] avatar Feb 14 '25 09:02 algora-pbc[bot]

@kodjima33 how can I get my hands on the omi hardware?

ayewo avatar Feb 14 '25 09:02 ayewo

/attempt #1249

Hi, if we are doing it on-device, I'd suggest using the device's default speech-to-text functionality, as that is hardware-accelerated and optimized for the device. It's available for both iOS and Android, and it can run in real time, so I would make use of the device's speech-to-text.

Using Whisper is fine, but Whisper is an LLM-based model and is really big, which can bloat the application, and using it on low-end devices will make the app suffer crashes. I have previously worked on integrating Tesseract natively on Android devices, and from that experience I can say that using Whisper locally is never an option, as it will only work well on high-end devices.

@kodjima33 Here's a sample app I created in Flutter and its demo on iOS: https://github.com/user-attachments/assets/6511fc7a-7c15-433e-a8c5-79870658e270


yuvrajjsingh0 avatar Feb 15 '25 22:02 yuvrajjsingh0

@yuvrajjsingh0 The problem with using the platform's own STT is that you won't have speaker separation. Whisper Tiny needs less than a GB of VRAM and storage. It should be downloaded on-demand and NOT be included in the bundle. It can be run on the ANE on Apple devices at least; sadly, I can't speak to Android because it's not my area of expertise.

Ronuhz avatar Feb 16 '25 08:02 Ronuhz

@kodjima33 Okay, if we want to use Whisper, do we need this transcription in real time? Or will we be doing it on saved audio?

There is also the option of running a speaker recognition model on the audio, which would tell us who is speaking in which timeframe, and then using STT to transcribe it.
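That two-model pipeline (speaker recognition for "who", STT for "what") needs a merge step: assign each transcript segment to whichever speaker turn it overlaps most in time. A minimal sketch, with hypothetical timings in seconds and both input lists assumed to come from the respective models:

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length of the intersection of two time intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def label_segments(transcript, turns):
    """Attach a speaker to each STT segment.

    transcript: list of (start, end, text) from the STT model
    turns: list of (start, end, speaker) from the speaker-recognition model
    Returns (speaker, text) pairs; "unknown" if no turn overlaps.
    """
    labeled = []
    for t_start, t_end, text in transcript:
        best, best_ov = "unknown", 0.0
        for s_start, s_end, speaker in turns:
            ov = overlap(t_start, t_end, s_start, s_end)
            if ov > best_ov:
                best, best_ov = speaker, ov
        labeled.append((best, text))
    return labeled

transcript = [(0.0, 2.0, "hey there"), (2.5, 4.0, "hi, how are you")]
turns = [(0.0, 2.2, "A"), (2.2, 4.5, "B")]
# label_segments(transcript, turns)
# → [("A", "hey there"), ("B", "hi, how are you")]
```

The nice property of this split is that the STT model and the diarization model can run independently (even at different cadences) and be joined purely on timestamps.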

yuvrajjsingh0 avatar Feb 16 '25 09:02 yuvrajjsingh0

/attempt #1249 Hey @kodjima33, here is my take on the local transcription for Omi.

Why Whisper Tiny? Mobile-first: Tiny (39M params) is built for edge devices. I ran tests on an iPhone 11: ~150-300ms per audio chunk, no server calls. For Android, TFLite/MediaPipe can handle it, though we’ll need to optimize GPU delegation for weaker devices.

ANE on iOS: WhisperKit (Swift) taps into Apple’s Neural Engine. Battery drain is minimal compared to CPU-only inference. Demo here—got it working in a test app with real-time streaming.

Supports Multiple Languages.

Avoid app bloat: Ship the model (~150MB) via CDN (Hugging Face Hub?) post-install. No need to bake it into the bundle.
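The post-install download idea can be sketched as a cache-or-fetch helper. This is a minimal sketch, not omi's implementation: the URL is a placeholder, and the checksum constant would be set to the published digest in a real build:

```python
import hashlib
import urllib.request
from pathlib import Path

# Hypothetical model location; a real app would point at its own CDN
# or the Hugging Face Hub. The checksum guards against partial downloads.
MODEL_URL = "https://example.com/models/whisper-tiny.bin"  # placeholder
MODEL_SHA256 = None  # set to the published digest in a real build

def ensure_model(cache_dir, url=MODEL_URL, sha256=MODEL_SHA256):
    """Return the local model path, downloading it on first use only."""
    path = Path(cache_dir) / Path(url).name
    if path.exists():
        return path  # cache hit: nothing to download
    path.parent.mkdir(parents=True, exist_ok=True)
    tmp = path.with_suffix(".part")
    urllib.request.urlretrieve(url, tmp)  # network fetch on cache miss
    if sha256 is not None:
        digest = hashlib.sha256(tmp.read_bytes()).hexdigest()
        if digest != sha256:
            tmp.unlink()
            raise IOError("model checksum mismatch, retry download")
    tmp.rename(path)  # publish only the complete file, never a partial one
    return path
```

On iOS the same idea maps onto WhisperKit's own model-download support; the point is just that the weights never ship inside the app bundle.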

Alternatives I tested (and why they suck):

Platform STT (Android/iOS APIs): Pros: Zero latency, free. Cons: No speaker diarization, struggles with accents/background noise. Tried it—accuracy tanks in noisy environments.

Distil-Whisper/Hugging Face models: Smaller, but multilingual support is spotty. Whisper Tiny handles 100+ languages out of the box.

Larger Whisper models (Base/Medium): Overkill. Medium needs ~5GB RAM—not happening on phones.

Implementation Plan iOS: Use WhisperKit (Swift) for ANE-accelerated inference. Wrote a PoC—it’s ~20 lines of Swift to hook into mic input and stream to the model.

Android: Option A: MediaPipe’s TFLite build (C++ → Kotlin/JNI).

Option B: Transformers Android (Java), but might need model quantization.

Speaker Diarization Hack: Whisper doesn’t do this natively. Workaround: Add Silero VAD to detect pauses/speaker changes. Not perfect, but gets us 80% there without cloud calls.
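The pause-detection part of that hack can be sketched with a simple energy threshold standing in for Silero VAD (Silero uses a small neural net, but it exposes the same per-frame speech/non-speech decision, so the segmentation logic is identical):

```python
def split_on_pauses(frames, threshold=0.01, min_pause=3):
    """Split a stream of per-frame energies into speech segments.

    frames: iterable of frame energies (Silero VAD would emit speech
    probabilities instead; only the threshold's meaning changes).
    threshold: values below this count as silence.
    min_pause: consecutive silent frames needed to close a segment;
    each closed segment becomes a candidate speaker turn.
    Returns (first_frame, last_frame) index pairs.
    """
    segments, current, silence = [], [], 0
    for i, energy in enumerate(frames):
        if energy >= threshold:
            current.append(i)
            silence = 0
        elif current:
            silence += 1
            if silence >= min_pause:
                segments.append((current[0], current[-1]))
                current, silence = [], 0
    if current:  # close the final segment at end of stream
        segments.append((current[0], current[-1]))
    return segments

energies = [0.2, 0.3, 0.0, 0.0, 0.0, 0.4, 0.5, 0.4]
# two speech segments separated by a 3-frame pause:
# split_on_pauses(energies) → [(0, 1), (5, 7)]
```

As noted above, pause boundaries only approximate speaker changes (overlapping speech and quick turn-taking defeat it), which is the "80% there" caveat.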

Using Whisper Tiny on the device is possible. The trade-offs are a slightly bigger app size after downloading and some tweaks needed for speaker identification. But it's worth it for better privacy and lower server costs.

@Ronuhz , I saw that you're working on Whisper Tiny. Let me know if you're open to collaborating on this.


Ritesh2351235 avatar Feb 17 '25 04:02 Ritesh2351235

what about using https://github.com/mediar-ai/screenpipe/tree/main/screenpipe-audio

it's pure rust, meaning you can make it mobile friendly easily


louis030195 avatar Feb 17 '25 16:02 louis030195

Hell yeah, I'm totally on the side of Vitalik Buterin and George Hotz (really, they use it?). I'm even more impressed! 🤩

Regarding Whisper: I built a native app for iOS and it works very fast. While it has multilingual capabilities, there are still some challenges with certain language combinations and real-time processing for some use cases.

Alternative solution: VOSK

- An offline speech recognition toolkit based on Kaldi
- Supports multiple languages (which would be helpful for #1892)
- Designed for quick integration into applications
- Works completely offline, perfect for privacy-focused applications 🔒

skywinder avatar Mar 10 '25 12:03 skywinder

One thing I wish to add: local deployment of Whisper on an iPhone or MacBook doesn't meet the product standard for efficiency, though you may be able to get it up and running.

goodpeter-sun avatar Mar 11 '25 01:03 goodpeter-sun

I'm so looking forward to finally be able to use my DevKit 2, since the current model sucks a** when it comes to Swedish. Good job!

maxfahl avatar Mar 12 '25 13:03 maxfahl

hey guys, maybe we should take a small pause here, cuz we might move to React Native

kodjima33 avatar Mar 13 '25 01:03 kodjima33

> I'm so looking forward to finally be able to use my DevKit 2, since the current model sucks a** when it comes to Swedish. Good job!

same with Hungarian.

This thing is so promising, I can't wait for language support. <3

(Note from a product/UX perspective: while many early adopters speak English, broad language support seems especially important for this product. If a conversation happens in an unsupported language, I can't expect participants to switch just for the sake of omi (even if they all speak English). Without broad language support, this product risks becoming more of an AI-assisted "note to self" voice recorder for unsupported-language audiences.)

k0-ba avatar Mar 13 '25 20:03 k0-ba

I work for Banafo; we do on-device STT. Would love to partner up. Demo here: https://huggingface.co/spaces/Banafo/Kroko-Streaming-ASR-Wasm (a Python example and the weights are linked somewhere there).

joazoa avatar May 10 '25 08:05 joazoa

Encryption is dramatically different from on-device processing, with on-device processing being infinitely preferable. Of course, that might interfere with your subscription model... but perhaps not a ton. People who demand extremely high instant performance could opt to sign up for the subscription. Anyone who wanted to optimize for security and safety could opt to only transcribe locally, which might slow down transcription/summary times and increase battery drain (for mobile).

SyndicatedPillbug avatar Sep 30 '25 15:09 SyndicatedPillbug