Local storage
We all know we need local transcription.
Both Vitalik Buterin and George Hotz said so when trying out our tech.
Creating this issue to aggregate feedback and prepare for a gradual switch.
We need to make omi FULLY LOCAL: fully local transcription, with all personal information stored locally on the phone.
Might be in React Native (I don't care about the stack)
Bounty is $20k
I will award it to whoever shows the best MVP
I agree, and Whisper is the way to go. Its on-device performance (I've only used it on an iPhone 12 mini, to build something simple) is truly incredible. It should be downloaded on demand, because including it in the bundle would be a terrible idea.
Tell me more about your experience with Whisper + iPhone 12 mini please @Ronuhz, such as transcript quality, speed, and battery drain.
Here is a little demo running on an iPhone 12 mini, iOS 18.2 Beta 3, model: Whisper Tiny using the Neural Engine for both decoding and encoding. The voice is streamed to the model in real time. Everything runs locally.
https://github.com/user-attachments/assets/8e7caf02-67c7-40d0-9217-2b8f7b2d090f
In a native app using Swift and SwiftUI it takes about 10-20 minutes to get this implemented with WhisperKit. In Flutter, I don't know.
Found this with a quick search, but it does not support real-time transcription:
https://pub.dev/packages/whisper_flutter_plus
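To make the "voice is streamed to the model in real time" part concrete, here is a minimal sketch of how a streaming pipeline typically buffers microphone audio into overlapping windows before each inference pass. This is not WhisperKit's actual API; the constants and the idea of feeding a sliding window to the model are illustrative assumptions.

```python
# Sketch of a streaming buffer for real-time transcription (assumptions:
# 16 kHz mono audio; in a real app each yielded window would be passed to
# an on-device model such as Whisper Tiny via WhisperKit).

SAMPLE_RATE = 16_000
WINDOW_SEC = 5.0      # audio fed to the model per inference
HOP_SEC = 1.0         # how often inference is re-run

def stream_windows(samples, window_sec=WINDOW_SEC, hop_sec=HOP_SEC,
                   rate=SAMPLE_RATE):
    """Yield overlapping windows of samples, mimicking a live mic feed."""
    window = int(window_sec * rate)
    hop = int(hop_sec * rate)
    for start in range(0, max(len(samples) - window + 1, 1), hop):
        yield samples[start:start + window]

# Example with dummy audio: 8 s of silence -> 5 s windows every 1 s.
audio = [0.0] * (8 * SAMPLE_RATE)
windows = list(stream_windows(audio))
print(len(windows))                    # -> 4 (starts at 0 s, 1 s, 2 s, 3 s)
print(len(windows[0]) / SAMPLE_RATE)   # -> 5.0
```

The overlap is what lets the UI update every second while the model still sees enough context per call; the real window/hop sizes would be tuned per device.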
ok let's make this happen
We need to make omi FULLY LOCAL - fully local transcription
Might be in React Native (I don't care about the stack)
Bounty is $20k
I will award it to whoever shows the best MVP /bounty $20000
💎 $20,000 bounty • omi
Steps to solve:
- Start working: Comment `/attempt #1249` with your implementation plan
- Submit work: Create a pull request including `/claim #1249` in the PR body to claim the bounty
- Receive payment: 100% of the bounty is received 2-5 days post-reward. Make sure you are eligible for payouts
Thank you for contributing to BasedHardware/omi!
| Attempt | Started (GMT+0) | Solution |
|---|---|---|
| 🟢 @yuvrajjsingh0 | Feb 15, 2025, 10:06:15 PM | WIP |
| 🟢 @Ritesh2351235 | Feb 17, 2025, 4:53:22 AM | WIP |
| 🟢 @skywinder | Mar 12, 2025, 9:00:47 PM | WIP |
@kodjima33 how can I get my hands on the omi hardware?
/attempt #1249 Hi, if we are doing it on device, I'd suggest using the device's default speech-to-text functionality, since it is hardware-accelerated and optimized for that device. It's available on both iOS and Android, and it can run in real time, so I will use the device's speech-to-text. Using Whisper is fine, but Whisper is a large transformer model that can bloat the application, and running it on low-end devices will make the app suffer crashes. I have previously worked on integrating Tesseract natively on Android, and from that experience I can say that running Whisper locally will only work well on high-end devices. @kodjima33 Here's a sample app I created in Flutter and its demo on iOS: https://github.com/user-attachments/assets/6511fc7a-7c15-433e-a8c5-79870658e270
| Algora profile | Completed bounties | Tech | Active attempts | Options |
|---|---|---|---|---|
| @yuvrajjsingh0 | 1 bounty from 1 project | PureBasic | | Cancel attempt |
@yuvrajjsingh0 The problem with using the platform's own STT is that you won't have speaker separation. Whisper Tiny needs less than a GB of VRAM and storage. It should be downloaded on demand and NOT included in the bundle. It can run on the ANE on Apple devices at least; sadly I can't speak for Android, because it's not my area of expertise.
@kodjima33 Okay, if we want to use Whisper, do we need this transcription in real time, or will we be running it on saved audio?
There is also the option of running a speaker-recognition model on the audio to tell us who is speaking in which timeframe, and then using STT to transcribe each part.
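The diarize-then-transcribe split described above can be sketched as a small pipeline. Both `segments` (which would come from a speaker-recognition model) and `transcribe_segment` (a stand-in for an on-device STT call) are hypothetical names for illustration, not any real library's API.

```python
# Sketch: attribute transcribed text to speakers per timeframe.
# `transcribe_segment` is a placeholder; a real app would run Whisper or
# the platform STT on the audio slice for [start, end].

def transcribe_segment(audio, start, end):
    # Dummy transcription; real code would slice `audio` and run a model.
    return f"<text {start:.1f}-{end:.1f}s>"

def labeled_transcript(audio, segments):
    """segments: list of (speaker_id, start_sec, end_sec), sorted by start."""
    lines = []
    for speaker, start, end in segments:
        text = transcribe_segment(audio, start, end)
        lines.append(f"[{speaker}] {start:.1f}-{end:.1f}: {text}")
    return "\n".join(lines)

audio = []  # stand-in for raw samples
segments = [("speaker_0", 0.0, 3.2), ("speaker_1", 3.2, 5.0)]
print(labeled_transcript(audio, segments))
```

The point of the structure is that diarization and STT stay independent: either half can be swapped (platform STT, Whisper, VOSK) without touching the other.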
/attempt #1249 Hey @kodjima33, here is my take on the local transcription for Omi.
Why Whisper Tiny? Mobile-first: Tiny (39M params) is built for edge devices. I ran tests on an iPhone 11: ~150-300 ms per audio chunk, no server calls. For Android, TFLite/MediaPipe can handle it, though we'll need to optimize GPU delegation for weaker devices.
ANE on iOS: WhisperKit (Swift) taps into Apple's Neural Engine. Battery drain is minimal compared to CPU-only inference. Demo: got it working in a test app with real-time streaming.
Supports Multiple Languages.
Avoid app bloat: ship the model (~150MB) from a CDN (Hugging Face Hub?) post-install. No need to bake it into the bundle.
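The download-on-demand idea above is straightforward to sketch: fetch once, verify integrity, cache on disk. The URL and checksum here are placeholders (no real artifact is referenced), and only the standard library is used.

```python
# Sketch of download-on-demand with an integrity check. The URL/hash a
# real app would use are placeholders here; only the hashing helper is
# exercised below, since it needs no network access.
import hashlib
import os
import urllib.request

def sha256_of(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def ensure_model(url: str, dest: str, expected_sha256: str) -> str:
    """Download the model once, verify its hash, then reuse the cached file."""
    if not os.path.exists(dest):
        data = urllib.request.urlopen(url).read()
        if sha256_of(data) != expected_sha256:
            raise ValueError("model download failed integrity check")
        with open(dest, "wb") as f:
            f.write(data)
    return dest

# The hash helper alone can be checked offline:
print(sha256_of(b"hello")[:8])  # -> 2cf24dba
```

Pinning a checksum matters more than usual here: the model is executable-adjacent content fetched post-install, so a corrupted or tampered download should fail loudly rather than load.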
Alternatives I tested (and why they suck):
Platform STT (Android/iOS APIs): Pros: zero latency, free. Cons: no speaker diarization, struggles with accents/background noise. I tried it; accuracy tanks in noisy environments.
Distil-Whisper/Hugging Face models: Smaller, but multilingual support is spotty. Whisper Tiny handles 100+ languages out of the box.
Larger Whisper models (Base/Medium): overkill. Medium needs ~5GB of RAM; not happening on phones.
Implementation Plan. iOS: use WhisperKit (Swift) for ANE-accelerated inference. Wrote a PoC; it's ~20 lines of Swift to hook into mic input and stream to the model.
Android: Option A: MediaPipe’s TFLite build (C++ → Kotlin/JNI).
Option B: Transformers Android (Java), but might need model quantization.
Speaker Diarization Hack: Whisper doesn't do this natively. Workaround: add Silero VAD to detect pauses/speaker changes. Not perfect, but it gets us 80% of the way there without cloud calls.
Running Whisper Tiny on device is possible. The trade-offs are a slightly bigger app size after the model download and some tweaks needed for speaker identification, but it's worth it for better privacy and lower server costs.
@Ronuhz , I saw that you're working on Whisper Tiny. Let me know if you're open to collaborating on this.
What about using https://github.com/mediar-ai/screenpipe/tree/main/screenpipe-audio ?
It's pure Rust, meaning you can make it mobile-friendly easily.
Hell yeah, I'm totally on the side of Vitalik Buterin and George Hotz (really, they use it?). I'm even more impressed! 🤩
Regarding Whisper: I built a native app for iOS and it runs very fast. While it has multilingual capabilities, there are still challenges with certain language combinations and with real-time processing for some use cases.
Alternative solution: VOSK
- An offline speech recognition toolkit based on Kaldi
- Supports multiple languages (which would help with #1892)
- Designed for quick integration into applications
- Works completely offline, perfect for privacy-focused applications 🔒
One thing I wish to add: a local deployment of Whisper on an iPhone or MacBook doesn't meet the product's efficiency standard, even though you may be able to get it up and running.
I'm so looking forward to finally be able to use my DevKit 2, since the current model sucks a** when it comes to Swedish. Good job!
Hey guys, maybe we should take a small pause here, because we might move to React Native.
> I'm so looking forward to finally be able to use my DevKit 2, since the current model sucks a** when it comes to Swedish. Good job!

Same with Hungarian.
This thing is so promising, I can't wait for language support. <3
(Note from a product/UX perspective: while many early adopters speak English, broad language support seems especially important for this product. If a conversation happens in an unsupported language, I can't expect participants to switch just for the sake of omi (even if they all speak English). Without broad language support, this product risks becoming more of an AI-assisted "note to self" voice recorder for unsupported-language audiences.)
I work for Banafo; we do on-device STT. Would love to partner up. Demo here: https://huggingface.co/spaces/Banafo/Kroko-Streaming-ASR-Wasm (a Python example and the weights are linked somewhere there).
Encryption is dramatically different from on-device processing, and on-device processing is infinitely preferable. Of course, that might interfere with your subscription model... but perhaps not by much. People who demand extremely high, instant performance could opt into the subscription, while anyone who wants to optimize for security and safety could opt to transcribe only locally, which might slow down transcription/summary times and increase battery drain (on mobile).