STT: Real-Time Audio Processing with Chunking and Voice Activity Detection (VAD)
Add first-class support for real-time transcription, consisting of:
- Audio I/O utilities (load_audio, load_audio_chunk)
- Streaming / buffer management (OnlineASRProcessor or equivalent)
- Voice-Activity Detection integration (VACOnlineASRProcessor)
The goal is a seamless pipeline that can accept live (or pseudo-streamed) audio, detect speech segments on the fly, and hand them to an ASR backend for transcription.
1. Audio utilities

```python
import librosa
import numpy as np

# Done ✅
def load_audio(path: str, sr: int = 16_000) -> np.ndarray:
    """Load an audio file as a mono float32 waveform resampled to `sr` Hz."""
    audio, _ = librosa.load(path, sr=sr, mono=True)
    return audio.astype(np.float32)

def load_audio_chunk(audio: np.ndarray,
                     start_s: float,
                     end_s: float,
                     sr: int = 16_000) -> np.ndarray:
    """Return samples between `start_s` and `end_s` seconds."""
    return audio[int(start_s * sr) : int(end_s * sr)]
```
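The slicing arithmetic can be sanity-checked with a synthetic signal whose samples are their own indices (the function is repeated here so the snippet runs standalone):

```python
import numpy as np

def load_audio_chunk(audio, start_s, end_s, sr=16_000):
    """Repeated from above so this snippet is self-contained."""
    return audio[int(start_s * sr) : int(end_s * sr)]

sr = 16_000
# 2 s of "audio" where sample i has value i, so positions are easy to verify.
audio = np.arange(2 * sr, dtype=np.float32)

chunk = load_audio_chunk(audio, 0.5, 1.5, sr)
print(chunk.size)     # 16_000 samples = exactly one second
print(int(chunk[0]))  # 8_000 – the first sample of second 0.5
```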
2. Real-time buffer manager

```python
class RealTimeProcessor:
    """Ring-buffer that always keeps ≤ `buffer_limit` seconds."""

    def __init__(self, sr: int = 16_000, buffer_limit: float = 15.0):
        self.sr = sr
        self.buffer_limit = buffer_limit
        self.buf = np.empty(0, dtype=np.float32)

    def add(self, chunk: np.ndarray) -> None:
        self.buf = np.concatenate([self.buf, chunk])
        max_samples = int(self.buffer_limit * self.sr)
        if self.buf.size > max_samples:
            self.buf = self.buf[-max_samples:]  # drop the oldest samples

    def get(self) -> np.ndarray:
        """Return the current buffer (feed to ASR)."""
        return self.buf
```
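The trimming behaviour is easy to verify with toy numbers: with a 2-second limit at a sample rate of 4 Hz, pushing three one-second chunks should keep only the last two (the class is repeated so the snippet runs standalone):

```python
import numpy as np

class RealTimeProcessor:
    """Repeated from above so this snippet is self-contained."""

    def __init__(self, sr=16_000, buffer_limit=15.0):
        self.sr = sr
        self.buffer_limit = buffer_limit
        self.buf = np.empty(0, dtype=np.float32)

    def add(self, chunk):
        self.buf = np.concatenate([self.buf, chunk])
        max_samples = int(self.buffer_limit * self.sr)
        if self.buf.size > max_samples:
            self.buf = self.buf[-max_samples:]

    def get(self):
        return self.buf

# Tiny sample rate so the arrays stay readable: limit = 2 s × 4 Hz = 8 samples.
proc = RealTimeProcessor(sr=4, buffer_limit=2.0)
for value in (1.0, 2.0, 3.0):
    proc.add(np.full(4, value, dtype=np.float32))  # one "second" per chunk

print(proc.get())  # only the chunks labelled 2.0 and 3.0 survive
```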
3. Voice-activity detection (Silero VAD)

```python
import torch

class VAD:
    def __init__(self, threshold: float = 0.5):
        self.model, _ = torch.hub.load('snakers4/silero-vad', 'silero_vad')
        self.threshold = threshold

    def is_speech(self, chunk: np.ndarray, sr: int = 16_000) -> bool:
        """True if the speech probability for `chunk` exceeds `threshold`."""
        # The model returns a probability in [0, 1]; note that recent Silero
        # releases expect fixed-size chunks (e.g. 512 samples at 16 kHz).
        prob = self.model(torch.from_numpy(chunk.astype(np.float32)), sr).item()
        return prob > self.threshold
```
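For offline experiments where downloading the Silero model is inconvenient, a crude energy-based stand-in exposes the same `is_speech`-style decision. This is an assumption for testing only, not part of the plan above, and it is only useful on clean, quiet recordings:

```python
import numpy as np

def energy_is_speech(chunk: np.ndarray, threshold: float = 0.01) -> bool:
    """Crude VAD stand-in: flag a chunk as speech when its RMS energy
    exceeds `threshold`. No model download required."""
    rms = float(np.sqrt(np.mean(np.square(chunk, dtype=np.float64))))
    return rms > threshold

silence = np.zeros(16_000, dtype=np.float32)
# A 440 Hz tone at amplitude 0.1 has RMS ≈ 0.07, well above the threshold.
t = np.arange(16_000) / 16_000
tone = (0.1 * np.sin(2 * np.pi * 440 * t)).astype(np.float32)

print(energy_is_speech(silence))  # False
print(energy_is_speech(tone))     # True
```

Swapping this in for the Silero model keeps the rest of the pipeline unchanged, which is one reason to keep the VAD behind a single-method interface.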
4. Orchestration & example run

```python
class TranscriptionPipeline:
    """Glue VAD + rolling buffer + (placeholder) ASR."""

    def __init__(self, model_repo=None, sr: int = 16_000, buffer_limit: float = 15.0):
        self.proc = RealTimeProcessor(sr, buffer_limit)
        self.vad = VAD()
        # Placeholder – load a real ASR backend (Whisper, Parakeet, …)
        # from `model_repo` here.
        self.asr = model_repo

    def on_chunk(self, chunk: np.ndarray) -> None:
        if self.vad.is_speech(chunk):
            self.proc.add(chunk)
            # ⬇️ Replace the print with a call to Whisper, Parakeet, etc.
            print(f"Transcribing {self.proc.get().size} samples …")
        else:
            print("⏸️ silence")
```

```python
# ----- demo with 1-second slices ------------------------------------------
audio_file = "example_audio.wav"
sr = 16_000

audio = load_audio(audio_file, sr)
pipeline = TranscriptionPipeline(sr=sr)

for t in np.arange(0, len(audio) / sr, 1.0):  # 1-second steps
    chunk = load_audio_chunk(audio, t, t + 1.0, sr)
    pipeline.on_chunk(chunk)
```
Next steps
- Plug in a real ASR backend: swap the `print` call for `whisper_model.transcribe(pipeline.proc.get())`.
- Live audio: feed microphone or websocket frames into `on_chunk` instead of slicing a file.
- Tune VAD: expose model thresholds or use a custom-trained detector for noisy environments.
This is the minimal, dependency-light foundation you can evolve into a full real-time transcription service.
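For the live-audio next step, frames arriving from a microphone or websocket are commonly raw little-endian 16-bit PCM bytes. A minimal sketch of converting such a frame into the float32 range the pipeline expects (the frame format and the helper name `pcm16_bytes_to_float32` are assumptions, not part of the plan above):

```python
import numpy as np

def pcm16_bytes_to_float32(frame: bytes) -> np.ndarray:
    """Convert little-endian 16-bit PCM bytes (a common wire format for
    streamed audio) into float32 samples in [-1, 1)."""
    ints = np.frombuffer(frame, dtype='<i2')
    return ints.astype(np.float32) / 32768.0

# A fake 4-sample frame: full-scale negative, zero, half-scale, near full-scale.
frame = np.array([-32768, 0, 16384, 32767], dtype='<i2').tobytes()
samples = pcm16_bytes_to_float32(frame)
# First three samples are exactly -1.0, 0.0, 0.5; the last is just under 1.0.
print(samples)
```

Each decoded frame can then be passed straight to `pipeline.on_chunk(samples)`.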
I see the MLX version of JARVIS is coming to life! How about the name M.A.R.V.I.S., also inspired by J.O.S.I.E. from @Goekdeniz-Guelmez?
Thank you very much @lin72h, I love that idea! ❤️
If JARVIS stands for: Just a Rather Very Intelligent System
I'm thinking MARVIS can stand for either:
- Maybe A Really Very Intelligent Sidekick
- Modular Autonomous Reasoning Versatile Intelligent System
- Merely A Rather Very Intelligent System
What do you think?
How about My Awesome Realtime Vision Intelligence System, so that in the coming 2.0 you can integrate your mlx-vlm into the system to have true Vision, and Vision is also JARVIS's 2nd name :-)
In Chinese, 马维斯 (Mǎ Wéisī) echoes both 马斯克 (Musk) and 贾维斯 (Jarvis). Since Elon Musk is often likened to a real-life Tony Stark, the name feels spot-on.
You have a point, I love it! 👀
mlx-vlm is already a day-0 dependency, but the Vision integration part will come in v0.3.0 at the end of the month.
It can alternate between:
My Awesome Real-time Vision Intelligent System / Very Intelligent System
Depends on whether you are using vision or llm 😎
I can't wait to introduce MARVIS 🙌🏽🔥
Depends on you, I like both! 🙌🏽
Awesome, thank you very much!❤️
MARVIS is on the way