STT: Real-Time Audio Processing with Chunking and Voice Activity Detection (VAD)

Open Blaizzy opened this issue 9 months ago • 8 comments

Add first-class support for real-time transcription, consisting of:

  1. Audio I/O utilities (load_audio, load_audio_chunk)
  2. Streaming / buffer management (OnlineASRProcessor or equivalent)
  3. Voice-Activity Detection integration (VACOnlineASRProcessor)

The goal is a seamless pipeline that can accept live (or pseudo-streamed) audio, detect speech segments on the fly, and hand them to an ASR backend for transcription.


1. Audio utilities

import librosa
import numpy as np

# Done ✅
def load_audio(path: str, sr: int = 16_000) -> np.ndarray:
    """Load an audio file as mono float32, resampled to `sr` Hz."""
    audio, _ = librosa.load(path, sr=sr, mono=True)
    return audio

def load_audio_chunk(audio: np.ndarray,
                     start_s: float,
                     end_s: float,
                     sr: int = 16_000) -> np.ndarray:
    """Return samples between `start_s` and `end_s` seconds."""
    return audio[int(start_s * sr) : int(end_s * sr)]
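A quick sanity check of the slicing arithmetic, using a synthetic signal so no audio file (or librosa) is needed:

```python
import numpy as np

sr = 16_000
audio = np.arange(3 * sr, dtype=np.float32)   # 3 s of dummy samples

# equivalent to load_audio_chunk(audio, 1.0, 2.0, sr)
chunk = audio[int(1.0 * sr) : int(2.0 * sr)]
print(chunk.size)   # 16000 — exactly one second of samples
print(chunk[0])     # 16000.0 — the first sample after the 1-second mark
```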

2. Real-time buffer manager

class RealTimeProcessor:
    """Ring-buffer that always keeps ≤ `buffer_limit` seconds."""
    def __init__(self, sr: int = 16_000, buffer_limit: float = 15.0):
        self.sr          = sr
        self.buffer_limit = buffer_limit
        self.buf         = np.empty(0, dtype=np.float32)

    def add(self, chunk: np.ndarray) -> None:
        self.buf = np.append(self.buf, chunk)
        max_samples = int(self.buffer_limit * self.sr)
        if self.buf.size > max_samples:
            self.buf = self.buf[-max_samples:]

    def get(self) -> np.ndarray:
        """Return the current buffer (feed to ASR)."""
        return self.buf
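The trimming keeps at most `buffer_limit` seconds of audio no matter how much is pushed in. A standalone check of that invariant (duplicating the `add` logic so it runs on its own):

```python
import numpy as np

sr, buffer_limit = 16_000, 15.0
max_samples = int(buffer_limit * sr)

buf = np.empty(0, dtype=np.float32)
for _ in range(20):                      # push twenty 1-second chunks
    buf = np.append(buf, np.zeros(sr, dtype=np.float32))
    if buf.size > max_samples:           # same trim as RealTimeProcessor.add
        buf = buf[-max_samples:]

print(buf.size)   # 240000 — capped at 15 s even after 20 s of input
```

Note that `np.append` copies the whole buffer on every call (O(n)); a preallocated ring buffer with a write cursor would avoid that if the per-chunk cost ever matters.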

3. Voice-activity detection (Silero VAD)

import torch

class VAD:
    def __init__(self, threshold: float = 0.5):
        self.model, _ = torch.hub.load('snakers4/silero-vad', 'silero_vad')
        self.threshold = threshold

    def is_speech(self, chunk: np.ndarray, sr: int = 16_000) -> bool:
        """True if the speech probability for `chunk` exceeds `threshold`."""
        # The model returns a probability, not a bool — bool(tensor) would
        # report speech for almost any nonzero output.
        prob = self.model(torch.from_numpy(chunk).float(), sr).item()
        return prob > self.threshold
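One caveat: recent Silero VAD releases score fixed 512-sample windows at 16 kHz, so a 1-second chunk should be split into windows before calling the model (how to aggregate the per-window probabilities, e.g. taking the max, is a design choice, not part of the upstream API). A sketch of the windowing:

```python
import numpy as np

WINDOW = 512  # samples per Silero VAD window at 16 kHz

def windows(chunk: np.ndarray, size: int = WINDOW):
    """Yield consecutive fixed-size windows, dropping the ragged tail."""
    for start in range(0, len(chunk) - size + 1, size):
        yield chunk[start : start + size]

chunk = np.zeros(16_000, dtype=np.float32)     # one second of audio
n = sum(1 for _ in windows(chunk))
print(n)   # 31 — full 512-sample windows fit in 16000 samples
```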

4. Orchestration & example run

class TranscriptionPipeline:
    """Glue VAD + rolling buffer + (placeholder) ASR."""
    def __init__(self, model_repo: str, sr: int = 16_000, buffer_limit: float = 15.0):
        self.proc = RealTimeProcessor(sr, buffer_limit)
        self.vad  = VAD()
        self.asr  = load(model_repo)   # placeholder ASR loader

    def on_chunk(self, chunk: np.ndarray) -> None:
        if self.vad.is_speech(chunk):
            self.proc.add(chunk)
            # ⬇️ Replace the print with a call to Whisper, Parakeet, etc.
            print(f"Transcribing {self.proc.get().size} samples …")
        else:
            print("⏸️ silence")

# ----- demo with 1-second slices ------------------------------------------
audio_file = "example_audio.wav"
sr         = 16_000
audio      = load_audio(audio_file, sr)

pipeline = TranscriptionPipeline("<asr-model-repo>", sr)  # substitute a real model repo
for t in np.arange(0, len(audio) / sr, 1.0):          # 1-second steps
    chunk = load_audio_chunk(audio, t, t + 1.0, sr)
    pipeline.on_chunk(chunk)

Next steps

  • Plug in a real ASR backend: swap the print call for whisper_model.transcribe(pipeline.proc.get()).
  • Live audio: feed microphone or websocket frames into on_chunk instead of slicing a file.
  • Tune VAD: expose model thresholds or use a custom-trained detector for noisy environments.
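For the live-audio step, a hedged sketch of a microphone adapter using the third-party `sounddevice` package (the helper name `make_mic_callback` is an assumption, not existing mlx-audio API): it converts sounddevice's `(frames, channels)` blocks into the 1-D mono chunks `on_chunk` expects.

```python
import numpy as np

def make_mic_callback(on_chunk):
    """Adapt sounddevice's (frames, channels) float32 blocks to the
    1-D mono chunks that TranscriptionPipeline.on_chunk expects."""
    def callback(indata, frames, time, status):
        if status:
            print(status)           # report over/underflows
        on_chunk(indata[:, 0].copy())   # take channel 0 as mono
    return callback

# With `sounddevice` installed and a working input device, wiring it up
# would look roughly like:
#
#   import sounddevice as sd
#   cb = make_mic_callback(pipeline.on_chunk)
#   with sd.InputStream(samplerate=16_000, channels=1, dtype="float32",
#                       blocksize=16_000, callback=cb):
#       sd.sleep(60_000)   # stream for one minute
```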

This is the minimal, dependency-light foundation you can evolve into a full real-time transcription service.

Blaizzy avatar May 09 '25 14:05 Blaizzy

I see the MLX version of JARVIS is coming to life! How about the name M.A.R.V.I.S., also inspired by J.O.S.I.E. from @Goekdeniz-Guelmez?

lin72h avatar May 11 '25 02:05 lin72h

Thank you very much @lin72h, I love that idea! ❤️

If JARVIS stands for: Just a Rather Very Intelligent System

I'm thinking MARVIS can stand for either:

  1. Maybe A Really Very Intelligent Sidekick
  2. Modular Autonomous Reasoning Versatile Intelligent System
  3. Merely A Rather Very Intelligent System

What do you think?

Blaizzy avatar May 11 '25 07:05 Blaizzy

How about My Awesome Realtime Vision Intelligence System? Then in the coming 2.0 you can integrate your mlx-vlm into the system to have true Vision, and Vision is also JARVIS's 2nd name :-)

In Chinese, 马维斯 (Mǎ Wéisī) echoes both 马斯克 (Musk) and 贾维斯 (Jarvis). Since Elon Musk is often likened to a real-life Tony Stark, the name feels spot-on.

lin72h avatar May 11 '25 09:05 lin72h

You have a point, I love it! 👀

mlx-vlm is already a day-0 dependency, but the Vision integration part will come in v0.3.0 at the end of the month.

Blaizzy avatar May 11 '25 10:05 Blaizzy

It can alternate between:

My Awesome Real-time Vision Intelligent System / Very Intelligent System

Depends on whether you are using vision or llm 😎

Blaizzy avatar May 11 '25 10:05 Blaizzy

I can't wait to introduce MARVIS 🙌🏽🔥

Blaizzy avatar May 11 '25 10:05 Blaizzy

Depends on you, I like both! 🙌🏽

lin72h avatar May 11 '25 10:05 lin72h

Awesome, thank you very much!❤️

MARVIS is on the way

Blaizzy avatar May 11 '25 13:05 Blaizzy