
VAD Events (on_vad_detect_start/on_vad_detect_stop) Are Triggered in Reverse Order

Open v-crn opened this issue 7 months ago • 6 comments

Description

When using AudioToTextRecorder from RealtimeSTT, the Voice Activity Detection (VAD) events on_vad_detect_start and on_vad_detect_stop are triggered in reverse order compared to expected behavior:

  • on_vad_detect_start fires when there is no speech (e.g., immediately after recording starts or after speech).
  • on_vad_detect_stop fires when speech begins.

This inversion disrupts applications relying on accurate VAD event timing, such as voice activity monitoring or real-time transcription systems. The issue was observed consistently in a minimal test case, with logs showing VAD Start during silence and VAD Stop coinciding with speech onset.

Additionally, the initialization logs indicate multiple ALSA-related errors (e.g., "cannot find card '0'"), which may suggest underlying audio device detection issues, though these do not seem to directly cause the VAD inversion.

Steps to Reproduce

  1. Set up a Python environment with RealtimeSTT installed.
  2. Run the provided reproduction code (see below). Start the script without speaking initially, then begin speaking after a few seconds. Observe the output logs for VAD event timings.

Reproduction Code

import asyncio
import time

from RealtimeSTT import AudioToTextRecorder


# Voice Activity Detection (VAD) start handler
def on_vad_detect_start():
    print(f"VAD Start detected at {time.time():.2f}")


# Voice Activity Detection (VAD) stop handler
def on_vad_detect_stop():
    print(f"VAD Stop detected at {time.time():.2f}")


# Transcription completion handler
def on_transcription_finished(text):
    print(f"Transcribed text: {text}")


async def run_recording(recorder):
    # Start recording and process audio in a loop
    print("Starting recording...")
    while True:
        # Use text() to process audio and get transcription
        recorder.text(on_transcription_finished=on_transcription_finished)
        await asyncio.sleep(0.1)  # Prevent tight loop


async def main():
    # Initialize AudioToTextRecorder with VAD event handlers
    recorder = AudioToTextRecorder(
        # model="deepdml/faster-whisper-large-v3-turbo-ct2",
        spinner=False,
        on_vad_detect_start=on_vad_detect_start,
        on_vad_detect_stop=on_vad_detect_stop,
    )

    # Start recording task in a separate thread
    recording_task = asyncio.create_task(run_recording(recorder))

    # Run for 20 seconds to observe VAD events
    await asyncio.sleep(20)

    # Stop recording and shutdown
    print("Stopping recording...")
    recorder.stop()
    recorder.shutdown()

    # Cancel and wait for the recording task to complete
    recording_task.cancel()
    try:
        await recording_task
    except asyncio.CancelledError:
        pass


if __name__ == "__main__":
    asyncio.run(main())

Actual Behavior

The logs from running the script show:

Starting recording...
VAD Start detected at 1742455350.84  # No speech (silence)
VAD Stop detected at 1742455356.66   # Speech starts
Transcribed text: Hello.
VAD Start detected at 1742455359.15  # Speech ends
VAD Stop detected at 1742455362.37   # Speech starts
Transcribed text: 1,2,3,4.
VAD Start detected at 1742455365.79  # Speech ends
VAD Stop detected at 1742455367.41   # Speech starts
Transcribed text: 6 7.
VAD Start detected at 1742455369.68  # Speech ends
VAD Stop detected at 1742455373.54   # Speech starts
Transcribed text: 8,9,10.
Stopping recording...

  • VAD Start consistently triggers during silence (e.g., at recording start or after speech).
  • VAD Stop triggers when speech is detected, as evidenced by the subsequent transcription outputs.

Expected

  • on_vad_detect_start should trigger when speech begins.
  • on_vad_detect_stop should trigger when speech ends.
  • During silence (e.g., at recording start or between utterances), no VAD events should occur unless speech is detected.

For example:

Starting recording...
VAD Start detected at 1742455356.66   # Speech starts
VAD Stop detected at 1742455357.00    # Speech ends
Transcribed text: Hello.
VAD Start detected at 1742455362.37   # Speech starts
VAD Stop detected at 1742455363.00    # Speech ends
Transcribed text: 1,2,3,4.

Environment

  • Python Version: 3.12
  • RealtimeSTT Version: 0.3.94
  • OS: Ubuntu 24.04.1 LTS via WSL on Windows 11
  • Audio Setup: Default system microphone (ALSA errors suggest potential device detection issues)

Additional Logs

The script also outputs numerous ALSA errors during initialization:

ALSA lib confmisc.c:855:(parse_card) cannot find card '0'
ALSA lib conf.c:5204:(_snd_config_evaluate) function snd_func_card_inum returned error: No such file or directory
...
ALSA lib pcm.c:2721:(snd_pcm_open_noupdate) Unknown PCM sysdefault
...
Cannot connect to server socket err = No such file or directory
Cannot connect to server request channel
jack server is not running or cannot be started

While these may indicate audio device configuration issues, they do not appear to directly cause the VAD inversion, as transcription still works.

v-crn avatar Mar 20 '25 08:03 v-crn

Thank you for the detailed report, I'll look into this and fix it asap

KoljaB avatar Mar 20 '25 10:03 KoljaB

Should be fixed now in v0.3.99

KoljaB avatar Mar 21 '25 22:03 KoljaB

Thank you for your support in addressing this.

I misunderstood the specification. on_vad_detect_start is triggered when the system is ready to detect voice activity, not when voice activity is actually detected. I apologize for that.

However, it seems that another issue has occurred ~after the changes in v0.3.99~.

When running the reproduction code above, the process does not terminate even after the specified sleep time has elapsed. It now terminates only after on_vad_detect_stop has been called and the transcription has completed.

Starting recording...
VAD Start detected at 1742631572.14  # Silence
(Exceeding the sleep time)
(Say 'hello')
VAD Stop detected at 1742631591.64
Transcribed text: Hello.
Stopping recording...

v-crn avatar Mar 22 '25 08:03 v-crn

This behavior was already present before that update. I’m not sure whether it is an issue that needs to be fixed.

v-crn avatar Mar 22 '25 09:03 v-crn

> I misunderstood the specification. on_vad_detect_start is triggered when the system is ready to detect voice activity, not when voice activity is actually detected.

Oh, I also overlooked that. I'll add two more callbacks for actual VAD so we have both situations covered.
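
To make the difference between the two callback pairs concrete, here is a toy model. It is not RealtimeSTT's implementation: ToyRecorder, listen() and feed() are hypothetical names invented for this illustration. It assumes that on_vad_detect_start/on_vad_detect_stop bracket the "waiting for voice" phase, as described above, and that the additional on_vad_start/on_vad_stop callbacks bracket the speech itself. The relative order of on_vad_detect_stop and on_vad_start at speech onset is also an assumption.

```python
from typing import Callable, List, Optional

Callback = Optional[Callable[[], None]]


class ToyRecorder:
    """Toy model of the two callback pairs (hypothetical, not the library)."""

    def __init__(
        self,
        on_vad_detect_start: Callback = None,
        on_vad_detect_stop: Callback = None,
        on_vad_start: Callback = None,
        on_vad_stop: Callback = None,
    ) -> None:
        self.on_vad_detect_start = on_vad_detect_start
        self.on_vad_detect_stop = on_vad_detect_stop
        self.on_vad_start = on_vad_start
        self.on_vad_stop = on_vad_stop
        self._speaking = False

    @staticmethod
    def _fire(callback: Callback) -> None:
        if callback is not None:
            callback()

    def listen(self) -> None:
        # The recorder is now ready and waiting for voice (still silence).
        self._fire(self.on_vad_detect_start)

    def feed(self, is_speech: bool) -> None:
        # Feed one audio frame, already classified as speech or silence.
        if is_speech and not self._speaking:
            self._speaking = True
            self._fire(self.on_vad_detect_stop)  # waiting phase is over
            self._fire(self.on_vad_start)        # speech actually began
        elif not is_speech and self._speaking:
            self._speaking = False
            self._fire(self.on_vad_stop)         # speech actually ended


def simulate(frames: List[bool]) -> List[str]:
    """Run the toy recorder over frames and return the fired event names."""
    events: List[str] = []
    recorder = ToyRecorder(
        on_vad_detect_start=lambda: events.append("on_vad_detect_start"),
        on_vad_detect_stop=lambda: events.append("on_vad_detect_stop"),
        on_vad_start=lambda: events.append("on_vad_start"),
        on_vad_stop=lambda: events.append("on_vad_stop"),
    )
    recorder.listen()
    for frame in frames:
        recorder.feed(frame)
    return events


if __name__ == "__main__":
    # Two silent frames, two speech frames, then silence again.
    for name in simulate([False, False, True, True, False]):
        print(name)
```

In this model, on_vad_detect_start fires during silence and on_vad_detect_stop fires at speech onset, which matches the logs in the original report.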

> When running the reproduction code above, the process does not terminate.

The core problem is that recorder.text() is holding up your script because it's a blocking operation.

What's happening is:

  • the run_recording task gets started
  • inside it, recorder.text() gets called, which blocks and waits for audio input
  • now the event loop is stuck until you speak and the transcription completes
  • only after the transcription finishes can the loop get back to main(), which is still waiting in await asyncio.sleep(20)
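
The stall can be reproduced without a microphone. In the sketch below, blocking_text() is a hypothetical stand-in for recorder.text() that simply sleeps for half a second; everything else is standard asyncio. It measures how long a 0.05 s asyncio.sleep really takes while the worker runs, once with a direct call and once with the call moved to a thread.

```python
import asyncio
import time


def blocking_text() -> str:
    """Hypothetical stand-in for recorder.text(): blocks the calling thread."""
    time.sleep(0.5)
    return "hello"


async def blocking_worker() -> None:
    # Direct call: the event loop cannot run anything else in the meantime.
    blocking_text()


async def threaded_worker() -> None:
    # The call runs in a worker thread, so the event loop stays responsive.
    await asyncio.to_thread(blocking_text)


async def measure(worker) -> float:
    """Return how long a 0.05 s asyncio.sleep takes while worker runs."""
    task = asyncio.create_task(worker())
    start = time.perf_counter()
    await asyncio.sleep(0.05)
    elapsed = time.perf_counter() - start
    await task
    return elapsed


def run_demo():
    blocked = asyncio.run(measure(blocking_worker))
    free = asyncio.run(measure(threaded_worker))
    return blocked, free


if __name__ == "__main__":
    blocked, free = run_demo()
    print(f"direct call: sleep(0.05) took {blocked:.2f}s")
    print(f"to_thread:   sleep(0.05) took {free:.2f}s")
```

With the direct call, the short sleep cannot finish until the blocking function returns. This is the same reason the 20-second sleep in the reproduction code appears to overrun.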

I suggest something like this:

import concurrent.futures
import asyncio
import time

from RealtimeSTT import AudioToTextRecorder


# Voice Activity Detection (VAD) start handler
def on_vad_start():
    print(f"VAD Start detected at {time.time():.2f}")


# Voice Activity Detection (VAD) stop handler
def on_vad_stop():
    print(f"VAD Stop detected at {time.time():.2f}")


# Transcription completion handler
def on_transcription_finished(text):
    print(f"Transcribed text: {text}")


async def run_recording(recorder):
    print("Starting recording...")
    with concurrent.futures.ThreadPoolExecutor() as pool:
        while True:
            # get_running_loop() is the idiomatic call inside a coroutine
            await asyncio.get_running_loop().run_in_executor(
                pool,
                lambda: recorder.text(on_transcription_finished=on_transcription_finished)
            )
            await asyncio.sleep(0.01)

async def main():
    # Initialize AudioToTextRecorder with VAD event handlers
    recorder = AudioToTextRecorder(
        # model="deepdml/faster-whisper-large-v3-turbo-ct2",
        spinner=False,
        on_vad_start=on_vad_start,
        on_vad_stop=on_vad_stop,
    )

    # Start recording task in a separate thread
    recording_task = asyncio.create_task(run_recording(recorder))

    # Run for about 20 seconds, printing a countdown, to observe VAD events
    await asyncio.sleep(1)
    for i in range(20, 0, -1):
        print(f"Waiting... {i} sec")
        await asyncio.sleep(1)

    # Stop recording and shutdown
    print("Stopping recording...")
    recorder.stop()
    recorder.shutdown()

    # Cancel and wait for the recording task to complete
    recording_task.cancel()
    try:
        await recording_task
    except asyncio.CancelledError:
        pass


if __name__ == "__main__":
    asyncio.run(main())

KoljaB avatar Mar 22 '25 10:03 KoljaB

async def run_recording(recorder):
    print("Starting recording...")
    while True:
        await asyncio.to_thread(
            recorder.text,
            on_transcription_finished=on_transcription_finished
        )
        await asyncio.sleep(0.01)

Btw this would also work with Python 3.9+ and is prob cleaner
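
One caveat applies to both variants: cancelling an asyncio task that is awaiting a thread does not interrupt the thread itself. The sketch below shows this with a threading.Event as a hypothetical stand-in for whatever makes the blocking call return. Whether recorder.shutdown() has that effect on a pending recorder.text() is not confirmed in this thread, so treat that mapping as an assumption.

```python
import asyncio
import threading


async def demo():
    stop_requested = threading.Event()  # stand-in for a shutdown signal
    finished = threading.Event()        # set when the blocking call returns

    def blocking_text() -> None:
        # Hypothetical stand-in for recorder.text(): waits until told to stop.
        stop_requested.wait(timeout=5.0)  # timeout is only a safety net
        finished.set()

    task = asyncio.create_task(asyncio.to_thread(blocking_text))
    await asyncio.sleep(0.1)  # give the worker thread time to start

    # Cancelling ends the task, but the worker thread keeps running.
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    still_running = not finished.is_set()

    # Only an explicit signal lets the blocking call return.
    stop_requested.set()
    await asyncio.to_thread(finished.wait, 2.0)
    return still_running, finished.is_set()


def run_demo():
    return asyncio.run(demo())


if __name__ == "__main__":
    still_running, done = run_demo()
    print(f"thread still running after cancel: {still_running}")
    print(f"thread finished after stop signal: {done}")
```

asyncio.run() also waits for the default executor's threads when it exits, so a blocking call that never returns would keep the process alive even after the task has been cancelled.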

KoljaB avatar Mar 22 '25 10:03 KoljaB