RealtimeSTT Feature Request - Wake Word delay - stay open listening after wake word for x seconds

Thanks for a great program/script!

Is there a way, or could one be added, to have a wake word delay? By this I mean a way for the recorder to keep listening without requiring a wake word for X seconds after it detects the wake word, in the context of a conversation.

For example, the user uses a wake word to wake the recorder and has a conversation with the LLM for 2-3 minutes. When the recorder detects silence for X seconds (say 1-2 minutes), it goes back to listening for a wake word.

My use case is this. I have an AI in a public game lobby. Users come up and chat with it for 1-2 minutes, then move on. Right now, the way the wake word works, they need to use a wake word with each new sentence they say. For example:

Player: Hey SkyNet, who had the most kills on the last game?
SkyNet: The user with the most kills was SkyCandy. She had 21 kills with the MP9
Player: Hey SkyNet, who had the most damage
SkyNet: The user with the most damage was Crypto4x. He had 3000 damage with the AK
Player: Hey Skynet.....

A more intuitive conversation would not require the player to use a wake word for each follow up question. When the conversation is "over" the LLM/recorder goes back to listening for a wake word...

Player: Hey SkyNet, who had the most kills on the last game?
SkyNet: The user with the most kills was SkyCandy. She had 21 kills with the MP9
Player: Who had the most damage
SkyNet: The user with the most damage was Crypto4x. He had 3000 damage with the AK
Player: What is the average speed of an unladen swallow quote?
SkyNet: An African swallow or a European Swallow...
player leaves, silence for 1 minute (recorder is open listening for sound)
Recorder goes back to listening for wake word after 1 minute of silence
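
The behavior I'm describing could be sketched roughly like this. It is a minimal illustration only: FollowUpWindow, mark_activity and needs_wake_word are hypothetical names, not part of RealtimeSTT, and the clock is injectable so the timing can be checked without waiting.

```python
import time
from typing import Callable, Optional


class FollowUpWindow:
    """Hypothetical helper (not a RealtimeSTT class): decides whether a
    follow-up utterance may skip the wake word."""

    def __init__(self, window_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.window_seconds = window_seconds
        self._clock = clock
        self._last_activity: Optional[float] = None  # None = no open conversation

    def mark_activity(self) -> None:
        """Call when the wake word is heard and after each exchange."""
        self._last_activity = self._clock()

    def needs_wake_word(self) -> bool:
        """True if no conversation is open or the silence window has elapsed."""
        if self._last_activity is None:
            return True
        return self._clock() - self._last_activity >= self.window_seconds
```

With window_seconds=60, a player who keeps talking never needs the wake word again, and one minute after the last exchange the gate closes.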

Jan 21 '25 15:01 SkyCandy567

The issue is in your implementation.

Cases:

- Assuming your voice assistant has no interruption feature (players can't interrupt while SkyNet is responding), you don't need to listen while SkyNet is responding. You can wake up the listener with recorder.start() or recorder.wakeup() after SkyNet has stopped playing the audio.
- If your voice assistant listens for interruptions (it doesn't sound like it), then you would call recorder.wakeup() either on an interval until SkyNet is no longer recording or in the interruption logic.

Edit:

If you are using KoljaB/RealtimeTTS, you can use the on_audio_stream_stop callback argument.

        # Initialize audio player (inside your class's __init__)
        self.audio_player = TextToAudioStream(
            engine=self.audio_engine,
            on_audio_stream_stop=self.on_play_stop,  # on_play_stop is the function that starts the listener
        )

    def on_play_stop(self):
        # Preferably run this in a separate thread, since you are also recording audio
        text = self.recorder.text().strip()

Jan 22 '25 03:01 johnmalek312

This is what I first tried. However, people constantly talk in her vicinity. Since she is constantly listening, she picks up conversations that are not directed at her. I played with various noise gates, but the video game gives everyone within about 7 feet of her the same volume level, which means she is constantly responding to conversations that have nothing to do with her.

This is when I switched to the wake word, to keep her from responding to every little conversation. This way, she doesn't respond unless a player is specifically talking to her. The issue doesn't lie with her being interrupted; it's that each subsequent statement/question from a user requires a wake word again.

Jan 22 '25 03:01 SkyCandy567

Can you please try the wake_word_activation_delay parameter?

Jan 23 '25 21:01 KoljaB

Thank you!! I think that is what I am looking for. For some reason, I thought that was just when it started up for the very first time 🤦‍♀️

Jan 26 '25 06:01 SkyCandy567

If wake_word_activation_delay = 30.0, does that mean the wake word will be triggered again 30 seconds after it was triggered?

[Explanation]

- wake_word_activation_delay (float, default=0): Duration in seconds after the start of monitoring before the system switches to wake word activation if no voice is initially detected. If set to zero, the system uses wake word activation immediately.

Apr 08 '25 08:04 sangheonEN

I'm curious as to how you trained the wake word model. Does it work well?

Apr 08 '25 08:04 sangheonEN

These are the working modes in RealtimeSTT:

1. waiting for a wakeword
- when you speak a wakeword the system switches from 1 to 2
2. waiting for voice activity
- on voice activity it switches from 2 to 3
- if there is no voice activity for the number of seconds specified in wake_word_activation_delay it returns to 1
3. recording
- waits for voice inactivity, then transcribes and returns to 1
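
As a rough illustration, the three modes listed above can be written as a small transition table. This is a toy model of the list only, not RealtimeSTT's internal implementation; the Mode enum and the event names are made up for the sketch.

```python
from enum import Enum


class Mode(Enum):
    WAIT_WAKE_WORD = 1
    WAIT_VOICE = 2
    RECORDING = 3


# Event names are illustrative labels, not RealtimeSTT identifiers.
TRANSITIONS = {
    (Mode.WAIT_WAKE_WORD, "wake_word"): Mode.WAIT_VOICE,
    (Mode.WAIT_VOICE, "voice_start"): Mode.RECORDING,
    (Mode.WAIT_VOICE, "silence_timeout"): Mode.WAIT_WAKE_WORD,
    (Mode.RECORDING, "transcribed"): Mode.WAIT_WAKE_WORD,
}


def step(mode: Mode, event: str) -> Mode:
    """Return the next mode; events that do not apply leave the mode unchanged."""
    return TRANSITIONS.get((mode, event), mode)


def run(events, start: Mode = Mode.WAIT_WAKE_WORD) -> Mode:
    """Feed a sequence of events through the table and return the final mode."""
    mode = start
    for event in events:
        mode = step(mode, event)
    return mode
```

For example, run(["wake_word", "silence_timeout"]) ends in Mode.WAIT_WAKE_WORD, while speech without a prior wake word leaves the model in mode 1.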

Apr 08 '25 08:04 KoljaB

A more intuitive conversation would not require the player to use a wake word for each follow up question. When the conversation is "over" the LLM/recorder goes back to listening for a wake word...

How can I get that behavior with the modes above? That is, how can I make the wake word required again only after a specific scenario (steps 1-10) has been completed?

It's not like I have to use a wake word for every scenario.

Apr 08 '25 09:04 sangheonEN

When the conversation is "over"

Yeah, this is exactly what wake_word_activation_delay is about: it defines the amount of time that has to pass in silence before we decide "conversation is over" -> wake word mode. I'll post code...

Apr 08 '25 11:04 KoljaB

"""
pip install realtimestt realtimetts[edge]
"""

# Set this to False to start by waiting for a wake word first
# Set this to True to start directly in voice activity mode
START_IN_VOICE_ACTIVITY_MODE = False

if __name__ == '__main__':
    import os
    import openai
    from RealtimeTTS import TextToAudioStream, EdgeEngine
    from RealtimeSTT import AudioToTextRecorder

    # Text-to-Speech Stream Setup (EdgeEngine)
    engine = EdgeEngine(rate=0, pitch=0, volume=0)
    engine.set_voice("en-US-SoniaNeural")
    stream = TextToAudioStream(
        engine,
        log_characters=True
    )

    # Speech-to-Text Recorder Setup
    recorder = AudioToTextRecorder(
        model="medium",
        language="en",
        wake_words="Jarvis",
        spinner=True,
        wake_word_activation_delay=5 if START_IN_VOICE_ACTIVITY_MODE else 0,
    )

    system_prompt_message = {
        'role': 'system',
        'content': 'Answer precise and short with the polite sarcasm of a butler.'
    }

    def generate_response(messages):
        """Generate assistant's response using OpenAI."""
        response_stream = openai.chat.completions.create(
            model="gpt-4o-mini",
            messages=messages,
            stream=True
        )

        for chunk in response_stream:
            text_chunk = chunk.choices[0].delta.content
            if text_chunk:
                yield text_chunk

    history = []
    
    try:
        # Main loop for interaction
        while True:
            if START_IN_VOICE_ACTIVITY_MODE:
                print("Please speak...")
            else:
                print('Say "Jarvis" then speak...')

            user_text = recorder.text().strip()

            # If not starting in voice activity mode, set the delay after the first interaction
            if not START_IN_VOICE_ACTIVITY_MODE:
                recorder.wake_word_activation_delay = 5

            print(f"Transcribed: {user_text}")

            if not user_text:
                continue

            print(f'>>> {user_text}\n<<< ', end="", flush=True)
            history.append({'role': 'user', 'content': user_text})

            # Get assistant response and play it
            assistant_response = generate_response([system_prompt_message] + history[-10:])
            stream.feed(assistant_response).play()

            history.append({'role': 'assistant', 'content': stream.text()})
    except KeyboardInterrupt:
        print("\nKeyboard interrupt detected. Shutting down...")
        recorder.shutdown()

Does that help or make things more clear?

Apr 08 '25 11:04 KoljaB

wake_word_timeout (float, default=5): Duration in seconds after a wake word is recognized. If no subsequent voice activity is detected within this window, the system transitions back to an inactive state, awaiting the next wake word or voice activation.

What does this parameter do?

It seems to be no different from the wake_word_activation_delay parameter.
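
One possible reading of the two descriptions quoted in this thread is that they time different things: wake_word_activation_delay counts from the start of monitoring, while wake_word_timeout counts from the moment a wake word is recognized. The sketch below only encodes that reading. The function names are hypothetical, the boundary handling (>=) is an assumption, and none of this is taken from RealtimeSTT's code.

```python
def wake_word_required(seconds_since_listen_start: float,
                       wake_word_activation_delay: float) -> bool:
    """Reading of the quoted text: voice alone is accepted until the delay
    has elapsed; a delay of zero means the wake word is needed immediately."""
    return seconds_since_listen_start >= wake_word_activation_delay


def wake_window_expired(seconds_since_wake_word: float,
                        wake_word_timeout: float) -> bool:
    """Reading of the quoted text: after a wake word, voice has to follow
    within the timeout, otherwise the system goes back to waiting."""
    return seconds_since_wake_word >= wake_word_timeout
```

Under this reading, wake_word_activation_delay=30.0 would mean no wake word is needed during the first 30 seconds of listening, and wake_word_timeout=5 would mean I have 5 seconds to start speaking after saying the wake word.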

Apr 11 '25 06:04 sangheonEN