Use FastRTC for Speaches' real-time API
Hello! I am the developer of FastRTC - a Python library for streaming audio and video over WebRTC or WebSockets. You define a Python function (or handler) and you get an automatic FastAPI-compatible WebRTC and WebSocket endpoint you can use to stream audio or video in real time!
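To give a sense of the shape of the API, here is a minimal sketch (the echo handler and exact keyword arguments are illustrative; the quickstart on fastrtc.org is the authoritative version):

import numpy as np
from fastapi import FastAPI
from fastrtc import ReplyOnPause, Stream


def echo(audio: tuple[int, np.ndarray]):
    # Handler: receives (sample_rate, samples) and yields audio back
    yield audio


stream = Stream(handler=ReplyOnPause(echo), modality="audio", mode="send-receive")

app = FastAPI()
stream.mount(app)  # exposes the WebRTC and WebSocket endpoints on the FastAPI app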
I was reading the documentation and noticed that the real-time API for Speaches is not implemented yet, so I think FastRTC can help!
I saw the video on your website @freddyaboulton , I love the idea of the project. I'm going to try it out.
Awesome! Let me know how I can help
I must say you did a great job @freddyaboulton!!! Thank you so much. I took https://github.com/freddyaboulton/fastrtc/tree/main/demo/whisper_realtime and changed it to use the OpenAI API (pointed at Speaches) instead of Groq, with the following changes:
import os

import numpy as np
from fastrtc import AdditionalOutputs, audio_to_bytes
from openai import AsyncOpenAI

client = AsyncOpenAI(
    base_url=os.getenv("OPENAI_API_BASE", "https://url.to.speaches/v1")
)


async def transcribe(audio: tuple[int, np.ndarray], transcript: str):
    # Send the captured audio to the Speaches transcription endpoint
    response = await client.audio.transcriptions.create(
        file=("audio-file.mp3", audio_to_bytes(audio)),
        model="deepdml/faster-whisper-large-v3-turbo-ct2",
    )
    yield AdditionalOutputs(transcript + "\n" + response.text)
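For completeness, here is roughly how I wired that handler into the Stream, following the whisper_realtime demo (the Textbox wiring and the handler lambda are approximations of the demo, not verbatim):

import gradio as gr
from fastrtc import ReplyOnPause, Stream

stream = Stream(
    ReplyOnPause(transcribe),
    modality="audio",
    mode="send-receive",
    additional_inputs=[gr.Textbox(label="Transcript")],
    additional_outputs=[gr.Textbox(label="Transcript")],
    additional_outputs_handler=lambda old, new: new,
)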
One question: do you have example code where the client is not the browser? Like a Python client app?
Hi @vqndev ! Thank you!
It's possible to connect via WebSockets from any language, but I don't have a Python example yet. There is this JS code for connecting over WebSocket in the browser. I think the Python code would be simpler since you can skip all the code for playing the audio in the browser (which needs to be improved anyway).
https://github.com/freddyaboulton/fastrtc/blob/66f0a81b76684c5d58761464fb67642891066f93/demo/webrtc_vs_websocket/index.html#L433
tl;dr you can connect via WebSocket and send/receive audio in base64
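Off the top of my head, a Python client would look something like the sketch below (simplified to send-then-receive rather than interleaving both). The endpoint path and the start/media message schema here are assumptions based on the linked index.html, so double-check them against that file:

import base64
import json

import websockets  # pip install websockets


async def run(audio_chunks):
    # audio_chunks: an iterable of raw audio byte chunks to send
    # (check the linked JS for the exact sample rate/encoding FastRTC expects)
    async with websockets.connect("ws://localhost:8000/websocket/offer") as ws:
        await ws.send(json.dumps({"event": "start", "websocket_id": "python-client"}))
        for chunk in audio_chunks:
            payload = base64.b64encode(chunk).decode()
            await ws.send(json.dumps({"event": "media", "media": {"payload": payload}}))
        async for message in ws:
            data = json.loads(message)
            if data.get("event") == "media":
                audio_out = base64.b64decode(data["media"]["payload"])
                # ...play or save audio_out...

# run with: asyncio.run(run(my_audio_chunks))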
Thanks! I ask because I built my own assistant in Python using OpenWakeWord and was considering implementing FastRTC.
I've been testing FastRTC in the browser, but despite adjusting the SileroVadOptions and AlgoOptions based on the audio section of the FastRTC website, it doesn't handle my use case well.
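For reference, this is roughly how I was passing those options (parameter names are from the FastRTC audio guide; the exact values I tried varied):

from fastrtc import AlgoOptions, ReplyOnPause, SileroVadOptions, Stream

stream = Stream(
    ReplyOnPause(
        transcribe,
        algo_options=AlgoOptions(
            audio_chunk_duration=0.6,
            started_talking_threshold=0.2,
            speech_threshold=0.1,
        ),
        model_options=SileroVadOptions(
            threshold=0.5,
            min_silence_duration_ms=2000,  # trying to tolerate longer mid-sentence pauses
            speech_pad_ms=400,
        ),
    ),
    modality="audio",
    mode="send-receive",
)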
My use case involves saying, "Hey Jarvis, send a text message to my mom saying I love her, over." Every second, I analyze the last three seconds of audio, transcribe it with Speaches, and check whether "over" is at the end. If it is, I transcribe the entire phrase from "Hey Jarvis" to "over", remove "over" with a regex (sketched below), and send the cleaned text to my AI assistant (smolagents). Finally, Speaches TTS generates the spoken response.
I built the 'over' feature to prevent the assistant from misinterpreting long pauses as the end of a command, allowing me more time to think before completing my request.
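For reference, the "over" removal is just a regex along these lines (strip_over is a made-up name, not part of Speaches or FastRTC):

import re


def strip_over(text: str) -> str:
    # Drop a trailing "over" plus any punctuation from the transcript
    return re.sub(r"\bover[.,!?]*\s*$", "", text, flags=re.IGNORECASE).rstrip(" ,")


strip_over("Hey Jarvis, send a text message to my mom saying I love her, over.")
# -> "Hey Jarvis, send a text message to my mom saying I love her"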
However, when using FastRTC, it sometimes misses the first few words after a pause. For example:
- "Add eggs and milk to my shopping list" → Transcribed correctly.
- "Add eggs [500ms - 2s pause] to my shopping list" → Might transcribe as "Add eggs… shopping list".
I see! There's an easy way to do the opposite, which is to trigger the audio collection when a word or phrase is mentioned, like "hey Jarvis", or "hey computer". https://fastrtc.org/userguide/audio/#reply-on-stopwords
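A minimal sketch of that approach, assuming a respond handler with the same shape as yours (see the docs link above for the authoritative example):

from fastrtc import ReplyOnStopWords, Stream

stream = Stream(
    ReplyOnStopWords(
        respond,                    # your handler, same signature as with ReplyOnPause
        stop_words=["hey jarvis"],  # audio collection starts once this phrase is heard
        input_sample_rate=24000,
    ),
    modality="audio",
    mode="send-receive",
)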
But you can also implement the exact behavior by subclassing ReplyOnPause; you just need to re-implement the determine_pause method. Here is what it would look like based on your description of running speech-to-text on the last three seconds of audio:
import re
from collections.abc import Callable
from typing import Literal

import numpy as np

from fastrtc.reply_on_pause import (
    AlgoOptions,
    AppState,
    ModelOptions,
    PauseDetectionModel,
    ReplyFnGenerator,
    ReplyOnPause,
)
from fastrtc.speech_to_text import get_stt_model, stt_for_chunks
from fastrtc.utils import audio_to_float32


class ReplyOnOver(ReplyOnPause):
    def __init__(
        self,
        fn: ReplyFnGenerator,
        startup_fn: Callable | None = None,
        algo_options: AlgoOptions | None = None,
        model_options: ModelOptions | None = None,
        can_interrupt: bool = True,
        expected_layout: Literal["mono", "stereo"] = "mono",
        output_sample_rate: int = 24000,
        output_frame_size: int = 480,
        input_sample_rate: int = 48000,
        model: PauseDetectionModel | None = None,
    ):
        super().__init__(
            fn,
            algo_options=algo_options,
            startup_fn=startup_fn,
            model_options=model_options,
            can_interrupt=can_interrupt,
            expected_layout=expected_layout,
            output_sample_rate=output_sample_rate,
            output_frame_size=output_frame_size,
            input_sample_rate=input_sample_rate,
            model=model,
        )
        # Run speech-to-text over a rolling three-second window
        self.algo_options.audio_chunk_duration = 3.0
        self.state = AppState()
        self.stt_model = get_stt_model("moonshine/base")

    def over_detected(self, text: str) -> bool:
        # True if the transcript ends with "over" (plus optional punctuation)
        return bool(re.search(r"\bover[.,!?]*$", text.lower()))

    def determine_pause(  # type: ignore
        self, audio: np.ndarray, sampling_rate: int, state: AppState
    ) -> bool:
        """Take in the stream, determine if a pause happened"""
        import librosa

        duration = len(audio) / sampling_rate

        if duration >= self.algo_options.audio_chunk_duration:
            audio_f32 = audio_to_float32((sampling_rate, audio))
            audio_rs = librosa.resample(
                audio_f32, orig_sr=sampling_rate, target_sr=16000
            )
            _, chunks = self.model.vad(
                (16000, audio_rs),
                self.model_options,
            )
            text = stt_for_chunks(self.stt_model, (16000, audio_rs), chunks)
            print(f"Text: {text}")
            state.buffer = None
            if self.over_detected(text):
                state.stream = audio
                print("Over detected")
                return True
            state.stream = None
        return False

    def reset(self):
        super().reset()
        self.generator = None
        self.event.clear()
        self.state = AppState()

    def copy(self):
        return ReplyOnOver(
            self.fn,
            self.startup_fn,
            self.algo_options,
            self.model_options,
            self.can_interrupt,
            self.expected_layout,
            self.output_sample_rate,
            self.output_frame_size,
            self.input_sample_rate,
            self.model,
        )
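You'd then drop it in wherever you'd use ReplyOnPause, something like this (respond is a placeholder for your handler):

from fastrtc import Stream

stream = Stream(ReplyOnOver(respond), modality="audio", mode="send-receive")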
Here is a demo: https://github.com/user-attachments/assets/171ff2b6-dbb3-4760-af26-0775088592a5
I think you can definitely get better results with a better speech-to-text model (moonshine, the default in FastRTC, makes spelling mistakes) and by tweaking the three-second buffer. Sometimes the leading phrase and "over" are split across three-second chunks.