
Use FastRTC for Speaches' real-time API

Open · freddyaboulton opened this issue 9 months ago · 6 comments

Hello! I am the developer of FastRTC - a Python library for streaming audio and video over WebRTC or WebSockets. You define a Python function (or handler) and you get an automatic FastAPI-compatible WebRTC and WebSocket endpoint you can use to stream audio or video in real time!

I was reading the documentation and noticed that the realtime API for Speaches is not implemented yet, so I think FastRTC can help!
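For context, the basic FastRTC pattern looks roughly like the minimal sketch below (not from the original comment; the mount call follows the FastRTC docs, but double-check it against the version you install):

import numpy as np
from fastapi import FastAPI
from fastrtc import ReplyOnPause, Stream


def echo(audio: tuple[int, np.ndarray]):
    # The handler receives (sample_rate, samples) and can yield audio back.
    yield audio


stream = Stream(ReplyOnPause(echo), modality="audio", mode="send-receive")

# Mounting on a FastAPI app exposes the WebRTC/WebSocket endpoints.
app = FastAPI()
stream.mount(app)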

freddyaboulton avatar Mar 05 '25 17:03 freddyaboulton

I saw the video on your website, @freddyaboulton. I love the idea of the project. I'm going to try it out.

vqndev avatar Mar 11 '25 23:03 vqndev

Awesome! Let me know how I can help

freddyaboulton avatar Mar 11 '25 23:03 freddyaboulton

I must say you did a great job @freddyaboulton!!! Thank you so much. I took https://github.com/freddyaboulton/fastrtc/tree/main/demo/whisper_realtime and modified it to call the OpenAI-compatible API exposed by Speaches instead of Groq:

import os

import numpy as np
from fastrtc import AdditionalOutputs, audio_to_bytes
from openai import AsyncOpenAI

# Point the OpenAI client at the Speaches server.
client = AsyncOpenAI(
    base_url=os.getenv("OPENAI_API_BASE", "https://url.to.speaches/v1")
)


async def transcribe(audio: tuple[int, np.ndarray], transcript: str):
    # audio is (sample_rate, samples); audio_to_bytes converts it for upload.
    response = await client.audio.transcriptions.create(
        file=("audio-file.mp3", audio_to_bytes(audio)),
        model="deepdml/faster-whisper-large-v3-turbo-ct2",
    )
    yield AdditionalOutputs(transcript + "\n" + response.text)
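For reference, in the whisper_realtime demo this kind of handler is plugged into a FastRTC Stream roughly as sketched below; the Gradio components and the additional_outputs_handler are assumptions based on that demo, not part of the original comment:

import gradio as gr
from fastrtc import ReplyOnPause, Stream

stream = Stream(
    ReplyOnPause(transcribe),
    modality="audio",
    mode="send-receive",
    additional_inputs=[gr.Textbox(label="Transcript")],
    additional_outputs=[gr.Textbox(label="Transcript")],
    # Keep whatever the handler yielded as the latest transcript.
    additional_outputs_handler=lambda previous, current: current,
)

stream.ui.launch()  # or stream.mount(app) to serve it from FastAPI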

One question: do you have example code where the client is not the browser? Like a Python client app?

vqndev avatar Mar 12 '25 00:03 vqndev

Hi @vqndev! Thank you!

It's possible to connect via websockets from any language, but I don't have a Python example yet. There is this JS code for connecting over a websocket in the browser. I think the Python code would be simpler since you can skip all the code for playing the audio in the browser (which needs to be improved anyway).

https://github.com/freddyaboulton/fastrtc/blob/66f0a81b76684c5d58761464fb67642891066f93/demo/webrtc_vs_websocket/index.html#L433

tl;dr you can connect via websocket and send/receive audio as base64
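A rough Python sketch of that idea, using the third-party websockets package; the endpoint path, the message schema, and the audio encoding below are assumptions lifted from the JS demo, so check them against the linked index.html before relying on this:

import asyncio
import base64
import json

import websockets  # pip install websockets


async def stream_audio(chunks, url="ws://localhost:8000/websocket/offer"):
    # chunks: an iterable of raw encoded audio byte chunks to send.
    async with websockets.connect(url) as ws:
        # Announce the stream (message shape assumed from the JS demo).
        await ws.send(json.dumps({"event": "start", "websocket_id": "python-client"}))
        for chunk in chunks:
            await ws.send(json.dumps(
                {"event": "media", "media": {"payload": base64.b64encode(chunk).decode()}}
            ))
        # Responses come back as JSON messages carrying base64 audio / outputs.
        async for message in ws:
            print(json.loads(message))


# asyncio.run(stream_audio(my_chunks))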

freddyaboulton avatar Mar 12 '25 00:03 freddyaboulton

Thanks! I ask because I built my own assistant in Python using OpenWakeWord and was considering implementing FastRTC.

I've been testing FastRTC in the browser, but despite adjusting the SileroVadOptions and AlgoOptions based on the audio section of the FastRTC website, it doesn't handle my use case well.

My use case involves saying, "Hey Jarvis, send a text message to my mom saying I love her, over." Every second, I analyze the last three seconds of audio, transcribe it using Speaches, and check whether 'over' is at the end. If it is detected, I transcribe the entire phrase from "Hey Jarvis" to "over", remove 'over' with a regex, and send the cleaned text to my AI assistant (smolagents). Finally, Speaches TTS generates the spoken response.

I built the 'over' feature to prevent the assistant from misinterpreting long pauses as the end of a command, allowing me more time to think before completing my request.
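For concreteness, here is a tiny sketch of that detect-and-strip step (not from the original comment, just one way to write the regex):

import re


def strip_trailing_over(text: str) -> str | None:
    # Return the command with a trailing "over" (plus punctuation) removed,
    # or None if the phrase does not end with "over".
    match = re.search(r"^(?P<cmd>.*?)[\s,]*\bover[.!?,]*\s*$", text, re.IGNORECASE | re.DOTALL)
    return match.group("cmd").strip() if match else None


print(strip_trailing_over("send a text message to my mom saying I love her, over."))
# -> "send a text message to my mom saying I love her"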

However, when using FastRTC, it sometimes misses the first few words after a pause. For example:

  • "Add eggs and milk to my shopping list" → Transcribed correctly.
  • "Add eggs [500ms - 2s pause] to my shopping list" → Might transcribe as "Add eggs… shopping list".

vqndev avatar Mar 12 '25 12:03 vqndev

I see! There's an easy way to do the opposite, which is to trigger the audio collection when a word or phrase is mentioned, like "hey Jarvis" or "hey computer": https://fastrtc.org/userguide/audio/#reply-on-stopwords
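Roughly, per the FastRTC docs, that looks like the sketch below; treat the exact parameter names as something to verify against your installed version:

from fastrtc import ReplyOnStopWords, Stream


def respond(audio):
    # Receives the audio captured after the stop word was heard.
    ...


stream = Stream(
    ReplyOnStopWords(respond, stop_words=["hey jarvis", "hey computer"]),
    modality="audio",
    mode="send-receive",
)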

But you can also implement the exact behavior by subclassing ReplyOnPause; you just need to re-implement the determine_pause method. Here is what it would look like based on your description of running speech-to-text on the last three seconds of audio:

import re
from collections.abc import Callable
from typing import Literal

import numpy as np

from fastrtc.reply_on_pause import (
    AlgoOptions,
    AppState,
    ModelOptions,
    PauseDetectionModel,
    ReplyFnGenerator,
    ReplyOnPause,
)
from fastrtc.speech_to_text import get_stt_model, stt_for_chunks
from fastrtc.utils import audio_to_float32


class ReplyOnOver(ReplyOnPause):
    def __init__(
        self,
        fn: ReplyFnGenerator,
        startup_fn: Callable | None = None,
        algo_options: AlgoOptions | None = None,
        model_options: ModelOptions | None = None,
        can_interrupt: bool = True,
        expected_layout: Literal["mono", "stereo"] = "mono",
        output_sample_rate: int = 24000,
        output_frame_size: int = 480,
        input_sample_rate: int = 48000,
        model: PauseDetectionModel | None = None,
    ):
        super().__init__(
            fn,
            algo_options=algo_options,
            startup_fn=startup_fn,
            model_options=model_options,
            can_interrupt=can_interrupt,
            expected_layout=expected_layout,
            output_sample_rate=output_sample_rate,
            output_frame_size=output_frame_size,
            input_sample_rate=input_sample_rate,
            model=model,
        )
        self.algo_options.audio_chunk_duration = 3.0
        self.state = AppState()
        self.stt_model = get_stt_model("moonshine/base")

    def over_detected(self, text: str) -> bool:
        return bool(re.search(r"\bover[.,!?]*$", text.lower()))

    def determine_pause(  # type: ignore
        self, audio: np.ndarray, sampling_rate: int, state: AppState
    ) -> bool:
        """Take in the stream, determine if a pause happened"""
        import librosa

        duration = len(audio) / sampling_rate

        if duration >= self.algo_options.audio_chunk_duration:
            audio_f32 = audio_to_float32((sampling_rate, audio))
            audio_rs = librosa.resample(
                audio_f32, orig_sr=sampling_rate, target_sr=16000
            )
            _, chunks = self.model.vad(
                (16000, audio_rs),
                self.model_options,
            )
            text = stt_for_chunks(self.stt_model, (16000, audio_rs), chunks)
            print(f"Text: {text}")
            state.buffer = None
            if self.over_detected(text):
                state.stream = audio
                print("Over detected")
                return True
            state.stream = None
        return False

    def reset(self):
        super().reset()
        self.generator = None
        self.event.clear()
        self.state = AppState()

    def copy(self):
        return ReplyOnOver(
            self.fn,
            self.startup_fn,
            self.algo_options,
            self.model_options,
            self.can_interrupt,
            self.expected_layout,
            self.output_sample_rate,
            self.output_frame_size,
            self.input_sample_rate,
            self.model,
        )
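A hypothetical way to plug the subclass in, reusing a transcribe-style handler like the one earlier in the thread (not part of the original comment):

from fastrtc import Stream

stream = Stream(
    ReplyOnOver(transcribe),  # transcribe: an async generator handler as shown above
    modality="audio",
    mode="send-receive",
)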

Here is a demo: https://github.com/user-attachments/assets/171ff2b6-dbb3-4760-af26-0775088592a5

I think you can definitely get better results with a better speech-to-text model (moonshine, the default in FastRTC, makes spelling mistakes) and by tweaking the three-second buffer. Sometimes the leading phrase and "over" are split across three-second chunks.

freddyaboulton avatar Mar 12 '25 16:03 freddyaboulton