Performance Issue: Initial Greeting Latency in Telephony Agent Pipeline
pipecat version
0.0.85
Python version
3.13.2
Operating System
Arch Linux
Question
- Initial Response Optimization: Are there known patterns for optimizing the very first LLM→TTS→Audio delivery cycle in telephony applications? (See the sketch after this list for the kind of shortcut we have in mind.)
- Frame Queue Initialization: What could cause delay specifically for the first LLMRunFrame? Is there initialization overhead we can preload?
- TTS Cold Start: Are there streaming/chunking optimizations for initial responses with ElevenLabs?
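To make the first bullet concrete, here is the kind of shortcut we have in mind but have not validated: since the greeting text is fixed, skip the LLM on the very first turn and push a TTSSpeakFrame directly, so only TTS sits on the first-response critical path. The transport/task names below refer to the usual pipecat transport and PipelineTask objects, and the greeting text is just an example.

# Unvalidated sketch: speak a canned greeting immediately on connect instead of
# waiting for an LLM round trip, so only the TTS request is on the critical path.
from pipecat.frames.frames import TTSSpeakFrame

@transport.event_handler("on_client_connected")
async def on_client_connected(_transport, _client):
    # Fixed greeting for turn one; later turns use the normal STT -> LLM -> TTS path.
    await task.queue_frames([TTSSpeakFrame("Hi, this is Tasha. How can I help you today?")])
    # A real bot would probably also want this greeting reflected in the conversation context.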
What I've tried
- Service warm-up (pre-initializing OpenAI LLM, ElevenLabs TTS, OpenAI STT services; a rough sketch of what we mean follows this list)
- Switched to gpt-4o-mini for faster LLM responses
- Optimized ElevenLabs parameters (model, stability, speed settings)
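For reference, this is roughly what the warm-up bullet means. The helper name warm_up_openai is ours, not a pipecat API, and the throwaway request does not share a connection pool with pipecat's own service clients, so it mostly pays the Silero model load up front and verifies credentials/DNS before a caller is waiting. The same idea applies to the ElevenLabs client.

# Sketch of the warm-up idea: pay one-time costs at process startup, not on the
# first call of the first conversation.
import os
from openai import AsyncOpenAI
from pipecat.audio.vad.silero import SileroVADAnalyzer

# Loading the Silero VAD model is a genuine one-time cost; doing it at import time
# keeps it off the first call's critical path (whether one analyzer can be shared
# across concurrent calls is something to verify separately).
vad_analyzer = SileroVADAnalyzer()

async def warm_up_openai() -> None:
    """Tiny throwaway completion run at startup; the result is discarded."""
    client = AsyncOpenAI(api_key=os.getenv("OPENAI_API_KEY"))
    await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": "ping"}],
        max_tokens=1,
    )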
Context
- Application: Real-time telephony agent
- Stack: FastAPI WebSocket + Twilio + OpenAI LLM + ElevenLabs TTS
- Audio: G.711 μ-law (8kHz sample rate)
We are experiencing significant latency (4-5 seconds) specifically for the initial greeting in the telephony pipeline. Mid-conversation latency is acceptable (1-2 s), but the first response has a substantial delay that impacts the user experience when the call is answered.
Pipeline Configuration:
pipeline = Pipeline([
    transport.input(),              # Twilio WebSocket input
    stt,                            # OpenAI STT (gpt-4o-transcribe)
    transcript.user(),
    context_aggregator.user(),
    llm,                            # OpenAI LLM (gpt-4o-mini)
    tts,                            # ElevenLabs TTS (eleven_flash_v2_5)
    transport.output(),             # Twilio WebSocket output
    transcript.assistant(),
    context_aggregator.assistant(),
])
Through detailed logging with custom frame processors, we identified three major gaps during first-response processing: frame queue delay (347 ms), TTS processing (2.5 s), and transport-to-transcript delay (1.8 s).
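For anyone who wants to reproduce those measurements, this is roughly the kind of processor we used (the class name TimingLogger is ours): a pass-through FrameProcessor that logs a timestamp and the frame type as each frame crosses a point in the pipeline.

import time
from loguru import logger
from pipecat.frames.frames import Frame
from pipecat.processors.frame_processor import FrameDirection, FrameProcessor

class TimingLogger(FrameProcessor):
    """Pass-through processor that logs when each frame crosses this point."""

    def __init__(self, label: str, **kwargs):
        super().__init__(**kwargs)
        self._label = label

    async def process_frame(self, frame: Frame, direction: FrameDirection):
        await super().process_frame(frame, direction)
        logger.debug(f"[{self._label}] {time.monotonic():.3f} {type(frame).__name__}")
        # Forward the frame unchanged so pipeline behavior is unaffected.
        await self.push_frame(frame, direction)

# Usage: place instances around the stage you want to time, e.g.
# Pipeline([..., TimingLogger("pre-tts"), tts, TimingLogger("post-tts"), ...])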
Honestly same problemo man
Yes, same problem
A few questions:
- Are you running locally or deployed?
- Is this a dial-in or dial-out use case? (I'm assuming that you're dialing in to the bot based on the question.)
I just ran this example deployed to Pipecat Cloud and I get a response time of ~1 second after picking up: https://github.com/pipecat-ai/pipecat-examples/tree/main/twilio-chatbot/inbound
My example on Pipecat Cloud runs with min_agents: 1, which ensures that I have a single warm reserve agent available to respond immediately when I dial in. We strongly recommend running with a warm reserve to avoid pod/process startup times.
Hi Mark! Thanks for getting back to us. I am running the agent on localhost using the ngrok webhook, and yes, it is inbound. Would deploying on the cloud speed up the response time? In my pipeline I am using model="gpt-4o-transcribe", model="gpt-4o-mini", and model="tts-1" with FastAPIWebsocketTransport. My initial response takes about 5 seconds, and mid-conversation responses are 6 to 10 seconds too. If it's okay, can I put the Python file somewhere so you can look at it?
# bot.py
#
# Copyright (c) 2025
# SPDX-License-Identifier: BSD 2-Clause License
import datetime
import io
import os
import sys
import wave
from typing import Optional
import aiofiles
from dotenv import load_dotenv
from fastapi import WebSocket
from loguru import logger
from pipecat.observers.loggers.user_bot_latency_log_observer import UserBotLatencyLogObserver
from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineParams, PipelineTask
from pipecat.processors.aggregators.openai_llm_context import OpenAILLMContext
from pipecat.processors.audio.audio_buffer_processor import AudioBufferProcessor
from pipecat.serializers.twilio import TwilioFrameSerializer
from pipecat.services.openai.llm import OpenAILLMService
from pipecat.services.openai.stt import OpenAISTTService
from pipecat.services.openai.tts import OpenAITTSService
from pipecat.transports.websocket.fastapi import (
    FastAPIWebsocketParams,
    FastAPIWebsocketTransport,
)
from pipecat.frames.frames import LLMRunFrame
load_dotenv(override=True)
logger.remove()
logger.add(sys.stderr, level="DEBUG")
async def save_audio(server_name: str, audio: bytes, sample_rate: int, num_channels: int):
    if len(audio) == 0:
        logger.info("No audio data to save")
        return
    filename = f"{server_name}_recording_{datetime.datetime.now().strftime('%Y%m%d_%H%M%S')}.wav"
    with io.BytesIO() as buffer:
        with wave.open(buffer, "wb") as wf:
            wf.setsampwidth(2)
            wf.setnchannels(num_channels)
            wf.setframerate(sample_rate)
            wf.writeframes(audio)
        async with aiofiles.open(filename, "wb") as file:
            await file.write(buffer.getvalue())
    logger.info(f"Merged audio saved to {filename}")


async def run_bot(
    websocket_client: WebSocket,
    stream_sid: Optional[str],
    call_sid: Optional[str],
    account_sid: Optional[str],
    testing: bool,
):
    """Build and run the Pipecat pipeline for a single WebSocket call."""
    # Bi-directional WebSocket transport + Twilio serializer
    transport = FastAPIWebsocketTransport(
        websocket=websocket_client,
        params=FastAPIWebsocketParams(
            audio_in_enabled=True,
            audio_out_enabled=True,
            add_wav_header=False,
            vad_analyzer=SileroVADAnalyzer(),
            serializer=TwilioFrameSerializer(
                stream_sid=stream_sid,
                call_sid=call_sid,
                account_sid=account_sid,
                auth_token=os.getenv("TWILIO_AUTH_TOKEN"),
            ),
        ),
    )

    # Explicit handles
    input_proc = transport.input()
    output_proc = transport.output()

    # Services
    llm = OpenAILLMService(
        api_key=os.getenv("OPENAI_API_KEY"),
        model="gpt-4o-mini",
        generation_params={
            "max_response_tokens": 60,
            "temperature": 0.6,
            "frequency_penalty": 0.0,
            "presence_penalty": 0.0,
        },
    )
    stt = OpenAISTTService(
        api_key=os.getenv("OPENAI_API_KEY"),
        model="gpt-4o-transcribe",
        audio_passthrough=False,
        enable_interim_results=True,
        endpointing_silence_ms=200,
    )
    tts = OpenAITTSService(
        api_key=os.getenv("OPENAI_API_KEY"),
        model="tts-1",
        voice="nova",
    )

    messages = [
        {
            "role": "system",
            "content": (
                "You are a helpful assistant named Tasha. "
                "Your output will be converted to audio so don't include special characters in your answers. "
                "Respond with a short short sentence."
            ),
        }
    ]
    context = OpenAILLMContext(messages)
    context_aggregator = llm.create_context_aggregator(context)

    # Record AFTER output so recording never delays playback
    audiobuffer = AudioBufferProcessor()

    # Build pipeline
    pipeline = Pipeline(
        [
            input_proc,                     # Websocket input from client
            stt,                            # Speech-To-Text
            context_aggregator.user(),      # push user messages into context
            llm,                            # LLM
            tts,                            # Text-To-Speech
            output_proc,                    # Websocket output to client
            audiobuffer,                    # record after output
            context_aggregator.assistant(),
        ]
    )

    task = PipelineTask(
        pipeline,
        observers=[UserBotLatencyLogObserver()],
        params=PipelineParams(
            audio_in_sample_rate=8000,
            audio_out_sample_rate=24000,
            allow_interruptions=True,
        ),
    )

    @transport.event_handler("on_client_connected")
    async def on_client_connected(_transport, _client):
        logger.info("🔌 WebSocket connection established")
        await audiobuffer.start_recording()
        # Seed a one-line intro into context
        messages.append({"role": "system", "content": "Please introduce yourself to the user."})
        # IMPORTANT: trigger the LLM with a run frame
        await task.queue_frames([LLMRunFrame()])

    @transport.event_handler("on_client_disconnected")
    async def on_client_disconnected(_transport, _client):
        logger.info("🔌 WebSocket connection closed by client")
        await task.cancel()

    @audiobuffer.event_handler("on_audio_data")
    async def on_audio_data(_buffer, audio, sample_rate, num_channels):
        server_name = f"server_{websocket_client.client.port}"
        await save_audio(server_name, audio, sample_rate, num_channels)

    runner = PipelineRunner(handle_sigint=False, force_gc=True)
    await runner.run(task)
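For completeness, run_bot is invoked from a FastAPI WebSocket endpoint roughly like the following. This is a hypothetical companion server.py sketch, not the exact file; Twilio Media Streams sends a "connected" event and then a "start" event carrying the stream/call/account SIDs that TwilioFrameSerializer needs.

# server.py (hypothetical sketch): accept Twilio's Media Streams WebSocket,
# read the "start" event for the SIDs, then hand the call off to run_bot.
import json

from fastapi import FastAPI, WebSocket

from bot import run_bot

app = FastAPI()


@app.websocket("/ws")
async def websocket_endpoint(websocket: WebSocket):
    await websocket.accept()
    start_data = None
    # Twilio sends a "connected" event first, then a "start" event with the SIDs.
    while start_data is None:
        message = json.loads(await websocket.receive_text())
        if message.get("event") == "start":
            start_data = message["start"]
    await run_bot(
        websocket_client=websocket,
        stream_sid=start_data["streamSid"],
        call_sid=start_data.get("callSid"),
        account_sid=start_data.get("accountSid"),
        testing=False,
    )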