agents Very High Initial EOU latencies while using Azure STT

We are experiencing a significantly high first-request delay of 4-10 seconds when using Azure Speech-to-Text (Azure-STT) in our Voice-Agent-Pipeline. However, subsequent requests exhibit much lower latency, around 700-750ms.

The agent currently runs on our AWS server in India (we self-host LiveKit) and is connected to telephony through Exotel.

We tried pre-warming the Azure server before dialling the user by sending dummy audio input, but it did not help much in reducing the first EOU latency. We have also raised the issue with the Azure team. Any suggestions on how to tackle this would be helpful.
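For anyone trying to reproduce the warm-up attempt, here is a minimal sketch of how a dummy audio buffer could be generated. The format (16 kHz, mono, 16-bit little-endian PCM, 20 ms frames) and the helper names `dummy_pcm` and `split_frames` are assumptions for illustration, not what we actually used. Handing the frames to the STT stream is left out because it depends on the plugin API.

```python
import math
import struct


def dummy_pcm(duration_s=1.0, sample_rate=16000, freq_hz=440.0, amplitude=0.2):
    """Return a sine tone as 16-bit little-endian mono PCM bytes."""
    n = int(duration_s * sample_rate)
    samples = (
        int(amplitude * 32767 * math.sin(2 * math.pi * freq_hz * i / sample_rate))
        for i in range(n)
    )
    return struct.pack(f"<{n}h", *samples)


def split_frames(pcm, frame_ms=20, sample_rate=16000):
    """Split PCM bytes into fixed-size frames (2 bytes per sample)."""
    step = sample_rate * frame_ms // 1000 * 2
    return [pcm[i:i + step] for i in range(0, len(pcm), step)]
```

A tone or silence may not exercise the recognition path the way real speech does, so a short recorded utterance might be a more realistic warm-up input.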

Attaching the STT and VAD configs along with a sample log file for reference.

Azure-configs (Tried with multi-lingual detection as well, but the issue still persists).

    proc.userdata["stt"] = azure.STT(
        speech_key="<key>",
        speech_region="centralindia",
        language="hi-IN",
        segmentation_silence_timeout_ms=300,
    )

VAD-configs

    proc.userdata["vad"] = silero.VAD.load(
        min_silence_duration=0.1,
        prefix_padding_duration=0.3,
        min_speech_duration=0.1,
        max_buffered_speech=40,
        activation_threshold=0.4,
    )

Voice-Agent-Pipeline Configs

    agent = VoicePipelineAgent(
        vad=ctx.proc.userdata["vad"],
        stt=ctx.proc.userdata["stt"],
        llm=google_llm,
        tts=azure_female_tts,
        chat_ctx=initial_ctx,
        fnc_ctx=CallActions(api=ctx.api, participant=participant, room=ctx.room),
        min_endpointing_delay=0.1,
        max_endpointing_delay=0.15,
    )

Sample-Logs (While running the agent in start-mode).

{"message": "user has picked up", "level": "INFO", "name": "outbound-caller", "pid": 91684, "job_id": "AJ_vULzZB5q72qJ", "timestamp": "2025-03-23T07:43:50.191702+00:00"}
{"message": "Pipeline EOU metrics: sequence_id=fea6e2011213, end_of_utterance_delay=4.12, transcription_delay=4.12", "level": "INFO", "name": "livekit.agents", "pid": 91684, "job_id": "AJ_vULzZB5q72qJ", "timestamp": "2025-03-23T07:43:55.045604+00:00"} 
{"message": "AFC is enabled with max remote calls: 10.", "level": "INFO", "name": "google_genai.models", "pid": 91684, "job_id": "AJ_vULzZB5q72qJ", "timestamp": "2025-03-23T07:43:55.047839+00:00"}
{"message": "AFC remote call 1 is done.", "level": "INFO", "name": "google_genai.models", "pid": 91684, "job_id": "AJ_vULzZB5q72qJ", "timestamp": "2025-03-23T07:43:57.142847+00:00"}
{"message": "Pipeline LLM metrics: sequence_id=fea6e2011213, ttft=2.10, input_tokens=839, output_tokens=11, tokens_per_second=4.99", "level": "INFO", "name": "livekit.agents", "pid": 91684, "job_id": "AJ_vULzZB5q72qJ", "timestamp": "2025-03-23T07:43:57.251831+00:00"}
Info: on_underlying_io_bytes_received: Close frame received
Info: on_underlying_io_bytes_received: closing underlying io.
Info: on_underlying_io_close_complete: uws_state: 6.
{"message": "Pipeline TTS metrics: sequence_id=fea6e2011213, ttfb=0.36122050000085437, audio_duration=2.69", "level": "INFO", "name": "livekit.agents", "pid": 91684, "job_id": "AJ_vULzZB5q72qJ", "timestamp": "2025-03-23T07:43:57.660506+00:00"}
{"message": "Pipeline EOU metrics: sequence_id=5da306e2d01c, end_of_utterance_delay=0.64, transcription_delay=0.63", "level": "INFO", "name": "livekit.agents", "pid": 91684, "job_id": "AJ_vULzZB5q72qJ", "timestamp": "2025-03-23T07:44:01.970419+00:00"} 
{"message": "AFC is enabled with max remote calls: 10.", "level": "INFO", "name": "google_genai.models", "pid": 91684, "job_id": "AJ_vULzZB5q72qJ", "timestamp": "2025-03-23T07:44:01.971899+00:00"}
{"message": "AFC remote call 1 is done.", "level": "INFO", "name": "google_genai.models", "pid": 91684, "job_id": "AJ_vULzZB5q72qJ", "timestamp": "2025-03-23T07:44:04.025985+00:00"}
{"message": "Pipeline LLM metrics: sequence_id=5da306e2d01c, ttft=2.06, input_tokens=854, output_tokens=25, tokens_per_second=11.50", "level": "INFO", "name": "livekit.agents", "pid": 91684, "job_id": "AJ_vULzZB5q72qJ", "timestamp": "2025-03-23T07:44:04.144913+00:00"}
Info: on_underlying_io_bytes_received: Close frame received
Info: on_underlying_io_bytes_received: closing underlying io.
Info: on_underlying_io_close_complete: uws_state: 6.
{"message": "Pipeline TTS metrics: sequence_id=5da306e2d01c, ttfb=0.6690272499999992, audio_duration=6.73", "level": "INFO", "name": "livekit.agents", "pid": 91684, "job_id": "AJ_vULzZB5q72qJ", "timestamp": "2025-03-23T07:44:05.154018+00:00"}
{"message": "Pipeline EOU metrics: sequence_id=5c8d081591a3, end_of_utterance_delay=0.60, transcription_delay=0.60", "level": "INFO", "name": "livekit.agents", "pid": 91684, "job_id": "AJ_vULzZB5q72qJ", "timestamp": "2025-03-23T07:44:13.397990+00:00"} 
{"message": "AFC is enabled with max remote calls: 10.", "level": "INFO", "name": "google_genai.models", "pid": 91684, "job_id": "AJ_vULzZB5q72qJ", "timestamp": "2025-03-23T07:44:13.398816+00:00"}
{"message": "AFC remote call 1 is done.", "level": "INFO", "name": "google_genai.models", "pid": 91684, "job_id": "AJ_vULzZB5q72qJ", "timestamp": "2025-03-23T07:44:15.592434+00:00"}
{"message": "Pipeline LLM metrics: sequence_id=5c8d081591a3, ttft=2.20, input_tokens=885, output_tokens=44, tokens_per_second=16.96", "level": "INFO", "name": "livekit.agents", "pid": 91684, "job_id": "AJ_vULzZB5q72qJ", "timestamp": "2025-03-23T07:44:15.993714+00:00"}
Info: on_underlying_io_close_complete: uws_state: 6.
Info: uws_client_close_async: closed underlying io.
{"message": "Pipeline TTS metrics: sequence_id=5c8d081591a3, ttfb=3.5816812089997256, audio_duration=7.75", "level": "INFO", "name": "livekit.agents", "pid": 91684, "job_id": "AJ_vULzZB5q72qJ", "timestamp": "2025-03-23T07:44:21.222715+00:00"}
Info: on_underlying_io_bytes_received: Close frame received
Info: on_underlying_io_bytes_received: closing underlying io.
Info: on_underlying_io_close_complete: uws_state: 6. 
{"message": "Pipeline TTS metrics: sequence_id=5c8d081591a3, ttfb=3.9616254579996166, audio_duration=2.30", "level": "INFO", "name": "livekit.agents", "pid": 91684, "job_id": "AJ_vULzZB5q72qJ", "timestamp": "2025-03-23T07:44:25.774709+00:00"}
{"message": "Pipeline EOU metrics: sequence_id=da99722c123d, end_of_utterance_delay=0.62, transcription_delay=0.62", "level": "INFO", "name": "livekit.agents", "pid": 91684, "job_id": "AJ_vULzZB5q72qJ", "timestamp": "2025-03-23T07:44:33.576190+00:00"}
{"message": "AFC is enabled with max remote calls: 10.", "level": "INFO", "name": "google_genai.models", "pid": 91684, "job_id": "AJ_vULzZB5q72qJ", "timestamp": "2025-03-23T07:44:33.577608+00:00"}
{"message": "AFC remote call 1 is done.", "level": "INFO", "name": "google_genai.models", "pid": 91684, "job_id": "AJ_vULzZB5q72qJ", "timestamp": "2025-03-23T07:44:35.543008+00:00"}
{"message": "Pipeline LLM metrics: sequence_id=da99722c123d, ttft=1.97, input_tokens=936, output_tokens=21, tokens_per_second=9.93", "level": "INFO", "name": "livekit.agents", "pid": 91684, "job_id": "AJ_vULzZB5q72qJ", "timestamp": "2025-03-23T07:44:35.692191+00:00"}
Info: on_underlying_io_bytes_received: Close frame received
Info: on_underlying_io_bytes_received: closing underlying io.
Info: on_underlying_io_close_complete: uws_state: 6.
{"message": "Pipeline TTS metrics: sequence_id=da99722c123d, ttfb=0.41878175000056217, audio_duration=2.28", "level": "INFO", "name": "livekit.agents", "pid": 91684, "job_id": "AJ_vULzZB5q72qJ", "timestamp": "2025-03-23T07:44:36.131160+00:00"}
Info: on_underlying_io_bytes_received: Close frame received
Info: on_underlying_io_bytes_received: closing underlying io.
Info: on_underlying_io_close_complete: uws_state: 6.
{"message": "Pipeline TTS metrics: sequence_id=da99722c123d, ttfb=0.36498150000079477, audio_duration=2.93", "level": "INFO", "name": "livekit.agents", "pid": 91684, "job_id": "AJ_vULzZB5q72qJ", "timestamp": "2025-03-23T07:44:36.585260+00:00"}
Info: on_underlying_io_bytes_received: Close frame received
Info: on_underlying_io_bytes_received: closing underlying io.
Info: on_underlying_io_close_complete: uws_state: 6.
{"message": "Pipeline TTS metrics: sequence_id=da99722c123d, ttfb=0.5957701669995004, audio_duration=2.33", "level": "INFO", "name": "livekit.agents", "pid": 91684, "job_id": "AJ_vULzZB5q72qJ", "timestamp": "2025-03-23T07:44:37.304937+00:00"}
{"message": "Pipeline EOU metrics: sequence_id=595fb64d0b22, end_of_utterance_delay=0.70, transcription_delay=0.70", "level": "INFO", "name": "livekit.agents", "pid": 91684, "job_id": "AJ_vULzZB5q72qJ", "timestamp": "2025-03-23T07:44:47.542178+00:00"} 
{"message": "AFC is enabled with max remote calls: 10.", "level": "INFO", "name": "google_genai.models", "pid": 91684, "job_id": "AJ_vULzZB5q72qJ", "timestamp": "2025-03-23T07:44:47.543905+00:00"}
{"message": "AFC remote call 1 is done.", "level": "INFO", "name": "google_genai.models", "pid": 91684, "job_id": "AJ_vULzZB5q72qJ", "timestamp": "2025-03-23T07:44:49.462556+00:00"}
{"message": "Pipeline LLM metrics: sequence_id=595fb64d0b22, ttft=1.92, input_tokens=958, output_tokens=29, tokens_per_second=13.81", "level": "INFO", "name": "livekit.agents", "pid": 91684, "job_id": "AJ_vULzZB5q72qJ", "timestamp": "2025-03-23T07:44:49.644142+00:00"}
Info: on_underlying_io_bytes_received: Close frame received
Info: on_underlying_io_bytes_received: closing underlying io.
Info: on_underlying_io_close_complete: uws_state: 6.
{"message": "Pipeline TTS metrics: sequence_id=595fb64d0b22, ttfb=0.7348645000001852, audio_duration=5.04", "level": "INFO", "name": "livekit.agents", "pid": 91684, "job_id": "AJ_vULzZB5q72qJ", "timestamp": "2025-03-23T07:44:50.467473+00:00"}
Info: on_underlying_io_bytes_received: Close frame received
Info: on_underlying_io_bytes_received: closing underlying io.
Info: on_underlying_io_close_complete: uws_state: 6.
{"message": "Pipeline TTS metrics: sequence_id=595fb64d0b22, ttfb=0.44387512499997683, audio_duration=2.61", "level": "INFO", "name": "livekit.agents", "pid": 91684, "job_id": "AJ_vULzZB5q72qJ", "timestamp": "2025-03-23T07:44:50.984306+00:00"}
{"message": "Pipeline EOU metrics: sequence_id=5b3466823aea, end_of_utterance_delay=0.84, transcription_delay=0.84", "level": "INFO", "name": "livekit.agents", "pid": 91684, "job_id": "AJ_vULzZB5q72qJ", "timestamp": "2025-03-23T07:45:00.130514+00:00"}
{"message": "AFC is enabled with max remote calls: 10.", "level": "INFO", "name": "google_genai.models", "pid": 91684, "job_id": "AJ_vULzZB5q72qJ", "timestamp": "2025-03-23T07:45:00.131955+00:00"}
{"message": "AFC remote call 1 is done.", "level": "INFO", "name": "google_genai.models", "pid": 91684, "job_id": "AJ_vULzZB5q72qJ", "timestamp": "2025-03-23T07:45:01.981920+00:00"}
{"message": "Pipeline LLM metrics: sequence_id=5b3466823aea, ttft=1.85, input_tokens=994, output_tokens=34, tokens_per_second=15.88", "level": "INFO", "name": "livekit.agents", "pid": 91684, "job_id": "AJ_vULzZB5q72qJ", "timestamp": "2025-03-23T07:45:02.272808+00:00"}
Info: on_underlying_io_bytes_received: Close frame received
Info: on_underlying_io_bytes_received: closing underlying io.
Info: on_underlying_io_close_complete: uws_state: 6. 
{"message": "Pipeline TTS metrics: sequence_id=5b3466823aea, ttfb=0.6538038329999836, audio_duration=4.85", "level": "INFO", "name": "livekit.agents", "pid": 91684, "job_id": "AJ_vULzZB5q72qJ", "timestamp": "2025-03-23T07:45:02.960286+00:00"}
Info: on_underlying_io_bytes_received: Close frame received
Info: on_underlying_io_bytes_received: closing underlying io.
Info: on_underlying_io_close_complete: uws_state: 6.
{"message": "Pipeline TTS metrics: sequence_id=5b3466823aea, ttfb=0.42698366699914914, audio_duration=2.63", "level": "INFO", "name": "livekit.agents", "pid": 91684, "job_id": "AJ_vULzZB5q72qJ", "timestamp": "2025-03-23T07:45:03.456047+00:00"}
Info: on_underlying_io_bytes_received: Close frame received
Info: on_underlying_io_bytes_received: closing underlying io.
Info: on_underlying_io_close_complete: uws_state: 6.
{"message": "Pipeline TTS metrics: sequence_id=5b3466823aea, ttfb=0.4602830419989914, audio_duration=3.07", "level": "INFO", "name": "livekit.agents", "pid": 91684, "job_id": "AJ_vULzZB5q72qJ", "timestamp": "2025-03-23T07:45:04.017993+00:00"}
{"message": "Pipeline EOU metrics: sequence_id=f4c709e75e03, end_of_utterance_delay=0.66, transcription_delay=0.66", "level": "INFO", "name": "livekit.agents", "pid": 91684, "job_id": "AJ_vULzZB5q72qJ", "timestamp": "2025-03-23T07:45:16.397638+00:00"} 
{"message": "AFC is enabled with max remote calls: 10.", "level": "INFO", "name": "google_genai.models", "pid": 91684, "job_id": "AJ_vULzZB5q72qJ", "timestamp": "2025-03-23T07:45:16.399153+00:00"}
{"message": "AFC remote call 1 is done.", "level": "INFO", "name": "google_genai.models", "pid": 91684, "job_id": "AJ_vULzZB5q72qJ", "timestamp": "2025-03-23T07:45:18.398228+00:00"}
{"message": "Pipeline LLM metrics: sequence_id=f4c709e75e03, ttft=2.00, input_tokens=1031, output_tokens=18, tokens_per_second=8.48", "level": "INFO", "name": "livekit.agents", "pid": 91684, "job_id": "AJ_vULzZB5q72qJ", "timestamp": "2025-03-23T07:45:18.521107+00:00"}
Info: on_underlying_io_bytes_received: Close frame received
Info: on_underlying_io_bytes_received: closing underlying io.
Info: on_underlying_io_close_complete: uws_state: 6.
{"message": "Pipeline TTS metrics: sequence_id=f4c709e75e03, ttfb=0.43270599999959813, audio_duration=2.17", "level": "INFO", "name": "livekit.agents", "pid": 91684, "job_id": "AJ_vULzZB5q72qJ", "timestamp": "2025-03-23T07:45:18.967886+00:00"}
Info: on_underlying_io_bytes_received: Close frame received
Info: on_underlying_io_bytes_received: closing underlying io.
Info: on_underlying_io_close_complete: uws_state: 6.
{"message": "Pipeline TTS metrics: sequence_id=f4c709e75e03, ttfb=0.40327341699958197, audio_duration=1.69", "level": "INFO", "name": "livekit.agents", "pid": 91684, "job_id": "AJ_vULzZB5q72qJ", "timestamp": "2025-03-23T07:45:19.432806+00:00"}
{"message": "ending the call for phone_user", "level": "INFO", "name": "outbound-caller", "pid": 91684, "job_id": "AJ_vULzZB5q72qJ", "timestamp": "2025-03-23T07:45:22.784387+00:00"}
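To compare the first turn against later turns without reading the log by eye, a small parser along these lines can pull the EOU delays out of the JSON log lines. It is only a sketch: the regular expression assumes the exact "Pipeline EOU metrics" message format shown in the log above, and the function names `eou_delays` and `summarize` are made up for this example.

```python
import json
import re

# Matches the EOU metrics message format seen in the sample log.
EOU_RE = re.compile(r"end_of_utterance_delay=([\d.]+), transcription_delay=([\d.]+)")


def eou_delays(log_lines):
    """Return end_of_utterance_delay values (seconds) in log order."""
    delays = []
    for line in log_lines:
        line = line.strip()
        if not line.startswith("{"):
            continue  # skip non-JSON lines such as the "Info: ..." output
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        match = EOU_RE.search(record.get("message", ""))
        if match:
            delays.append(float(match.group(1)))
    return delays


def summarize(delays):
    """Separate the first turn from the rest and average the remainder."""
    if not delays:
        return None
    rest = delays[1:]
    return {
        "first": delays[0],
        "rest_mean": sum(rest) / len(rest) if rest else None,
        "rest_max": max(rest) if rest else None,
    }
```

Calling `summarize(eou_delays(open("agent.log")))` on a saved log (the file name is hypothetical) gives the first-turn delay next to the mean and maximum of the later turns.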

Apr 05 '25 07:04 Jeeva-MV

Has this been fixed by them or some fix worked out for you? @Jeeva-MV

Apr 08 '25 17:04 tanmaydesai89

> Has this been fixed by them or some fix worked out for you? @Jeeva-MV

No it's not yet fixed, we are retrying now with the latest version upgrades.

Apr 09 '25 05:04 Jeeva-MV

we are digging into this one. Azure should have really fast inference times.

May 06 '25 20:05 davidzhao

Any updates? I am also experiencing this issue with Google STT:

[2025-05-26 00:28:00.798046] User state changed: listening -> speaking
[2025-05-26 00:28:03.105975] User state changed: speaking -> listening
I0000 00:00:1748212086.113162 70241269 fork_posix.cc:75] Other threads are currently calling into gRPC, skipping fork() handlers
I0000 00:00:1748212086.406600 70241269 fork_posix.cc:75] Other threads are currently calling into gRPC, skipping fork() handlers
2025-05-26 00:28:06,676 - WARNING livekit.agents - Running <Task pending name='recognize' ... > took too long: 3.57 seconds {"pid": 47254, "job_id": "AJ_EYxsFmA2Dw8s"}
2025-05-26 00:28:06,779 - DEBUG urllib3.connectionpool - Starting new HTTPS connection (1): oauth2.googleapis.com:443 {"pid": 47254, "job_id": "AJ_EYxsFmA2Dw8s"}
2025-05-26 00:28:06,866 - DEBUG urllib3.connectionpool - https://oauth2.googleapis.com:443 "POST /token HTTP/1.1" 200 None {"pid": 47254, "job_id": "AJ_EYxsFmA2Dw8s"}
[2025-05-26 00:28:08.039665] User input transcribed: Oui, bonjour. Je voudrais prendre un rendez-vous, s'il vous plaît., final: True
2025-05-26 00:28:08,038 - INFO livekit.agents - STT metrics: audio_duration=2.87 {"pid": 47254, "job_id": "AJ_EYxsFmA2Dw8s"}
2025-05-26 00:28:08,039 - DEBUG livekit.agents - received user transcript {"user_transcript": "Oui, bonjour. Je voudrais prendre un rendez-vous, s'il vous pla\u00eet.", "language": "fr-FR", "pid": 47254, "job_id": "AJ_EYxsFmA2Dw8s"}
NOT_GIVEN
2025-05-26 00:28:08,043 - INFO livekit.agents - EOU metrics: end_of_utterance_delay=5.51, transcription_delay=5.51 {"pid": 47254, "job_id": "AJ_EYxsFmA2Dw8s"}

May 25 '25 22:05 aurelien-ldp

I am also experiencing this issue with other STT services, including self-hosted models. I tried warming up these services with initial requests, but it doesn't seem to work. The initial EOU latencies are too high. I have the same experience with TTS services. Have you guys found a solution to this issue? @aurelien-ldp @Jeeva-MV

Jul 15 '25 04:07 ngoanpv

We are experiencing the same.

Aug 03 '25 20:08 random-checkin

Gently bringing this to the notice of @tanmaydesai89 and @Jeeva-MV

Aug 05 '25 07:08 random-checkin

This might be related to automatic language detection, which was previously always enabled even when only one language was specified. That was fixed in https://github.com/livekit/agents/pull/2959 (after agents 1.2.1). Can you try the latest version and see how it works?

Aug 05 '25 07:08 longcw

Thanks @longcw. The issue is the same with both single-language and multilingual settings: the first-turn EOU is high.

If you test Azure STT without livekit, this issue doesn't appear.
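A rough way to make that comparison repeatable is to time the same recognition call several times, both standalone and inside the agent, and compare the first call against the rest. The sketch below uses only the standard library. `time_calls` and `first_call_penalty` are hypothetical helper names, and `fn` stands in for whatever function you write to run one recognition request against your STT client.

```python
import statistics
import time
from typing import Callable, List


def time_calls(fn: Callable[[], object], n: int = 5) -> List[float]:
    """Call fn n times and return each call's wall-clock duration in seconds."""
    durations = []
    for _ in range(n):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


def first_call_penalty(durations: List[float]) -> float:
    """Return the first call's duration minus the median of the later calls."""
    if len(durations) < 2:
        raise ValueError("need at least two timed calls")
    return durations[0] - statistics.median(durations[1:])
```

If the penalty is near zero standalone but several seconds inside the agent, that points at connection or session setup in the pipeline rather than at the STT service itself.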

Aug 06 '25 01:08 random-checkin

@random-checkin Which version are you using? Can you share the output of `pip list | grep livekit`?
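If `grep` is not available (for example on Windows), roughly the same information can be collected from inside the agent's Python environment with the standard library. This is just a sketch, and `livekit_packages` is a name invented for this example.

```python
from importlib.metadata import distributions


def livekit_packages(needle: str = "livekit"):
    """Return {package name: version} for installed packages whose name contains needle."""
    found = {}
    for dist in distributions():
        name = dist.metadata.get("Name")
        if name and needle in name.lower():
            found[name] = dist.version
    return dict(sorted(found.items()))


if __name__ == "__main__":
    for name, version in livekit_packages().items():
        print(f"{name}=={version}")
```

Running it with the same interpreter that runs the agent avoids reporting versions from a different virtual environment.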

Aug 06 '25 02:08 longcw