
Prewarm the LLM before session starts

Open · davidzhao opened this issue 4 months ago · 9 comments

Currently (1.2) we prewarm the connection with STT and TTS, but not with the LLMs.

This is because LLMs usually require us to perform an inference request, instead of just opening an HTTP connection to /. There are a bunch of benefits to prewarming the LLM; the primary one is that we could bypass the initial connection setup time, which can add up to ~2s (DNS, SSL round trips).
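For concreteness, an inference-based prewarm would look roughly like this (sketch only, nothing like this exists in the framework today; the model name and prompt are placeholders):

```python
# Sketch of an inference-based LLM prewarm: fire a throwaway 1-token request
# before the session starts so DNS/TCP/TLS setup (and any provider-side work)
# happens ahead of the first real turn. Best-effort: failures must never
# block the session.
import asyncio

from openai import AsyncOpenAI


async def prewarm_llm(client: AsyncOpenAI) -> None:
    try:
        await client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model
            messages=[{"role": "user", "content": "ping"}],
            max_tokens=1,  # keep the warm-up request as cheap as possible
        )
    except Exception:
        pass


if __name__ == "__main__":
    asyncio.run(prewarm_llm(AsyncOpenAI()))
```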

davidzhao · Aug 23 '25

Looking for this as well. I'm getting better results with gpt-5, but the latency on the first turn is unbearable, which makes this more pressing: I'm seeing 4+ seconds for first calls vs 1.2 seconds for cached calls.

galigutta · Aug 26 '25

@davidzhao where do you prewarm STT and TTS in livekit agents? I guess it's not implemented for any plugins. We are seeing very high first-speech latency from the agent after the user's message. Can you help me with how you prewarm STT and TTS?

abhismatrix1 · Sep 01 '25

@davidzhao Is there a guide on how to prewarm LLMs explicitly?

ss14 · Sep 30 '25

Are there any updates on this?

anlagbr · Oct 01 '25

Hi @davidzhao,

I've been looking into LLM prewarming and wanted to share my findings.

You mentioned that LLMs typically require an inference request for prewarming, unlike STT/TTS which can just open an HTTP connection to /. However, I discovered that for OpenAI and other HTTP-based LLM APIs, we can actually achieve the same connection prewarming benefits without needing to send a full inference request.

For self-hosted LLMs, sending an inference request during prewarm makes sense to "wake up" the model and load it into memory. However, for public LLM services (OpenAI, Anthropic, Google, etc.), the models are already running and serving requests globally. In this case, the primary latency bottleneck is the client-side connection establishment (DNS, TCP, TLS), not the model availability.
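You can see the effect with a quick timing check like this (illustrative only; it times two GETs to the API's base URL over a shared httpx client, so the second one reuses the already-open connection):

```python
# Rough illustration: compare a cold request (DNS + TCP + TLS + request) with
# a warm one that reuses the pooled connection from the same client.
import asyncio
import time

import httpx


async def timed_get(client: httpx.AsyncClient, label: str) -> None:
    start = time.perf_counter()
    # The response status doesn't matter; we only care about connection setup cost.
    await client.get("/")
    print(f"{label}: {time.perf_counter() - start:.3f}s")


async def main() -> None:
    async with httpx.AsyncClient(base_url="https://api.openai.com") as client:
        await timed_get(client, "cold request")
        await timed_get(client, "warm request (connection reused)")


asyncio.run(main())
```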

I've implemented a prewarm() method for the OpenAI LLM plugin that:

  1. Makes a lightweight GET request to / in a background task
  2. Establishes the HTTP connection (DNS resolution, TCP handshake, TLS negotiation)
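Roughly, the idea looks like this (a simplified sketch rather than the exact code in the PR; the class and attribute names are illustrative):

```python
# Simplified sketch of a connection-only prewarm for an HTTP-based LLM plugin:
# issue a cheap GET to the API root in a background task so DNS resolution,
# the TCP handshake, and TLS negotiation are done before the first inference.
import asyncio

import httpx


class PrewarmableLLM:
    def __init__(self, base_url: str = "https://api.openai.com") -> None:
        # The same client must be reused for the real requests later, otherwise
        # the pooled connection (and the prewarm benefit) is lost.
        self._client = httpx.AsyncClient(base_url=base_url)
        self._prewarm_task: asyncio.Task | None = None

    def prewarm(self) -> None:
        """Start connection setup in the background; call from the running event loop."""
        if self._prewarm_task is None:
            self._prewarm_task = asyncio.create_task(self._warm_connection())

    async def _warm_connection(self) -> None:
        try:
            # The response itself is irrelevant; only the open connection matters.
            await self._client.get("/", timeout=5.0)
        except httpx.HTTPError:
            pass
```

The key detail is that the actual inference requests have to go through the same underlying client/session, so they reuse the connection opened by prewarm().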

Test Results

[test results screenshot]

If you believe this approach is correct, I'd be happy to submit a PR for your review! Let me know your thoughts!

Pulkit0729 · Nov 04 '25

@Pulkit0729 could you share your code even if dirty?

marctorsoc · Nov 07 '25

@marctorsoc I've added a PR where you can see the changes. The changes are only for the OpenAI LLM for now and can be added for others as well once the logic is approved.

Pulkit0729 · Nov 07 '25

thanks @Pulkit0729, I took a look and left a comment. Looks good to me!

marctorsoc · Nov 10 '25

@Pulkit0729 this is for the openai plugin. How would this work if I use OpenAI via LiveKit Inference? Is it worth prewarming and sticking with the plugin vs using LiveKit Inference? (It's supposed to be more stable; not sure about latency.)

@davidzhao @longcw is this on the roadmap / already implemented in 1.3?

marctorsoc · Nov 24 '25