Prewarm the LLM before session starts
Currently (1.2) we prewarm the connections to the STT and TTS providers, but not to the LLM.
This is because LLMs usually require us to perform an inference request, instead of just opening an HTTP connection to `/`. There are a number of benefits to prewarming the LLM; the primary one is that we could bypass the initial connection setup time, which can add up to ~2s (DNS, SSL round trips).
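As a stopgap, something along these lines can be done in user code today. This is just a sketch, not an existing agents API: it uses the OpenAI SDK directly, and it assumes the warmed client can be reused by the LLM plugin.

```python
# Hypothetical workaround, not an existing agents API: warm the HTTPS
# connection to the LLM provider before the first turn. The OpenAI SDK keeps
# an internal connection pool, so a cheap GET (models.list) pays the
# DNS/TCP/TLS cost up front instead of on the first chat completion.
import asyncio
from openai import AsyncOpenAI

async def prewarm_llm_connection(client: AsyncOpenAI) -> None:
    try:
        # We only care about the side effect of opening the connection,
        # not the response body.
        await client.models.list()
    except Exception:
        # Prewarming is best-effort; never fail session startup because of it.
        pass

async def entrypoint() -> None:
    client = AsyncOpenAI()  # assumes OPENAI_API_KEY is set
    # Kick this off in the background while STT/TTS and the room connection
    # are being set up, then reuse `client` for the LLM plugin if it accepts
    # an external client.
    asyncio.create_task(prewarm_llm_connection(client))
    ...
```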
Looking for this as well. I am having better results with gpt-5, but the latency is unbearable on the first turn, which makes this more pressing. I am seeing 4+ seconds for first calls vs 1.2 seconds for cached calls.
@davidzhao where do you prewarm STT and TTS in livekit agents? I guess it's not implemented for any plugins. We are seeing very high first-speech latency from the agent after the user message. Can you help me understand how to prewarm STT and TTS?
@davidzhao Is there a guide how to prewarm LLMs explicitly?
Are there any updates on this?
Hi @davidzhao,
I've been looking into LLM prewarming and wanted to share my findings.
You mentioned that LLMs typically require an inference request for prewarming, unlike STT/TTS which can just open an HTTP connection to `/`. However, I discovered that for OpenAI and other HTTP-based LLM APIs, we can actually achieve the same connection-prewarming benefits without needing to send a full inference request.
For self-hosted LLMs, sending an inference request during prewarm makes sense to "wake up" the model and load it into memory. However, for public LLM services (OpenAI, Anthropic, Google, etc.), the models are already running and serving requests globally. In this case, the primary latency bottleneck is the client-side connection establishment (DNS, TCP, TLS), not the model availability.
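To make that distinction concrete, here is a minimal sketch of the two strategies; the base URLs and the model name are placeholders, not real configuration:

```python
# Minimal sketch of the two prewarm strategies; URLs and model are placeholders.
import httpx

async def prewarm_hosted_api(client: httpx.AsyncClient, base_url: str) -> None:
    # Hosted services (OpenAI, Anthropic, Google) are already serving traffic,
    # so a bare GET is enough: it pays the DNS / TCP / TLS cost and leaves a
    # warm connection in the pool for the first real completion.
    await client.get(base_url)

async def prewarm_self_hosted(client: httpx.AsyncClient, base_url: str) -> None:
    # A self-hosted model may need a real inference request to load (or keep)
    # the weights in memory, so send the cheapest one possible.
    await client.post(
        f"{base_url}/v1/chat/completions",
        json={
            "model": "local-model",  # placeholder
            "messages": [{"role": "user", "content": "hi"}],
            "max_tokens": 1,
        },
    )
```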
I've implemented a `prewarm()` method for the OpenAI LLM plugin (rough sketch below) that:
- Makes a lightweight GET request to `/` in a background task
- Establishes the HTTP connection (DNS resolution, TCP handshake, TLS negotiation)
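Rough shape of the change, not the actual PR code; the class layout and attribute names here are assumed for illustration:

```python
# Illustrative only, not the PR itself; attribute names are assumed.
import asyncio
import httpx

class LLM:
    def __init__(self, base_url: str = "https://api.openai.com/v1") -> None:
        self._base_url = base_url
        self._client = httpx.AsyncClient()
        self._prewarm_task: asyncio.Task | None = None

    def prewarm(self) -> None:
        # Fire-and-forget: establish DNS / TCP / TLS in the background so the
        # first chat completion reuses an already-open connection.
        self._prewarm_task = asyncio.create_task(self._prewarm_impl())

    async def _prewarm_impl(self) -> None:
        try:
            # Any response (even a 404) is fine; only the handshake matters.
            await self._client.get(self._base_url)
        except Exception:
            pass  # best-effort; never surface prewarm errors to the session
```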
Test Results
@Pulkit0729 could you share your code even if dirty?
@marctorsoc I have added a PR where you can see the changes. The changes are only for the OpenAI LLM for now and can be added for others as well once the approach is approved.
thanks @Pulkit0729 , I took a look and left a comment. Looks good to me!
@Pulkit0729 this is for the openai plugin. How would this work if I use OpenAI via LiveKit Inference? Is it worth prewarming and sticking with the plugin vs using LiveKit Inference? (supposed to be more stable, not sure about latency)
@davidzhao @longcw is this on the roadmap / already implemented in 1.3?