preemptive_generation=True causes duplicate LLM requests and doubled token costs
### Bug Description
When `preemptive_generation=True` is enabled in `AgentSession`, two separate LLM requests are made per user turn, both completing successfully (both with `cancelled=False`), resulting in doubled token consumption and API costs.
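For reference, the behavior is triggered by a session configured roughly like this (a minimal sketch; the plugin choices are placeholders and only `preemptive_generation=True` is relevant):

```python
# Minimal sketch of the session setup; plugin choices are placeholders,
# the relevant part is preemptive_generation=True.
from livekit.agents import AgentSession
from livekit.plugins import deepgram, openai, silero

session = AgentSession(
    stt=deepgram.STT(),
    llm=openai.LLM(model="gpt-4o-2024-11-20"),
    tts=openai.TTS(),
    vad=silero.VAD.load(),
    preemptive_generation=True,  # enables the early inference calls described below
)
```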
### Expected Behavior
According to the documentation:

> When True, the agent sends inference calls as soon as a user transcript is received rather than waiting for a definitive turn boundary. This can reduce response latency by overlapping model inference with user audio, but may incur extra compute if the user interrupts or revises mid-utterance.
The expected behavior is:

- Preemptive generation starts when STT detects the end of a phrase
- When the turn is confirmed:
  - If the context is unchanged: reuse the preemptive generation (✅ works)
  - If the context changed: cancel the preemptive generation and make ONE new request (see the sketch below)
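To make the expectation concrete, here is a small sketch of the intended turn-confirmation decision (the names below are placeholders, not the actual livekit-agents internals): whichever branch is taken, a turn should produce exactly one completed, billable LLM request.

```python
# Hedged sketch of the expected decision; PreemptiveGeneration and
# decide_on_turn_confirmed are placeholders, not livekit-agents internals.
from dataclasses import dataclass


@dataclass
class PreemptiveGeneration:
    context_hash: int  # snapshot of the chat context the early request was built from


def decide_on_turn_confirmed(preemptive: PreemptiveGeneration, confirmed_hash: int) -> str:
    if preemptive.context_hash == confirmed_hash:
        # Context unchanged: reuse the in-flight preemptive generation.
        return "reuse_preemptive"
    # Context changed: cancel the preemptive request (it must not complete or
    # emit cancelled=False metrics) and issue exactly ONE replacement request.
    return "cancel_and_make_one_new_request"


print(decide_on_turn_confirmed(PreemptiveGeneration(context_hash=1), 1))  # reuse_preemptive
print(decide_on_turn_confirmed(PreemptiveGeneration(context_hash=1), 2))  # cancel_and_make_one_new_request
```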
### Actual Behavior
When the context/tools change between preemptive generation and turn confirmation:

1. The preemptive generation completes successfully → emits metrics with `cancelled=False`
2. A context/tools change is detected
3. The code attempts to cancel the preemptive generation (but it has already completed)
4. A new generation starts and completes successfully → emits metrics with `cancelled=False`

Result: two complete LLM requests with the same total token count, doubling the cost.
### Reproduction Steps
Here's a real example from production logs showing duplicate LLM metrics:
First Request:

- request_id: `chatcmpl-CkvaZ4BEPx98rb7xTqtcXCuuMDibA`
- cancelled: `False`
- Tokens: 14858 (prompt: 14844, completion: 14, cached: 0)
- TTFT: 0.890s
- Duration: 0.96s

Second Request:

- request_id: `chatcmpl-Ckvb0fjHyUfEq3lHnnsBFrLyrDQqN`
- cancelled: `False`
- Tokens: 14858 (prompt: 14847, completion: 11, cached: 14720)
- TTFT: 0.532s
- Duration: 0.60s
Notice:

- Different `request_id` (two separate requests)
- Both `cancelled=False` (both completed successfully)
- Same total tokens (14858)
- The second request has cached tokens (14720) from the first request
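One way to surface the duplicates when reproducing is to log every `metrics_collected` event; a sketch of that logging is below (it assumes the 1.x session event API and the `LLMMetrics` fields shown above, e.g. `request_id`, `cancelled`, token counts, `ttft`):

```python
# Sketch of the metrics logging used to observe the numbers above; assumes the
# "metrics_collected" session event and the LLMMetrics fields referenced in
# this report (request_id, cancelled, token counts, ttft).
from livekit.agents import MetricsCollectedEvent
from livekit.agents.metrics import LLMMetrics


@session.on("metrics_collected")  # `session` from the setup sketch above
def _on_metrics(ev: MetricsCollectedEvent) -> None:
    m = ev.metrics
    if isinstance(m, LLMMetrics):
        # With the bug, this prints twice per user turn, both with cancelled=False.
        print(
            f"request_id={m.request_id} cancelled={m.cancelled} "
            f"tokens={m.total_tokens} (prompt={m.prompt_tokens}, completion={m.completion_tokens}) "
            f"ttft={m.ttft:.3f}s"
        )
```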
### Operating System
Linux
### Models Used
OpenAI (gpt-4o-2024-11-20)
### Package Versions
livekit-agents version: 1.2.16 (also affects latest main branch)
### Session/Room/Call IDs
room ID: RM_HtuGWG9m6zdn
### Proposed Solution
### Cancel the asyncio task (Recommended)
1. Store the `asyncio.Task` reference in `_PreemptiveGeneration`:

```python
@dataclass
class _PreemptiveGeneration:
    speech_handle: SpeechHandle
    task: asyncio.Task[None]  # NEW: store the task reference
    # ...
```

2. Cancel the task in `_cancel_preemptive_generation`:

```python
def _cancel_preemptive_generation(self) -> None:
    if self._preemptive_generation is not None:
        self._preemptive_generation.speech_handle._cancel()
        # NEW: cancel the asyncio task as well
        if self._preemptive_generation.task and not self._preemptive_generation.task.done():
            self._preemptive_generation.task.cancel()
        self._preemptive_generation = None
```

3. Check for cancellation before emitting LLM metrics in `generation.py`:

```python
# Before emitting metrics
if speech_handle and speech_handle.interrupted:
    return  # don't emit metrics for cancelled generations

session.emit("metrics_collected", MetricsCollectedEvent(metrics=llm_metrics))
```
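For context on why both the `task.done()` check and the metrics guard matter, here is a small standalone asyncio illustration (not LiveKit code; the helper names are made up): cancelling a task only prevents work that is still in flight, while a task that has already finished is unaffected by `cancel()`, which is exactly the race described in "Actual Behavior".

```python
# Standalone asyncio illustration (not LiveKit code): cancellation only helps
# while the request is still in flight; a finished task ignores cancel().
import asyncio


async def fake_llm_request(delay: float) -> str:
    await asyncio.sleep(delay)  # stand-in for the streaming LLM call
    return "completion"


async def main() -> None:
    # In flight: cancel() takes effect, no completion (and no metrics) is produced.
    in_flight = asyncio.create_task(fake_llm_request(1.0))
    await asyncio.sleep(0.1)
    if not in_flight.done():
        in_flight.cancel()
    try:
        await in_flight
    except asyncio.CancelledError:
        print("in-flight request cancelled")

    # Already finished: cancel() is a no-op and the result (and cost) already exists,
    # the case the metrics guard in step 3 has to cover.
    finished = asyncio.create_task(fake_llm_request(0.05))
    await asyncio.sleep(0.1)
    finished.cancel()
    print("finished request still returns:", await finished)


asyncio.run(main())
```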
### Additional Context
- This issue was discovered while investigating unexpectedly high LLM token usage
- The second request often shows high cached token counts from OpenAI's prompt caching, confirming it's a duplicate of the first request
### Screenshots and Recordings
No response