preemptive_generation=True causes duplicate LLM requests and doubled token costs

javguitor opened this issue 1 month ago · 1 comment

Bug Description

When preemptive_generation=True is enabled in AgentSession, two separate LLM requests are made per user turn, both of which complete successfully (cancelled=False), resulting in doubled token consumption and API costs.
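
For reference, a minimal setup that exercises this path might look like the sketch below. The plugin choices and entrypoint wiring are illustrative, not taken from the affected deployment; the only relevant part is preemptive_generation=True on the session.

from livekit import agents
from livekit.agents import Agent, AgentSession
from livekit.plugins import deepgram, openai, silero


async def entrypoint(ctx: agents.JobContext):
    await ctx.connect()
    session = AgentSession(
        stt=deepgram.STT(),
        llm=openai.LLM(model="gpt-4o-2024-11-20"),
        tts=openai.TTS(),
        vad=silero.VAD.load(),
        preemptive_generation=True,  # the flag that triggers the duplicate requests
    )
    await session.start(
        agent=Agent(instructions="You are a helpful voice assistant."),
        room=ctx.room,
    )


if __name__ == "__main__":
    agents.cli.run_app(agents.WorkerOptions(entrypoint_fnc=entrypoint))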

Expected Behavior

According to the documentation:

When True, the agent sends inference calls as soon as a user transcript is received rather than waiting for a definitive turn boundary. This can reduce response latency by overlapping model inference with user audio, but may incur extra compute if the user interrupts or revises mid-utterance.

The expected behavior is:

  1. Preemptive generation starts when STT detects the end of a phrase
  2. When turn is confirmed:
    • If context unchanged: Reuse the preemptive generation (✅ works)
    • If context changed: Cancel the preemptive generation and make ONE new request

Actual Behavior

When the context/tools change between preemptive generation and turn confirmation:

  1. Preemptive generation completes successfully → Emits metrics with cancelled=False
  2. Context/tools change detected
  3. Code attempts to cancel preemptive generation (but it's already completed)
  4. New generation starts and completes successfully → Emits metrics with cancelled=False

Result: Two complete LLM requests with identical token counts, doubling the costs.
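
One common way the context ends up changed at confirmation time is a turn-completed hook that injects extra messages (for example RAG results) into the chat context. A hedged sketch of that pattern, assuming the Agent.on_user_turn_completed override from recent 1.x releases; lookup_docs is a hypothetical helper, not part of the library:

from livekit.agents import Agent, ChatContext, ChatMessage


class MyAgent(Agent):
    async def on_user_turn_completed(
        self, turn_ctx: ChatContext, new_message: ChatMessage
    ) -> None:
        # Runs when the turn is confirmed, i.e. after the preemptive generation
        # may already have started (or finished). Adding a message here makes the
        # confirmed context differ from the one used preemptively, so a second
        # LLM request is issued.
        retrieved = await lookup_docs(new_message.text_content or "")  # hypothetical helper
        turn_ctx.add_message(role="assistant", content=f"Relevant context: {retrieved}")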

Reproduction Steps

Here's a real example from production logs showing duplicate LLM metrics:

First Request:

  • request_id: chatcmpl-CkvaZ4BEPx98rb7xTqtcXCuuMDibA
  • cancelled: False
  • Tokens: 14858 (prompt: 14844, completion: 14, cached: 0)
  • TTFT: 0.890s
  • Duration: 0.96s

Second Request:

  • request_id: chatcmpl-Ckvb0fjHyUfEq3lHnnsBFrLyrDQqN
  • cancelled: False
  • Tokens: 14858 (prompt: 14847, completion: 11, cached: 14720)
  • TTFT: 0.532s
  • Duration: 0.60s

Notice:

  • Different request_id (two separate requests)
  • Both cancelled=False (both completed successfully)
  • Same total tokens (14858)
  • Second request has cached tokens (14720) from the first request
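
These figures come from the LLM metrics events emitted by the session. A handler along the following lines surfaces them, given the session object from the setup sketch above; the field names match the log fields shown here, but the exact import paths are my assumption:

from livekit.agents import MetricsCollectedEvent
from livekit.agents.metrics import LLMMetrics


@session.on("metrics_collected")
def _on_metrics(ev: MetricsCollectedEvent) -> None:
    m = ev.metrics
    if isinstance(m, LLMMetrics):
        # With preemptive_generation=True and a context change, this fires twice
        # per user turn, both times with cancelled=False.
        print(
            f"request_id={m.request_id} cancelled={m.cancelled} "
            f"prompt={m.prompt_tokens} completion={m.completion_tokens} "
            f"ttft={m.ttft:.3f}s duration={m.duration:.2f}s"
        )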

Operating System

Linux

Models Used

OpenAI (gpt-4o-2024-11-20)

Package Versions

livekit-agents version: 1.2.16 (also affects latest main branch)

Session/Room/Call IDs

room ID: RM_HtuGWG9m6zdn

Proposed Solution

### Cancel the asyncio task (Recommended)

# 1. Store the `asyncio.Task` reference in `_PreemptiveGeneration`:

@dataclass
class _PreemptiveGeneration:
    speech_handle: SpeechHandle
    task: asyncio.Task[None]  # NEW: Store the task reference
    # ...


# 2. Cancel the task in `_cancel_preemptive_generation`:

def _cancel_preemptive_generation(self) -> None:
    if self._preemptive_generation is not None:
        self._preemptive_generation.speech_handle._cancel()
        
        # NEW: Cancel the asyncio task
        if self._preemptive_generation.task and not self._preemptive_generation.task.done():
            self._preemptive_generation.task.cancel()
        
        self._preemptive_generation = None


# 3. Check for cancellation before emitting LLM metrics in `generation.py`:

# Before emitting metrics
if speech_handle and speech_handle.interrupted:
    return  # Don't emit metrics for cancelled generations

session.emit("metrics_collected", MetricsCollectedEvent(metrics=llm_metrics))

Additional Context

  • This issue was discovered while investigating unexpectedly high LLM token usage
  • The second request often shows high cached token counts from OpenAI's prompt caching, confirming it's a duplicate of the first request

Screenshots and Recordings

No response

javguitor · Dec 10 '25 08:12