preemptive_generation=True causes duplicate LLM requests and doubled token costs

javguitor opened this issue 1 month ago · 1 comment

Bug Description

When preemptive_generation=True is enabled in AgentSession, two separate LLM requests are made per user turn, both of which complete successfully (cancelled=False), resulting in doubled token consumption and API costs.
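
For reference, a minimal setup that exercises this path might look like the sketch below. The plugin choices and entrypoint wiring are illustrative, not taken from the affected deployment; the only relevant part is preemptive_generation=True on the session.

from livekit import agents
from livekit.agents import Agent, AgentSession
from livekit.plugins import deepgram, openai, silero


async def entrypoint(ctx: agents.JobContext):
    await ctx.connect()
    session = AgentSession(
        stt=deepgram.STT(),
        llm=openai.LLM(model="gpt-4o-2024-11-20"),
        tts=openai.TTS(),
        vad=silero.VAD.load(),
        preemptive_generation=True,  # the flag that triggers the duplicate requests
    )
    await session.start(
        agent=Agent(instructions="You are a helpful voice assistant."),
        room=ctx.room,
    )


if __name__ == "__main__":
    agents.cli.run_app(agents.WorkerOptions(entrypoint_fnc=entrypoint))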

Expected Behavior

According to the documentation:

When True, the agent sends inference calls as soon as a user transcript is received rather than waiting for a definitive turn boundary. This can reduce response latency by overlapping model inference with user audio, but may incur extra compute if the user interrupts or revises mid-utterance.

The expected behavior is:

  1. Preemptive generation starts when STT detects the end of a phrase
  2. When turn is confirmed:
    • If context unchanged: Reuse the preemptive generation (✅ works)
    • If context changed: Cancel the preemptive generation and make ONE new request

Actual Behavior

When the context/tools change between preemptive generation and turn confirmation:

  1. Preemptive generation completes successfully → Emits metrics with cancelled=False
  2. Context/tools change detected
  3. Code attempts to cancel preemptive generation (but it's already completed)
  4. New generation starts and completes successfully → Emits metrics with cancelled=False

Result: Two complete LLM requests with identical token counts, doubling the costs.
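
One common way the context ends up changed at confirmation time is a turn-completed hook that injects extra messages (for example RAG results) into the chat context. A hedged sketch of that pattern, assuming the Agent.on_user_turn_completed override from recent 1.x releases; lookup_docs is a hypothetical helper, not part of the library:

from livekit.agents import Agent, ChatContext, ChatMessage


class MyAgent(Agent):
    async def on_user_turn_completed(
        self, turn_ctx: ChatContext, new_message: ChatMessage
    ) -> None:
        # Runs when the turn is confirmed, i.e. after the preemptive generation
        # may already have started (or finished). Adding a message here makes the
        # confirmed context differ from the one used preemptively, so a second
        # LLM request is issued.
        retrieved = await lookup_docs(new_message.text_content or "")  # hypothetical helper
        turn_ctx.add_message(role="assistant", content=f"Relevant context: {retrieved}")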

Reproduction Steps

Here's a real example from production logs showing duplicate LLM metrics:

First Request:

  • request_id: chatcmpl-CkvaZ4BEPx98rb7xTqtcXCuuMDibA
  • cancelled: False
  • Tokens: 14858 (prompt: 14844, completion: 14, cached: 0)
  • TTFT: 0.890s
  • Duration: 0.96s

Second Request:

  • request_id: chatcmpl-Ckvb0fjHyUfEq3lHnnsBFrLyrDQqN
  • cancelled: False
  • Tokens: 14858 (prompt: 14847, completion: 11, cached: 14720)
  • TTFT: 0.532s
  • Duration: 0.60s

Notice:

  • Different request_id (two separate requests)
  • Both cancelled=False (both completed successfully)
  • Same total tokens (14858)
  • Second request has cached tokens (14720) from the first request
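
These figures come from the LLM metrics events emitted by the session. A handler along the following lines surfaces them, given the session object from the setup sketch above; the field names match the log fields shown here, but the exact import paths are my assumption:

from livekit.agents import MetricsCollectedEvent
from livekit.agents.metrics import LLMMetrics


@session.on("metrics_collected")
def _on_metrics(ev: MetricsCollectedEvent) -> None:
    m = ev.metrics
    if isinstance(m, LLMMetrics):
        # With preemptive_generation=True and a context change, this fires twice
        # per user turn, both times with cancelled=False.
        print(
            f"request_id={m.request_id} cancelled={m.cancelled} "
            f"prompt={m.prompt_tokens} completion={m.completion_tokens} "
            f"ttft={m.ttft:.3f}s duration={m.duration:.2f}s"
        )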

Operating System

Linux

Models Used

OpenAI (gpt-4o-2024-11-20)

Package Versions

livekit-agents version: 1.2.16 (also affects latest main branch)

Session/Room/Call IDs

room ID: RM_HtuGWG9m6zdn

Proposed Solution

### Cancel the asyncio task (Recommended)

# 1. Store the `asyncio.Task` reference in `_PreemptiveGeneration`:

@dataclass
class _PreemptiveGeneration:
    speech_handle: SpeechHandle
    task: asyncio.Task[None]  # NEW: Store the task reference
    # ...


# 2. Cancel the task in `_cancel_preemptive_generation`:

def _cancel_preemptive_generation(self) -> None:
    if self._preemptive_generation is not None:
        self._preemptive_generation.speech_handle._cancel()
        
        # NEW: Cancel the asyncio task
        if self._preemptive_generation.task and not self._preemptive_generation.task.done():
            self._preemptive_generation.task.cancel()
        
        self._preemptive_generation = None


# 3. Check for cancellation before emitting LLM metrics in `generation.py`:

# Before emitting metrics
if speech_handle and speech_handle.interrupted:
    return  # Don't emit metrics for cancelled generations

session.emit("metrics_collected", MetricsCollectedEvent(metrics=llm_metrics))

Additional Context

  • This issue was discovered while investigating unexpectedly high LLM token usage
  • The second request often shows high cached token counts from OpenAI's prompt caching, confirming it's a duplicate of the first request

Screenshots and Recordings

No response

javguitor · Dec 10 '25 08:12