Fixes #4388: Correct transcription_delay metric calculation in STT turn detec…
Summary
Fixes #4388
This PR fixes the incorrect transcription_delay metric calculation when using STT-based turn detection (e.g., Deepgram Flux).
Problem
When using STT turn detection mode, the transcription_delay metric incorrectly shows ~0 seconds instead of reflecting the actual transcription latency.
User-Reported Behavior:
"EOU metrics showing ~0.79 transcription_delay when should reflect actual processing time"
The metric should measure the time between when the user stopped speaking and when the transcript was received, but it was always returning near-zero values.
Root Cause
In audio_recognition.py, the transcription_delay is calculated as:
transcription_delay = max(last_final_transcript_time - last_speaking_time, 0)
The bug was in the STT END_OF_SPEECH handler (line 452), which overwrote _last_speaking_time with time.time():
elif ev.type == stt.SpeechEventType.END_OF_SPEECH and self._turn_detection_mode == "stt":
    ...
    self._last_speaking_time = time.time()  # BUG: Overwrites the value!
Event Timeline in STT Mode (Buggy):
- START_OF_SPEECH → _last_speaking_time = time.time() (correct)
- FINAL_TRANSCRIPT → _last_final_transcript_time = time.time() (correct)
- END_OF_SPEECH → _last_speaking_time = time.time() (BUG - overwrites!)
Since END_OF_SPEECH typically arrives shortly after FINAL_TRANSCRIPT in STT mode, both timestamps become nearly identical, resulting in transcription_delay ≈ 0.
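As a rough illustration of the buggy timeline above, the sketch below applies the quoted formula to made-up timestamps. The function name and the timestamp values are hypothetical and are not taken from audio_recognition.py; only the max(..., 0) formula comes from this description.

```python
def transcription_delay(last_final_transcript_time: float, last_speaking_time: float) -> float:
    """Formula quoted in this PR description; negative gaps are clamped to zero."""
    return max(last_final_transcript_time - last_speaking_time, 0.0)


# Hypothetical timestamps (in seconds) for a single user turn in STT mode.
start_of_speech_at = 100.00
final_transcript_at = 101.20
end_of_speech_at = 101.25  # arrives shortly after the final transcript

# Replay the buggy event order: the last assignment wins.
last_speaking_time = start_of_speech_at          # START_OF_SPEECH
last_final_transcript_time = final_transcript_at  # FINAL_TRANSCRIPT
last_speaking_time = end_of_speech_at            # END_OF_SPEECH overwrite (the bug)

buggy_delay = transcription_delay(last_final_transcript_time, last_speaking_time)
print(buggy_delay)  # the difference is negative here, so it is clamped to 0.0
```

Because the overwrite happens after the final transcript is stamped, the difference is zero or slightly negative, and the clamp hides it as a delay of 0.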
Solution
Remove the line that overwrites _last_speaking_time at END_OF_SPEECH in STT mode. The value was already correctly set at START_OF_SPEECH.
Comparison with VAD Mode:
VAD mode does NOT update _last_speaking_time at END_OF_SPEECH - it keeps the value from the last INFERENCE_DONE event. STT mode should follow the same pattern.
After Fix:
- START_OF_SPEECH → _last_speaking_time = time.time() (preserved)
- FINAL_TRANSCRIPT → _last_final_transcript_time = time.time()
- END_OF_SPEECH → No overwrite
Result: transcription_delay = last_final_transcript_time - last_speaking_time now correctly represents the actual transcription latency.
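The fixed behavior can be sketched with a small stand-in class. SttTimingTracker and its method names are hypothetical and exist only for this illustration; the real handler lives in audio_recognition.py and is structured differently. Timestamps are passed in explicitly so the result is deterministic.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SttTimingTracker:
    """Hypothetical stand-in for the two timing fields discussed in this PR."""

    last_speaking_time: Optional[float] = None
    last_final_transcript_time: Optional[float] = None

    def on_start_of_speech(self, now: float) -> None:
        # Each new speech segment refreshes the speaking timestamp.
        self.last_speaking_time = now

    def on_final_transcript(self, now: float) -> None:
        self.last_final_transcript_time = now

    def on_end_of_speech(self, now: float) -> None:
        # After the fix: the speaking timestamp is deliberately left untouched.
        pass

    def transcription_delay(self) -> float:
        if self.last_speaking_time is None or self.last_final_transcript_time is None:
            return 0.0
        return max(self.last_final_transcript_time - self.last_speaking_time, 0.0)


tracker = SttTimingTracker()
tracker.on_start_of_speech(100.00)
tracker.on_final_transcript(101.20)
tracker.on_end_of_speech(101.25)

delay = tracker.transcription_delay()
print(f"{delay:.2f}")  # roughly 1.20 with these made-up timestamps
```

A check along these lines, driven by injected timestamps rather than time.time(), could also serve as a regression test for the metric, since none of the existing tests listed below are described as asserting on transcription_delay directly.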
Testing
All 15 existing agent session tests pass:
tests/test_agent_session.py::test_events_and_metrics PASSED
tests/test_agent_session.py::test_tool_call PASSED
tests/test_agent_session.py::test_interruption[False-5.5] PASSED
tests/test_agent_session.py::test_interruption[True-5.5] PASSED
tests/test_agent_session.py::test_interruption_options PASSED
tests/test_agent_session.py::test_interruption_by_text_input PASSED
tests/test_agent_session.py::test_interruption_before_speaking[False-3.5] PASSED
tests/test_agent_session.py::test_interruption_before_speaking[True-3.5] PASSED
tests/test_agent_session.py::test_generate_reply PASSED
tests/test_agent_session.py::test_preemptive_generation[True-0.8] PASSED
tests/test_agent_session.py::test_preemptive_generation[False-1.1] PASSED
tests/test_agent_session.py::test_interrupt_during_on_user_turn_completed[False-0.0] PASSED
tests/test_agent_session.py::test_interrupt_during_on_user_turn_completed[False-2.0] PASSED
tests/test_agent_session.py::test_interrupt_during_on_user_turn_completed[True-0.0] PASSED
tests/test_agent_session.py::test_interrupt_during_on_user_turn_completed[True-2.0] PASSED
======================== 15 passed in 75.96s ========================
Backward Compatibility
No breaking changes - This fix only corrects the metric calculation. The actual agent behavior (speech recognition, turn detection, interruption handling) is completely unchanged.
Expected Impact:
- Users with STT turn detection will now see accurate transcription_delay values in their metrics
- Dashboards showing this metric will now report correct latency (previously under-reported as ~0)
Edge Cases Handled
- No VAD present - Already handled at lines 376-382, falls back to STT timestamps
- Multiple speech segments - START_OF_SPEECH updates _last_speaking_time for each new segment
- Preflight transcripts - Also update _last_final_transcript_time correctly
- VAD mode unchanged - Fix only affects STT turn detection mode
Files Changed
livekit-agents/livekit/agents/voice/audio_recognition.py
- Removed the buggy self._last_speaking_time = time.time() line from END_OF_SPEECH handler
- Added explanatory comment documenting why we don't update the timestamp here
Related Issues
- Issue #4325: min_endpointing_delay behavior differences between VAD and STT modes (related timing inconsistency)