Fixes #4388: Correct transcription_delay metric calculation in STT turn detec…

Open devbyteai opened this issue 3 weeks ago • 0 comments

Summary

Fixes #4388

This PR fixes the incorrect transcription_delay metric calculation when using STT-based turn detection (e.g., Deepgram Flux).

Problem

When using STT turn detection mode, the transcription_delay metric incorrectly shows ~0 seconds instead of reflecting the actual transcription latency.

User-Reported Behavior:

"EOU metrics showing ~0.79 transcription_delay when should reflect actual processing time"

The metric should measure the time between when the user stopped speaking and when the transcript was received, but it was always returning near-zero values.

Root Cause

In audio_recognition.py, the transcription_delay is calculated as:

transcription_delay = max(last_final_transcript_time - last_speaking_time, 0)
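As a rough illustration of how the clamp in this formula behaves, the sketch below wraps the expression in a standalone function. The function name and the sample timestamps are made up for this example and are not taken from audio_recognition.py.

```python
def transcription_delay(last_final_transcript_time: float, last_speaking_time: float) -> float:
    """Illustrative helper: same expression as above, clamped so it never goes negative."""
    return max(last_final_transcript_time - last_speaking_time, 0.0)


# Transcript arrives 0.5 s after the recorded speaking time: the delay is reported as-is.
print(transcription_delay(12.5, 12.0))   # 0.5

# Recorded speaking time is later than the transcript time: the negative
# difference is clamped, so the metric reads exactly zero.
print(transcription_delay(12.0, 12.25))  # 0.0
```

The second call is the situation that matters for this bug: whenever the speaking timestamp ends up later than the transcript timestamp, the clamp hides the negative difference and the metric reads 0.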

The bug was in the STT END_OF_SPEECH handler (line 452), which overwrote _last_speaking_time with time.time():

elif ev.type == stt.SpeechEventType.END_OF_SPEECH and self._turn_detection_mode == "stt":
    ...
    self._last_speaking_time = time.time()  # BUG: Overwrites the value!

Event Timeline in STT Mode (Buggy):

START_OF_SPEECH → _last_speaking_time = time.time() (correct)
FINAL_TRANSCRIPT → _last_final_transcript_time = time.time() (correct)
END_OF_SPEECH → _last_speaking_time = time.time() (BUG - overwrites!)

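Since END_OF_SPEECH typically arrives shortly after FINAL_TRANSCRIPT in STT mode, the overwritten _last_speaking_time ends up equal to or slightly later than _last_final_transcript_time. The difference is therefore near zero or negative, and the max(..., 0) clamp turns it into transcription_delay ≈ 0.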

Solution

Remove the line that overwrites _last_speaking_time at END_OF_SPEECH in STT mode. The value was already correctly set at START_OF_SPEECH.

Comparison with VAD Mode: VAD mode does NOT update _last_speaking_time at END_OF_SPEECH; it keeps the value from the last INFERENCE_DONE event. STT mode should follow the same pattern.

After Fix:

START_OF_SPEECH → _last_speaking_time = time.time() (preserved)
FINAL_TRANSCRIPT → _last_final_transcript_time = time.time()
END_OF_SPEECH → No overwrite

Result: transcription_delay = max(last_final_transcript_time - last_speaking_time, 0) is now computed from the timestamp set at START_OF_SPEECH rather than one overwritten at END_OF_SPEECH, so it no longer collapses to ~0.
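The two timelines can be replayed side by side with a small simulation. This is only a sketch: the replay function, the string event names, and the timestamps are hypothetical stand-ins for the handler logic described above, not the actual code in audio_recognition.py.

```python
from typing import Iterable, Tuple

Event = Tuple[str, float]  # (event type, arrival time in seconds)


def replay(events: Iterable[Event], overwrite_on_end_of_speech: bool) -> float:
    """Replay events and return the resulting transcription_delay.

    overwrite_on_end_of_speech=True mimics the buggy handler,
    False mimics the handler after the fix.
    """
    last_speaking_time = 0.0
    last_final_transcript_time = 0.0
    for kind, now in events:
        if kind == "START_OF_SPEECH":
            last_speaking_time = now
        elif kind == "FINAL_TRANSCRIPT":
            last_final_transcript_time = now
        elif kind == "END_OF_SPEECH" and overwrite_on_end_of_speech:
            # The line removed by this PR.
            last_speaking_time = now
    return max(last_final_transcript_time - last_speaking_time, 0.0)


timeline = [
    ("START_OF_SPEECH", 10.0),
    ("FINAL_TRANSCRIPT", 12.0),
    ("END_OF_SPEECH", 12.25),
]

print(f"before fix: {replay(timeline, overwrite_on_end_of_speech=True)}")   # before fix: 0.0
print(f"after fix: {replay(timeline, overwrite_on_end_of_speech=False)}")   # after fix: 2.0
```

With the overwrite in place, the speaking timestamp (12.25) is later than the transcript timestamp (12.0), so the clamp yields 0.0. Without it, the value set at START_OF_SPEECH survives and the difference is 2.0.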

Testing

All 15 existing agent session tests pass:

tests/test_agent_session.py::test_events_and_metrics PASSED
tests/test_agent_session.py::test_tool_call PASSED
tests/test_agent_session.py::test_interruption[False-5.5] PASSED
tests/test_agent_session.py::test_interruption[True-5.5] PASSED
tests/test_agent_session.py::test_interruption_options PASSED
tests/test_agent_session.py::test_interruption_by_text_input PASSED
tests/test_agent_session.py::test_interruption_before_speaking[False-3.5] PASSED
tests/test_agent_session.py::test_interruption_before_speaking[True-3.5] PASSED
tests/test_agent_session.py::test_generate_reply PASSED
tests/test_agent_session.py::test_preemptive_generation[True-0.8] PASSED
tests/test_agent_session.py::test_preemptive_generation[False-1.1] PASSED
tests/test_agent_session.py::test_interrupt_during_on_user_turn_completed[False-0.0] PASSED
tests/test_agent_session.py::test_interrupt_during_on_user_turn_completed[False-2.0] PASSED
tests/test_agent_session.py::test_interrupt_during_on_user_turn_completed[True-0.0] PASSED
tests/test_agent_session.py::test_interrupt_during_on_user_turn_completed[True-2.0] PASSED

======================== 15 passed in 75.96s ========================

Backward Compatibility

No breaking changes. This fix only corrects the metric calculation; the actual agent behavior (speech recognition, turn detection, interruption handling) is unchanged.

Expected Impact:

Users with STT turn detection will now see accurate transcription_delay values in their metrics
Dashboards showing this metric will now report correct latency (previously under-reported as ~0)

Edge Cases Handled

No VAD present - Already handled at lines 376-382, falls back to STT timestamps
Multiple speech segments - START_OF_SPEECH updates _last_speaking_time for each new segment
Preflight transcripts - Also update _last_final_transcript_time correctly
VAD mode unchanged - Fix only affects STT turn detection mode
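The "multiple speech segments" case can be sketched the same way. The example below assumes, purely for illustration, that the metric is read at each END_OF_SPEECH; the function name, event strings, and timestamps are hypothetical and not part of the library.

```python
from typing import Iterable, List, Tuple

Event = Tuple[str, float]  # (event type, arrival time in seconds)


def delays_per_segment(events: Iterable[Event]) -> List[float]:
    """Return one transcription_delay per segment, using the post-fix behavior
    (END_OF_SPEECH does not touch last_speaking_time)."""
    delays: List[float] = []
    last_speaking_time = 0.0
    last_final_transcript_time = 0.0
    for kind, now in events:
        if kind == "START_OF_SPEECH":
            # Each new segment resets the speaking timestamp.
            last_speaking_time = now
        elif kind == "FINAL_TRANSCRIPT":
            last_final_transcript_time = now
        elif kind == "END_OF_SPEECH":
            # Assumption for this sketch: the metric is sampled here.
            delays.append(max(last_final_transcript_time - last_speaking_time, 0.0))
    return delays


two_segments = [
    ("START_OF_SPEECH", 0.0),
    ("FINAL_TRANSCRIPT", 1.5),
    ("END_OF_SPEECH", 1.75),
    ("START_OF_SPEECH", 4.0),
    ("FINAL_TRANSCRIPT", 5.0),
    ("END_OF_SPEECH", 5.25),
]

print(delays_per_segment(two_segments))  # [1.5, 1.0]
```

Each segment is measured from its own START_OF_SPEECH, so the second value does not carry over any timing from the first segment.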

Files Changed

livekit-agents/livekit/agents/voice/audio_recognition.py

Removed the buggy self._last_speaking_time = time.time() line from END_OF_SPEECH handler
Added explanatory comment documenting why we don't update the timestamp here

Related Issues

Issue #4325: min_endpointing_delay behavior differences between VAD and STT modes (related timing inconsistency)

Dec 26 '25 15:12 devbyteai