Agent transcription missing in conversation_item_added
Hello livekit community,
I'm trying to capture transcriptions of a user's outbound/inbound call through the telephony integration. When I hook into the conversation_item_added event, I only seem to get the user content transcribed; when the assistant item is emitted, its content is empty. Right now I have a workaround that intercepts the agent text in transcription_node (sketched below) and stitches the two streams back together, but that is clunky and produces less-than-ideal transcripts, since from what I've observed the agent sometimes doesn't say exactly what goes through the transcription_node pipeline.
example:
user conversation item:
type='conversation_item_added' item=ChatMessage(id='item_aa077ed2b7bf', type='message', role='user', content=['Third prompt, please.'], interrupted=False, hash=None)
agent conversation item:
type='conversation_item_added' item=ChatMessage(id='item_291b8e0c549f', type='message', role='assistant', content=[''], interrupted=False, hash=None)
I am on a call with the agent while this is happening and it speaks properly, so there's no issue there. There are no interruptions either (unlike the bug one of your devs reported about two weeks ago, where the content would sometimes not be registered if the agent was interrupted). Also, in the transcripts I build using the transcription node, the agent speech does get captured. Am I doing something wrong? I'm running version 1.0.19 of the plugins and 1.0.6 of livekit.
Any help would be appreciated.
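For reference, the workaround mentioned above currently looks roughly like this inside my agent class (a simplified sketch; the transcription_node signature is as I understand it in agents 1.x, and register_agent_input is my own helper):

async def transcription_node(self, text, model_settings):
    # collect the agent's outgoing text while passing it through unchanged
    collected: list[str] = []
    async for chunk in text:
        collected.append(chunk)
        yield chunk
    self.transcription_service.register_agent_input("".join(collected))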
I also get this warning beforehand, not sure if that's relevant:
2025-05-06 12:53:16,320 - WARNING livekit.agents - _SegmentSynchronizerImpl.playback_finished called before text/audio input is done {"pid": 34776, "job_id": "AJ_pG2PwtAGWnqE"}
conversation item added
type='conversation_item_added' item=ChatMessage(id='item_ffb33f0fe922', type='message', role='assistant', content=[''], interrupted=False, hash=None, created_at=1746528803.5851429)
what models are you using? can you share your agent init code?
Sure, here are the session & agent init snippets from inside the entrypoint:
# getting user cell number, lookup client info in db
# preparing userdata object
...
session = AgentSession[UserData](
    userdata=userdata,
    turn_detection="vad",
    vad=silero.VAD.load(),
    stt=deepgram.STT(),
    tts=elevenlabs.TTS(
        voice_id="cgSgspJ2msm6clMCkdW9",  # jessica
        model="eleven_flash_v2_5",
        voice_settings=VoiceSettings(
            speed=1,
            stability=0.9,
            similarity_boost=1,
        ),
    ),
    llm=openai.LLM(model="gpt-4o"),
)
transcription = CallTranscription(name=full_name, db_session=db_session, client_id=client_data["id"])
agent = OutboundCaller(transcription_service=transcription)
....
asyncio.create_task(
    session.start(
        agent=agent,
        room=ctx.room,
    )
)
...
...
if not is_inbound:
    try:
        await ctx.api.sip.create_sip_participant(
            room_name=ctx.room.name,
            sip_trunk_id=OUTBOUND_TRUNK_ID,
            sip_call_to=phone_number,
            participant_identity=PARTICIPANT_IDENTITY,
        )
        participant = await ctx.wait_for_participant(identity=PARTICIPANT_IDENTITY)
        agent.set_participant(participant)
        ....
Agent definition:
class OutboundCaller(Agent):
    def __init__(
        self,
        *,
        transcription_service: CallTranscription,  # my custom written transcription service
    ):
        super().__init__(
            instructions=load_prompt("default_prompt.yaml")
        )
        self.transcription_service = transcription_service
        self.participant: rtc.RemoteParticipant | None = None
        self.chosen_prompt: Tuple[int, str] | None = None
    # rest of agent code ....
CallTranscription is a simple wrapper class that exposes register_user_input and register_agent_input and handles the file/db IO for me. I call these in transcription_node and in a @session.on("user_input_transcribed") handler. My goal would be to just listen for conversation_item_added, switch on the role, and call one of these methods with the item's content.
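Roughly what I'd like to end up with (a minimal sketch only; the role/content fields are taken from the log output above, and the register_* calls are my own CallTranscription methods):

@session.on("conversation_item_added")
def on_conversation_item_added(event):
    item = event.item
    # item.content is a list; keep only the plain string parts
    text = " ".join(part for part in item.content if isinstance(part, str))
    if item.role == "user":
        transcription.register_user_input(text)
    elif item.role == "assistant":
        transcription.register_agent_input(text)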
conversation_item_added works on my end with this example: https://github.com/livekit/agents/blob/main/examples/voice_agents/basic_agent.py. Can you try that one, or share a minimal example that reproduces this issue?
I remember a PR not long ago where the content would be empty if the audio is not played. And the warning "_SegmentSynchronizerImpl.playback_finished called before text/audio input is done" seems to indicate that 1) the played time might be zero and 2) the text therefore ends up empty, which is one possible explanation.
@ChenghaoMou I think this might be the culprit, since this warning is only raised when the agent speaks, and it appears almost simultaneously with the event that carries the empty transcript.
https://github.com/livekit/agents/blob/80648a91908d1dd10ad818fe72e7f217e8bdd1d2/livekit-agents/livekit/agents/voice/transcription/synchronizer.py#L254 this comment also suggests that in order to receive the full transcript, the audio needs to be both done and not interrupted, which it isn't in my case.
I'm not sure I understand the problem, though. I can observe the same behavior even if I don't speak, so there's no accidental interruption. At the same time, I can have a very nice, lengthy conversation with the agent over the phone, so the audio does finish playing, at least from the user's naive perspective. I don't work with speech handles, and none of my explicit calls to say or generate_reply are interrupted or otherwise messed with on my side. Maybe I'm missing something, but I don't quite understand the "playback_finished called before text/audio input is done" warning.
Do you have an idea how I could work around this?
@longcw I can try setting up a minimal example that reproduces the issue in the coming days and will ping you once I do. In the meantime, it would be great if you could help me understand the warning better so I can try debugging this.
The easiest way I can think of to debug this is to put print/raise/set_trace calls in your installed livekit code for that mark_playback_finished function. Check where the caller comes from (stack trace) and what parameters are used. You can even change the final flag to see if the transcript gets returned properly.
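For example, something along these lines pasted at the top of mark_playback_finished in the installed synchronizer.py (a throwaway debugging sketch; the exact file path and local variable names may differ between versions):

# temporary instrumentation inside mark_playback_finished in your installed
# livekit/agents/voice/transcription/synchronizer.py
import traceback
print("mark_playback_finished state:", locals())  # playback position, interrupted flag, ...
traceback.print_stack(limit=10)                   # shows which caller triggered it
# import pdb; pdb.set_trace()                     # or drop into a debugger here instead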
The transcript doesn't get computed properly. In mark_playback_finished, _text_input is not done; that would require end_text_input to be called, which it seems never happens. The _TextData object inside mark_playback_finished always has pushed_text and forwarded_text as empty strings, so it's not a matter of the final flag either, and synchronized_transcript is None. Also, it seems like the async loop in voice/transcription/synchronizer.py L297 is never run? Is it possible that I'm not getting the sentence stream?
playback_position corresponds to the time it took the audio to play, so I think the audio part of the stream is fine in this case.
can you share what your agent config looks like? which models?
Is this fixed @davidzhao ? which version do we need to use?
We cannot reproduce it; you can use the latest version. Or could you share a reproducible example if you hit the issue on a particular version?
I’m seeing the same issue. It happens mainly in the interrupts flow (when the user interrupts the agent). There may be logic errors that cause thread blocking, which makes the audio fed into STT miss buffered data, resulting in incorrect transcripts. I have recordings and I built ring buffers to capture the input; from that it’s clear the user’s voice could have produced a complete transcript.
Polished for a GitHub issue
Title: Incorrect STT transcript during user interruption due to thread blocking / buffer underflow
Description: I'm encountering the same issue, primarily in the interrupts flow (when a user interrupts the agent). A logic error appears to cause thread blocking, which leads to missing audio buffers in the STT input and therefore incorrect transcripts.
What I observed
- When the user interrupts the agent, the STT pipeline receives incomplete audio (buffer underflow/missed frames).
- Resulting transcripts are truncated or wrong. (I changed the params to silero.VAD.load(prefix_padding_duration=1); that only duplicated data in my ring buffers, so I think my code and evidence are correct.)
Evidence
- I recorded the audio and implemented ring buffers to capture the input segments being fed to STT (the kind of tap sketched after this list). The captured audio shows the user's voice is complete, so the loss seems to occur before/within the STT feed rather than at capture.
Suspected cause
- Logic in the interrupt flow triggers thread blocking/contention, which prevents timely enqueueing of audio buffers to STT.
Expected behavior
- Interrupts should be handled without blocking the audio pipeline; STT should receive the full buffered audio and produce a correct transcript.
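A minimal sketch of the ring-buffer tap mentioned in the evidence above (frame handling and how frames reach STT here are assumptions about my setup, not the library's API):

import collections

class AudioTap:
    """Keep the most recent audio frames on their way into STT for later inspection."""

    def __init__(self, max_frames: int = 500):
        self._frames = collections.deque(maxlen=max_frames)

    def push(self, frame_bytes: bytes) -> None:
        # call this wherever raw audio is forwarded to the STT stream
        self._frames.append(frame_bytes)

    def dump(self) -> bytes:
        # concatenate the retained frames, e.g. to write them out as a WAV for comparison
        return b"".join(self._frames)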