Agent Text stream not aligned with Avatar video for LiveAvatar

Open rida-amir opened this issue 3 weeks ago • 0 comments

Bug Description

The agent text stream (transcription) is not synchronized with the LiveAvatar’s video and voice during streaming. The transcript appears before the avatar’s corresponding voice and video, instead of being displayed in step with the speech as the video is streamed. Other avatars (e.g., Tavus and Anam) correctly support synchronized text, video, and audio playback. LiveAvatar should behave the same way, with the agent transcription aligned to the avatar’s spoken words as they occur.

Expected Behavior

The agent’s transcription should be synchronized with the LiveAvatar’s video and voice in real time. As the avatar speaks during streaming, the corresponding text should appear simultaneously and progress in alignment with the spoken audio, consistent with the behavior observed in other avatars such as Tavus and Anam.

Reproduction Steps

1. Launch a session using LiveAvatar.
2. Start streaming the agent response (video + voice).
3. Enable or observe the agent text stream / transcription.
4. Notice that the displayed text does not align in real time with the avatar’s spoken audio and video: the text appears first, and the matching audio and video arrive later.
5. Compare the same flow using other avatars (e.g., Tavus or Anam), where the transcription appears synchronized with the avatar’s speech.
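To make the comparison in steps 4 and 5 measurable rather than visual, the lead of the text over the audio can be computed per segment. This is only a sketch: `transcript_lead` is a hypothetical helper, not part of livekit-agents, and it assumes you have logged, for each transcript segment, the wall-clock time it was displayed and the time the matching audio was heard.

```python
def transcript_lead(text_shown_at: list[float], audio_heard_at: list[float]) -> list[float]:
    """Return the per-segment lead in seconds.

    A positive value means the text was displayed before the matching
    audio was heard, which is the behavior reported in this issue.
    Both lists are hand-collected timestamps (an assumption), one
    entry per transcript segment, in the same order.
    """
    if len(text_shown_at) != len(audio_heard_at):
        raise ValueError("both lists need one timestamp per segment")
    return [audio - text for text, audio in zip(text_shown_at, audio_heard_at)]
```

A lead close to zero for Tavus or Anam and a clearly positive lead for LiveAvatar would confirm the mismatch described above.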

Operating System

Windows 11

Models Used

OpenAI models via Azure: STT is gpt-4o-transcribe, LLM is gpt-4o, and TTS is gpt-4o-mini-tts.

Package Versions

livekit-agents = 1.3.9
python>=3.13
default/latest livekit version

Session/Room/Call IDs

No response

Proposed Solution

Introduce a synchronization or buffering mechanism between the agent text stream and the LiveAvatar’s audio/video streams. The text output should be buffered and released based on audio/video playback timestamps, ensuring the transcription is displayed only when the corresponding speech is rendered by the avatar. This should mirror the synchronization logic already used for other avatars (e.g., Tavus and Anam) to maintain consistent real-time alignment across all avatar types.
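A minimal sketch of the buffering idea, to illustrate what is being asked for. Everything here is hypothetical: `TranscriptSyncBuffer`, `push`, `release`, and `flush` are illustrative names, not livekit-agents or LiveAvatar APIs. It also assumes two things that may not hold today: that each transcript segment carries a start time relative to the utterance audio, and that the avatar side can report its current playback position.

```python
from __future__ import annotations

from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class TimedText:
    text: str
    start: float  # seconds from the start of the utterance audio (assumed available)


class TranscriptSyncBuffer:
    """Holds transcript segments until the avatar's playback reaches them."""

    def __init__(self) -> None:
        self._pending: deque[TimedText] = deque()

    def push(self, text: str, start: float) -> None:
        # Segments are assumed to arrive in non-decreasing start order.
        self._pending.append(TimedText(text, start))

    def release(self, playback_position: float) -> list[str]:
        """Return, in order, every segment whose start time has been reached."""
        ready: list[str] = []
        while self._pending and self._pending[0].start <= playback_position:
            ready.append(self._pending.popleft().text)
        return ready

    def flush(self) -> list[str]:
        """Return all remaining segments, e.g. when playback finishes."""
        remaining = [segment.text for segment in self._pending]
        self._pending.clear()
        return remaining
```

In use, the agent would call `push` as text arrives from the LLM/TTS pipeline, and the code that forwards transcription to the client would call `release` with the avatar's reported playback position on each update, then `flush` once the utterance ends. If playback is interrupted, the pending segments could be discarded instead of flushed so that unspoken text is never shown.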

Additional Context

No response

Screenshots and Recordings

No response

rida-amir, Dec 29 '25 10:12