Bug Report: Transcript Out of Order in Case of User Interruption [ElevenLabs]

Open rahultyl opened this issue 9 months ago • 6 comments

Issue Summary

We have observed that the transcript is stored out of order whenever the voice bot faces multiple interruptions in quick succession.

Observed Behavior

Because of the repeated interruptions, multiple LLM generations occur. While some word timestamps are still queued for processing, a new LLM response generation resets the cumulative time. As a result, the timestamps calculated for words in the current text are sometimes lower than timestamps already assigned to earlier words, so the words are positioned incorrectly in the transcript.
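A minimal sketch of this mechanism, not pipecat's actual code: `to_pts` and `INITIAL_WORD_TIMESTAMP` are hypothetical names, the numbers are taken from the word timestamps log in this report, and it assumes the transcript ends up ordered by pts. Each word's pts is modelled as the initial word timestamp plus a per-response relative timestamp, so when the relative timestamp restarts near zero, the two responses interleave.

```python
# Hypothetical model of the behaviour described above; not pipecat's implementation.
INITIAL_WORD_TIMESTAMP = 96_828_192_417  # ns, the same for both responses in the log


def to_pts(relative_ns: int, initial_ns: int = INITIAL_WORD_TIMESTAMP) -> int:
    """Presentation timestamp = initial word timestamp + relative word timestamp."""
    return initial_ns + relative_ns


# (word, relative timestamp in ns) for two consecutive LLM responses.
first_response = [("I", 406_000_000), ("can", 638_000_000), ("help", 859_000_000)]
# The cumulative time has restarted, so these relative timestamps are small again.
second_response = [("Hi!", 441_000_000), ("I'm", 592_000_000)]

words = [(word, to_pts(ts)) for word, ts in first_response + second_response]

# Assumption: the stored transcript is ordered by pts.
stored = [word for word, _ in sorted(words, key=lambda item: item[1])]
print(stored)  # ['I', 'Hi!', "I'm", 'can', 'help']
```

The words of the second response land between the words of the first one, which matches the jumbled transcript in the example.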

Example

Stored Transcript (Incorrect Order):
I can help you schedu le an appointment Hi! I'm here to help. in either Let's schedule your a ppointment
Expected Transcript (Correct Order):
I can help you schedule an appointment in either the morning or evening.
Do you have a preference for one over the other?
Hi!
I'm here to help.

Word Timestamps Log (Example Data)

2025-03-12 12:24:23.887 | INFO     | pipecat.services.ai_services:_words_task_handler:468 - Word Timestamps: word='I' frame.pts=97234192417 self._initial_word_timestamp=96828192417 timestamp=406000000
2025-03-12 12:24:23.887 | INFO     | pipecat.services.ai_services:_words_task_handler:468 - Word Timestamps: word='can' frame.pts=97466192417 self._initial_word_timestamp=96828192417 timestamp=638000000
2025-03-12 12:24:23.887 | INFO     | pipecat.services.ai_services:_words_task_handler:468 - Word Timestamps: word='help' frame.pts=97687192417 self._initial_word_timestamp=96828192417 timestamp=859000000
2025-03-12 12:24:23.887 | INFO     | pipecat.services.ai_services:_words_task_handler:468 - Word Timestamps: word='you' frame.pts=97826192417 self._initial_word_timestamp=96828192417 timestamp=998000000
2025-03-12 12:24:23.887 | INFO     | pipecat.services.ai_services:_words_task_handler:468 - Word Timestamps: word='schedu' frame.pts=98105192417 self._initial_word_timestamp=96828192417 timestamp=1277000000

2025-03-12 12:24:24.015 | INFO     | pipecat.services.ai_services:_words_task_handler:468 - Word Timestamps: word='le' frame.pts=98151192417 self._initial_word_timestamp=96828192417 timestamp=1323000000
2025-03-12 12:24:24.015 | INFO     | pipecat.services.ai_services:_words_task_handler:468 - Word Timestamps: word='an' frame.pts=98256192417 self._initial_word_timestamp=96828192417 timestamp=1428000000
2025-03-12 12:24:24.015 | INFO     | pipecat.services.ai_services:_words_task_handler:468 - Word Timestamps: word='appointment' frame.pts=98825192417 self._initial_word_timestamp=96828192417 timestamp=1997000000
2025-03-12 12:24:24.016 | INFO     | pipecat.services.ai_services:_words_task_handler:468 - Word Timestamps: word='in' frame.pts=98976192417 self._initial_word_timestamp=96828192417 timestamp=2148000000
2025-03-12 12:24:24.016 | INFO     | pipecat.services.ai_services:_words_task_handler:468 - Word Timestamps: word='either' frame.pts=99266192417 self._initial_word_timestamp=96828192417 timestamp=2438000000
2025-03-12 12:24:24.016 | INFO     | pipecat.services.ai_services:_words_task_handler:468 - Word Timestamps: word='the' frame.pts=99405192417 self._initial_word_timestamp=96828192417 timestamp=2577000000
2025-03-12 12:24:24.016 | INFO     | pipecat.services.ai_services:_words_task_handler:468 - Word Timestamps: word='morning' frame.pts=99754192417 self._initial_word_timestamp=96828192417 timestamp=2926000000
2025-03-12 12:24:24.016 | INFO     | pipecat.services.ai_services:_words_task_handler:468 - Word Timestamps: word='or' frame.pts=99870192417 self._initial_word_timestamp=96828192417 timestamp=3042000000
2025-03-12 12:24:24.016 | INFO     | pipecat.services.ai_services:_words_task_handler:468 - Word Timestamps: word='evening.' frame.pts=100323192417 self._initial_word_timestamp=96828192417 timestamp=3495000000

At this point the cumulative time is reset to 0, so frame.pts for the word ‘Hi!’ (97269192417) is lower than frame.pts for the word ‘in’ (98976192417), which puts the transcript out of order.

2025-03-12 12:24:25.604 | INFO     | pipecat.services.ai_services:_words_task_handler:468 - Word Timestamps: word='Hi!' frame.pts=97269192417 self._initial_word_timestamp=96828192417 timestamp=441000000


2025-03-12 12:24:25.861 | INFO     | pipecat.services.ai_services:_words_task_handler:468 - Word Timestamps: word='I'm' frame.pts=97420192417 self._initial_word_timestamp=96828192417 timestamp=592000000
2025-03-12 12:24:25.861 | INFO     | pipecat.services.ai_services:_words_task_handler:468 - Word Timestamps: word='here' frame.pts=97641192417 self._initial_word_timestamp=96828192417 timestamp=813000000
2025-03-12 12:24:25.861 | INFO     | pipecat.services.ai_services:_words_task_handler:468 - Word Timestamps: word='to' frame.pts=97745192417 self._initial_word_timestamp=96828192417 timestamp=917000000
2025-03-12 12:24:25.861 | INFO     | pipecat.services.ai_services:_words_task_handler:468 - Word Timestamps: word='help.' frame.pts=98047192417 self._initial_word_timestamp=96828192417 timestamp=1219000000
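A small check along these lines can flag the problem directly from the log. This is only a sketch: `find_regressions` is a hypothetical helper, and the entries are a subset of the frame.pts values logged above, in arrival order.

```python
from typing import Iterable, List, Tuple


def find_regressions(entries: Iterable[Tuple[str, int]]) -> List[str]:
    """Return words whose pts is lower than the highest pts seen before them."""
    highest = -1
    regressed = []
    for word, pts in entries:
        if pts < highest:
            regressed.append(word)
        else:
            highest = pts
    return regressed


# (word, frame.pts) in the order the log lines were emitted.
log_entries = [
    ("in", 98_976_192_417),
    ("evening.", 100_323_192_417),
    ("Hi!", 97_269_192_417),
    ("I'm", 97_420_192_417),
    ("help.", 98_047_192_417),
]

regressed = find_regressions(log_entries)
print(regressed)  # ['Hi!', "I'm", 'help.']
```

Every word of the second response is reported, because its pts falls below the pts of ‘evening.’ from the first response.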

Expected Behavior

  • The transcript should be stored in the correct order, ensuring that the order of word timestamps is maintained.

Steps to Reproduce

  1. Initiate a conversation with the voice bot.
  2. Interrupt the bot multiple times in quick succession.
  3. Observe the stored transcript and compare it with the expected order.

Additional Context

  • This issue is occurring specifically when handling multiple user interruptions.
  • The problem is observed with ElevenLabs TTS processing.

Please find the detailed logs and transcript attached:

test_3.log Recording link -link transcript.txt

Sample code gist -link

cc: @Vaibhav159 @tarungarg546

rahultyl avatar Mar 12 '25 09:03 rahultyl

@markbackman @aconchillo we are facing a similar issue without interruptions too. I have tried one workaround: increasing stop_time_sec in the TTS service's stop_frame_handler, so that it waits 7-8 secs instead of 2 secs. After the timeout, we push the TTSStoppedFrame, which resets the initial_word_timestamp; because of that reset, some words of a successive LLM response sometimes have a lower frame pts (presentation timestamp) than words already queued in words_queue.

Is there any downside to this, or any factors we should take into consideration when increasing stop_time_sec?

I have also raised it as a Q&A: link

Looking forward to your insights on this issue.

rahultyl avatar Mar 20 '25 12:03 rahultyl

Yes, @aconchillo and I were discussing this issue yesterday. We have work to do to keep the word timestamps ordered. We've been heads down on today's release, but will come back to this issue in the next week.

markbackman avatar Mar 20 '25 20:03 markbackman

@markbackman @aconchillo any insight on

I have tried one solution by increasing the stop_time_sec in TTS service stop_frame_handler which results in waiting for 7-8 secs instead of 2 secs.

Is there any downside to this, or any factors we should take into consideration when increasing stop_time_sec?

tarungarg546 avatar Mar 21 '25 04:03 tarungarg546

I'm sorry, I've searched in the code (latest main) and don't see stop_time_sec. Do you mean stop_frame_timeout_s?

If so, I'm not sure how this would solve the problem. The longer timeout means that you wait longer in the case where the TTS has stopped producing output but has not pushed a TTSStoppedFrame.

markbackman avatar Mar 21 '25 18:03 markbackman

I'm sorry, I've searched in the code (latest main) and don't see stop_time_sec. Do you mean stop_frame_timeout_s?

If so, I'm not sure how this would solve the problem. The longer timeout means that you wait longer in the case where the TTS has stopped producing output but has not pushed a TTSStoppedFrame.

My bad, yes, it is stop_frame_timeout_s. @markbackman This is not a permanent fix for the issue; it's just a temporary fix for our use case, where we get jumbled words in the opening statement (greeting) of the voice bot call.

Yes, you are right: the longer timeout means that you wait longer in the case where the TTS has stopped producing output but has not pushed a TTSStoppedFrame. During that wait, _initial_word_timestamp and _cumulative_time are not reset, so the word timestamps calculated for the next LLM response build on the last word timestamp of the previous LLM response, and the word order in the transcript is not broken.
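Roughly, the effect can be sketched like this. It is a simplified, hypothetical model rather than pipecat's code: `WordClock`, `stamp`, and `end_response` are made-up names, and the relative times are borrowed from the log in the issue description.

```python
from typing import List


class WordClock:
    """Hypothetical helper: maps per-response relative times onto one timeline."""

    def __init__(self) -> None:
        self.cumulative_ns = 0  # offset carried over from earlier responses
        self.last_ns = 0        # last absolute time handed out

    def stamp(self, relative_ns: int) -> int:
        self.last_ns = self.cumulative_ns + relative_ns
        return self.last_ns

    def end_response(self, reset: bool) -> None:
        # reset=True models the state being cleared between responses;
        # reset=False models the state surviving because the timeout has not fired.
        self.cumulative_ns = 0 if reset else self.last_ns


def timeline(reset: bool) -> List[int]:
    clock = WordClock()
    out = [clock.stamp(t) for t in (2_148_000_000, 3_495_000_000)]  # 'in', 'evening.'
    clock.end_response(reset)
    out += [clock.stamp(t) for t in (441_000_000, 592_000_000)]     # 'Hi!', "I'm"
    return out


kept = timeline(reset=False)
cleared = timeline(reset=True)
print(kept == sorted(kept))        # True: the second response continues the timeline
print(cleared == sorted(cleared))  # False: the second response restarts near zero
```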

Can you just confirm if there is any other downside of increasing the stop_frame_timeout_s?

rahultyl avatar Mar 24 '25 06:03 rahultyl

Can you just confirm if there is any other downside of increasing the stop_frame_timeout_s?

We haven't tested this, so YMMV. I can't see how this would solve the issue, but if it does for you, it should be an OK workaround.

markbackman avatar Mar 24 '25 14:03 markbackman