Transcript errors when using ElevenLabs Flash 2.5 TTS
A few users have reported errors in their bot transcripts that look like this:
For example. "Have you ever been hospitt alized" or "I understand. Does anyone in your ff amily have a history..."
We've tracked this to an issue with ElevenLabs's Flash 2.5 model. We've reported it to them, and they're working on a fix. In the meantime, we recommend using the Turbo 2.5 model, or another provider.
We'll close this issue when it's resolved.
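For anyone looking for the exact switch, here's a minimal sketch. The constructor parameters are assumptions based on Pipecat's ElevenLabsTTSService and may differ across versions:

import os

from pipecat.services.elevenlabs.tts import ElevenLabsTTSService

# Minimal sketch: select Turbo v2.5 instead of Flash 2.5. The voice_id is a
# hypothetical placeholder; the model parameter name is an assumption and
# may differ across Pipecat versions.
tts = ElevenLabsTTSService(
    api_key=os.getenv("ELEVENLABS_API_KEY"),
    voice_id="your-voice-id",
    model="eleven_turbo_v2_5",  # instead of "eleven_flash_v2_5"
)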
@chadbailey59 were we able to get anywhere from here? Did 11labs respond?
The ElevenLabs team acknowledged the issue and are working on it.
Is this ElevenLabs or the LLM? The LLM is usually the one that produces these outputs.
This is ElevenLabs. ElevenLabs outputs word/timestamp pairs, which we use to determine what the bot says, down to the word level. The TTS service outputs these as TTSTextFrames.
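To make that concrete, here's a rough sketch of the shape of the alignment data (field names taken from the workaround code later in this thread; the values are made up):

# Rough sketch of the per-chunk alignment payload ElevenLabs streams back.
# Field names match the patch further down in this thread; the values are
# illustrative only.
alignment_info = {
    "chars": ["H", "i", " ", "t", "h", "e", "r", "e"],
    "charStartTimesMs": [0, 40, 90, 130, 170, 210, 250, 290],
}
# Pipecat groups these characters into words with start times and emits
# each word downstream as a TTSTextFrame.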
Is there a corresponding issue on the elevenlabs side we can monitor?
Good question. I just asked the 11Labs team. I'll post when I hear back.
Did 11Labs ever respond about this one?
They acknowledged the issue again, but didn't provide a timeline for a fix...
@markbackman any timeline for this fix?
@tarungarg546 no timeline. The 11Labs team is aware of the issue. I've discussed it with them a handful of times, as recently as a few weeks ago. This is a model issue that they're working on, AFAIK.
I had the same issue, and I solved it by redefining the calculate_word_times function and using the following parametrization:
# Import path may vary across Pipecat versions.
from pipecat.processors.aggregators.llm_response import LLMAssistantAggregatorParams

context_aggregator = llm.create_context_aggregator(
    llm_context,
    assistant_params=LLMAssistantAggregatorParams(expect_stripped_words=False),
)
The problem with ElevenLabs is that some words may be split across the chunks we receive. So we can't strip() each chunk in post-processing and then add the space back, because we would insert an artificial space between two parts of the same word.
So I monkey-patch calculate_word_times from the ElevenLabs service with this function:
from typing import Any, List, Mapping, Tuple


def calculate_word_times_patched(
    alignment_info: Mapping[str, Any], cumulative_time: float
) -> List[Tuple[str, float]]:
    """Patched version of calculate_word_times from pipecat.services.elevenlabs.tts.

    Reason:
        ElevenLabs does not always return complete words, but the original
        function acts as if each chunk contained complete words, so a space
        is always added between them.

    Solution:
        Don't expect complete words; keep the spaces from the alignment chars.
    """
    zipped_times = list(
        zip(alignment_info["chars"], alignment_info["charStartTimesMs"])
    )
    words = []
    times = []
    current_word = ""
    for i, (char, start_ms) in enumerate(zipped_times):
        if char == " " or i == len(zipped_times) - 1:
            # Close the current word when we reach a space or the end of the
            # list. Note: the end of the chunk is probably not the end of a word!
            current_word += char
            if current_word.strip() != "":
                # Only add non-empty words.
                times.append(cumulative_time + (start_ms / 1000.0))
                words.append(current_word)
            current_word = ""
        else:
            current_word += char
    if alignment_info["chars"][0] == " ":
        # If the first character is a space, re-attach it to the first word.
        words[0] = " " + words[0]
    word_times = list(zip(words, times))
    return word_times
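Then install the patch at startup, e.g.:

# Minimal sketch of installing the monkey patch; the module path is taken
# from the docstring above and may vary by Pipecat version.
import pipecat.services.elevenlabs.tts as elevenlabs_tts

elevenlabs_tts.calculate_word_times = calculate_word_times_patched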
I also experienced this issue. As mentioned in #2117, I solved it by patching the word segmentation as follows:
Replace the word segmentation in the calculate_word_times() function from:
words = "".join(alignment_info["chars"]).split(" ")
to:
words = re.findall(r'\S+ ?', "".join(alignment_info["chars"]))
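(This needs import re at the top of the file.) The difference is that re.findall keeps each word's trailing space, so a fragment at the end of a chunk isn't treated as a complete word. A quick illustration:

import re

# Hypothetical chunk that ends mid-word ("hospit" here; "alized" arrives
# in the next chunk).
chunk = "Have you ever been hospit"

print(chunk.split(" "))
# ['Have', 'you', 'ever', 'been', 'hospit'] -> spaces are lost and must be
# re-added later, which inserts an artificial one inside the split word

print(re.findall(r"\S+ ?", chunk))
# ['Have ', 'you ', 'ever ', 'been ', 'hospit'] -> trailing spaces are kept,
# so "hospit" can join cleanly with the next chunk's "alized"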
Hi @markbackman, I see you just updated the src/pipecat/services/elevenlabs/tts.py file in this commit. Does this commit fix the transcript issue?
No, that fixes a word/timestamp alignment issue. The word splitting is an ElevenLabs issue where they deliver chunks with words split across chunk boundaries. We're discussing it with the ElevenLabs team.
Just chiming in to add that ElevenLabs Flash v2 (English-only) also has the problem (not just 2.5)... and yes, it's still doing it...
Logging after LLM:
2025-10-03 00:50:12.279 | DEBUG | pipecat.processors.logger:process_frame:71 | 004930b7-058c-4036-a9ac-d42b9a6abae2 - > LLM Output: LLMTextFrame#38(pts: None, text: [ the])
2025-10-03 00:50:12.279 | DEBUG | pipecat.processors.logger:process_frame:71 | 004930b7-058c-4036-a9ac-d42b9a6abae2 - > LLM Output: LLMTextFrame#39(pts: None, text: [ perfect])
2025-10-03 00:50:12.279 | DEBUG | pipecat.processors.logger:process_frame:71 | 004930b7-058c-4036-a9ac-d42b9a6abae2 - > LLM Output: LLMTextFrame#40(pts: None, text: [ time])
Logging after TTS:
2025-10-03 00:50:15.701 | DEBUG | pipecat.processors.logger:process_frame:71 | 004930b7-058c-4036-a9ac-d42b9a6abae2 - > Final Aggregator Input: TTSTextFrame#43(pts: 0:00:26.011836, text: [the])
2025-10-03 00:50:15.806 | DEBUG | pipecat.processors.logger:process_frame:71 | 004930b7-058c-4036-a9ac-d42b9a6abae2 - > Final Aggregator Input: TTSTextFrame#44(pts: 0:00:26.116836, text: [perfe])
2025-10-03 00:50:16.038 | DEBUG | pipecat.processors.logger:process_frame:71 | 004930b7-058c-4036-a9ac-d42b9a6abae2 - > Final Aggregator Input: TTSTextFrame#45(pts: 0:00:26.348836, text: [ct])
2025-10-03 00:50:16.131 | DEBUG | pipecat.processors.logger:process_frame:71 | 004930b7-058c-4036-a9ac-d42b9a6abae2 - > Final Aggregator Input: TTSTextFrame#46(pts: 0:00:26.441836, text: [time])
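When a downstream aggregator strips those frames and joins them with spaces, the split word comes out wrong; a rough illustration:

# Rough illustration of how the split TTSTextFrames above become a garbled
# transcript when joined with spaces.
frames = ["the", "perfe", "ct", "time"]
print(" ".join(frames))  # -> "the perfe ct time"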
@captaincaius yes, all Flash models have this issue.
Do you have any recommended workaround for this issue, @markbackman? Also, does it work with models other than the flash ones?
Turbo v2.5 works great and has more consistent speech, in my experience.
Fix pending in https://github.com/pipecat-ai/pipecat/pull/2840.
This is still an issue in the ElevenLabs Flash models, but we now have a workaround in place.
That said, many developers have pointed out that the Flash models produce mispronunciations that coincide with the word splitting. For this reason, we'll leave Turbo v2.5 as the default model. I've reported the mispronunciation issue to the ElevenLabs team so that they're aware of it.
For now, this issue is closed. No more word splitting in Pipecat with the Flash models.