
Transcript errors when using ElevenLabs Flash 2.5 TTS

Open chadbailey59 opened this issue 11 months ago • 8 comments

A few users have reported errors in their bot transcripts that look like this:

For example: "Have you ever been hospitt alized" or "I understand. Does anyone in your ff amily have a history..."

We've tracked this to an issue with ElevenLabs's Flash 2.5 model. We've reported it to them, and they're working on a fix. In the meantime, we recommend using the Turbo 2.5 model, or another provider.

We'll close this issue when it's resolved.

chadbailey59 avatar Jan 13 '25 17:01 chadbailey59

@chadbailey59 were we able to get anywhere from here? Did 11labs respond?

tarungarg546 avatar Mar 18 '25 09:03 tarungarg546

The ElevenLabs team has acknowledged the issue and is working on it.

markbackman avatar Mar 18 '25 11:03 markbackman

Is this an ElevenLabs issue or an LLM issue? The LLM is usually the one producing these outputs.

tarungarg546 avatar Mar 28 '25 11:03 tarungarg546

This is ElevenLabs. ElevenLabs outputs word/timestamp pairs, which we use to determine what the bot says, down to the word level. The TTS service outputs these as TTSTextFrames.

markbackman avatar Mar 28 '25 11:03 markbackman
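To make the word/timestamp mechanism concrete, here is a minimal standalone sketch (not the actual Pipecat source) of how per-character alignment data can be grouped into word/time pairs. The payload field names (`chars`, `charStartTimesMs`) match those used in the patches later in this thread, but the exact shape is an assumption for illustration.

```python
# Illustrative only: groups per-character timings from an assumed
# ElevenLabs-style alignment payload into (word, start_time_seconds) pairs.

def words_from_alignment(alignment):
    """Group per-character timings into (word, start_time_in_seconds) pairs."""
    text = "".join(alignment["chars"])
    pairs = []
    start_idx = 0
    for word in text.split(" "):
        if word:
            # The word's start time is the timestamp of its first character.
            pairs.append((word, alignment["charStartTimesMs"][start_idx] / 1000.0))
        start_idx += len(word) + 1  # skip past the word and the space after it
    return pairs

alignment = {
    "chars": list("Hello there"),
    "charStartTimesMs": [0, 40, 80, 120, 160, 200, 240, 280, 320, 360, 400],
}
print(words_from_alignment(alignment))
# [('Hello', 0.0), ('there', 0.24)]
```

Note that this naive split-on-spaces grouping is exactly what breaks when a chunk cuts a word in half, which is the failure mode discussed later in this thread.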

Is there a corresponding issue on the elevenlabs side we can monitor?

larry-cmz avatar Apr 07 '25 17:04 larry-cmz

> Is there a corresponding issue on the elevenlabs side we can monitor?

Good question. I just asked the 11Labs team. I'll post when I hear back.

markbackman avatar Apr 07 '25 22:04 markbackman

Did 11Labs ever respond about this one?

larry-cmz avatar May 05 '25 18:05 larry-cmz

> Did 11Labs ever respond about this one?

They acknowledged the issue again but didn't provide a timeline for a fix...

markbackman avatar May 05 '25 19:05 markbackman

@markbackman any timeline for this fix?

tarungarg546 avatar Jun 19 '25 10:06 tarungarg546

@tarungarg546 no timeline. The 11Labs team is aware of the issue. I've discussed it a handful of times with them, as recently as a few weeks ago. This is a model issue that they're working on, AFAIK.

markbackman avatar Jun 19 '25 11:06 markbackman

I had the same issue, and I solved it by redefining the calculate_word_times function and using the following parametrization:

context_aggregator = llm.create_context_aggregator(
    llm_context,
    assistant_params=LLMAssistantAggregatorParams(expect_stripped_words=False),
)

The problem with ElevenLabs is that some words may be split across the chunks we receive. So we cannot strip() each piece and then add a space back, because that would insert an artificial space between the two parts of the same word.

So I monkey-patched calculate_word_times from the ElevenLabs service with this function:

from typing import Any, List, Mapping, Tuple


def calculate_word_times_patched(
    alignment_info: Mapping[str, Any], cumulative_time: float
) -> List[Tuple[str, float]]:
    """Patched version of calculate_word_times from
    pipecat.services.elevenlabs.tts.

    Reason:
    - ElevenLabs does not always return complete words, but the original
      function treats every chunk as if it contained only complete words,
      so a space is always added between them.

    Solution:
    - Don't expect complete words; keep the spaces that appear in the
      alignment chars instead of re-adding them.
    """
    zipped_times = list(
        zip(alignment_info["chars"], alignment_info["charStartTimesMs"])
    )
    words = []
    times = []
    current_word = ""
    for i, (char, start_ms) in enumerate(zipped_times):
        current_word += char
        if char == " " or i == len(zipped_times) - 1:
            # Close the current word at a space or at the end of the chunk.
            # Note: the end of the chunk is not necessarily the end of a word!
            if current_word.strip() != "":
                # Only add non-empty words; the recorded time is that of the
                # character that closed the word.
                times.append(cumulative_time + (start_ms / 1000.0))
                words.append(current_word)
            current_word = ""

    if words and alignment_info["chars"][0] == " ":
        # If the first character is a space, keep it on the first word.
        words[0] = " " + words[0]

    return list(zip(words, times))

anotine10 avatar Jul 03 '25 10:07 anotine10

I also experienced this issue. As mentioned in #2117, I solved it by patching the word segmentation as follows:

Replace the word segmentation in the calculate_word_times() function from:

words = "".join(alignment_info["chars"]).split(" ")

to:

words = re.findall(r'\S+ ?', "".join(alignment_info["chars"]))

JeremyRss avatar Jul 03 '25 12:07 JeremyRss
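To see the difference between the two approaches, here is a small standalone comparison (illustrative, not the Pipecat source) using a chunk that ends mid-word:

```python
import re

chars = list("the perfe")  # a chunk where "perfect" is cut mid-word
text = "".join(chars)

split_words = text.split(" ")             # ['the', 'perfe']
regex_words = re.findall(r"\S+ ?", text)  # ['the ', 'perfe']
print(split_words, regex_words)
```

With the regex, each word keeps its own trailing space, so concatenating the pieces reproduces the original text exactly; the trailing fragment "perfe" carries no space and can be joined seamlessly with the "ct" that arrives in the next chunk.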

Hi @markbackman, I found that you just updated the src/pipecat/services/elevenlabs/tts.py file in this commit. Does this commit fix this transcript issue?

NoCtrlZ1110 avatar Jul 22 '25 19:07 NoCtrlZ1110

> Hi @markbackman, I found that you just updated the src/pipecat/services/elevenlabs/tts.py file in this commit. Does this commit fix this transcript issue?

No, this fixes a word/timestamp alignment issue. The word splitting is an ElevenLabs issue: they deliver chunks in which words are split across chunk boundaries. It's an issue we're discussing with the ElevenLabs team.

markbackman avatar Jul 22 '25 21:07 markbackman

Just chiming in to add that ElevenLabs Flash v2 (English-only) also has the problem (not just 2.5)... and yes, it's still doing it...

Logging after LLM:

Oct 02 17:50:12.279
2025-10-03 00:50:12.279 | DEBUG | pipecat.processors.logger:process_frame:71 | 004930b7-058c-4036-a9ac-d42b9a6abae2 - > LLM Output: LLMTextFrame#38(pts: None, text: [ the])
Oct 02 17:50:12.279
2025-10-03 00:50:12.279 | DEBUG | pipecat.processors.logger:process_frame:71 | 004930b7-058c-4036-a9ac-d42b9a6abae2 - > LLM Output: LLMTextFrame#39(pts: None, text: [ perfect])
Oct 02 17:50:12.279
2025-10-03 00:50:12.279 | DEBUG | pipecat.processors.logger:process_frame:71 | 004930b7-058c-4036-a9ac-d42b9a6abae2 - > LLM Output: LLMTextFrame#40(pts: None, text: [ time])

Logging after TTS:

Oct 02 17:50:15.701
2025-10-03 00:50:15.701 | DEBUG | pipecat.processors.logger:process_frame:71 | 004930b7-058c-4036-a9ac-d42b9a6abae2 - > Final Aggregator Input: TTSTextFrame#43(pts: 0:00:26.011836, text: [the])
Oct 02 17:50:15.806
2025-10-03 00:50:15.806 | DEBUG | pipecat.processors.logger:process_frame:71 | 004930b7-058c-4036-a9ac-d42b9a6abae2 - > Final Aggregator Input: TTSTextFrame#44(pts: 0:00:26.116836, text: [perfe])
Oct 02 17:50:16.038
2025-10-03 00:50:16.038 | DEBUG | pipecat.processors.logger:process_frame:71 | 004930b7-058c-4036-a9ac-d42b9a6abae2 - > Final Aggregator Input: TTSTextFrame#45(pts: 0:00:26.348836, text: [ct])
Oct 02 17:50:16.131
2025-10-03 00:50:16.131 | DEBUG | pipecat.processors.logger:process_frame:71 | 004930b7-058c-4036-a9ac-d42b9a6abae2 - > Final Aggregator Input: TTSTextFrame#46(pts: 0:00:26.441836, text: [time])

trycatchal avatar Oct 03 '25 01:10 trycatchal
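The logs above boil down to a one-line illustration: if each TTSTextFrame's text is treated as a complete word and the transcript is rebuilt by joining with spaces, the split word gets corrupted. (A sketch of the failure mode, not Pipecat's actual aggregation code.)

```python
# Text fields from the TTSTextFrames in the logs above.
frames = ["the", "perfe", "ct", "time"]

# Joining with spaces assumes every frame holds a complete word.
transcript = " ".join(frames)
print(transcript)
# the perfe ct time
```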

@captaincaius yes, all Flash models have this issue.

markbackman avatar Oct 03 '25 12:10 markbackman

Do you have any recommended workaround for this issue, @markbackman? Also, does it work with models other than the flash ones?

NoCtrlZ1110 avatar Oct 07 '25 19:10 NoCtrlZ1110

> Do you have any recommended workaround for this issue, @markbackman? Also, does it work with models other than the flash ones?

Turbo v2.5 works great and has more consistent speech, in my experience.

markbackman avatar Oct 07 '25 19:10 markbackman

Fix pending in https://github.com/pipecat-ai/pipecat/pull/2840.

markbackman avatar Oct 14 '25 15:10 markbackman

This is still an issue in the ElevenLabs Flash models, but we now have a workaround in Pipecat.

That said, many developers have pointed out that the Flash models also produce mispronunciations that coincide with the word splitting. For this reason, we will leave Turbo v2.5 as the default model. I've reported the mispronunciation issue to the ElevenLabs team so that they're aware of it.

For now, this issue is closed. No more word splitting in Pipecat with the Flash models.

markbackman avatar Oct 15 '25 13:10 markbackman