Cartesia SSML tags with decimal attributes (e.g., <speed ratio="1.05"/>) get split by TTS text aggregator; controls dropped
pipecat version
0.0.92
Python version
3.11
Operating System
macOS 12.5
Issue description
When sending SSML to Cartesia (Sonic-3) with decimal attributes (e.g.,
Reproduction steps
Assistant message sent to TTS:
Expected behavior
Cartesia receives text where
Actual behavior
Cartesia should receive SSML intact and apply speed/volume as per docs.
Logs
## Root Cause Analysis
**Code pointers (likely cause: sentence aggregation + EOS detection on "."):**
1. **TTS Service processes text through aggregator:**
```python
# pipecat/src/pipecat/services/tts_service.py:454-462
async def _process_text_frame(self, frame: TextFrame):
    text: Optional[str] = None
    if not self._aggregate_sentences:
        text = frame.text
    else:
        text = await self._text_aggregator.aggregate(frame.text)
```
2. **SkipTagsAggregator uses NLTK sentence tokenizer which splits on dots:**
```python
# pipecat/src/pipecat/utils/text/skip_tags_aggregator.py:69-86
if not self._current_tag:
    eos_marker = match_endofsentence(self._text)
    if eos_marker:
        result = self._text[:eos_marker]
        self._text = self._text[eos_marker:]
        return result
```
3. **match_endofsentence uses NLTK sent_tokenize which treats dots as sentence boundaries:**
```python
# pipecat/src/pipecat/utils/string.py:131-151
sentences = sent_tokenize(text)
if len(sentences) > 1:
    return len(first_sentence)
```
4. **Cartesia integration defaults to only skipping `<spell>` pairs, so self-closing tags aren't protected:**
```python
# pipecat/src/pipecat/services/cartesia/tts.py:186-187
text_aggregator=text_aggregator or SkipTagsAggregator([("<spell>", "</spell>")]),
```
**The issue:** When text like `<speed ratio="1.05"/>` is processed, NLTK's sentence tokenizer sees the dot in "1.05" as a sentence boundary and splits the text, truncating the SSML tag before it reaches Cartesia.
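To make the suspected failure mode concrete, here is a minimal, self-contained sketch. It does not use pipecat or NLTK: `naive_eos` is a hypothetical stand-in that treats every `.`, `!`, or `?` as a sentence boundary (the real `match_endofsentence` may behave differently), and `tag_aware_eos` is an illustrative guard that ignores terminators while inside an unclosed `<...>` tag. All names below are assumptions made for illustration.

```python
import re
from typing import Callable, List

# Hypothetical stand-in for an end-of-sentence matcher. It is NOT pipecat's
# match_endofsentence; it simply treats ".", "!" and "?" as boundaries.
_EOS = re.compile(r"[.!?]")


def naive_eos(text: str) -> int:
    """Return the index just past the first terminator, or 0 if none."""
    match = _EOS.search(text)
    return match.end() if match else 0


def tag_aware_eos(text: str) -> int:
    """Like naive_eos, but skip terminators that sit inside a <...> tag."""
    inside_tag = False
    for i, ch in enumerate(text):
        if ch == "<":
            inside_tag = True
        elif ch == ">":
            inside_tag = False
        elif not inside_tag and ch in ".!?":
            return i + 1
    return 0


def split_all(text: str, eos: Callable[[str], int]) -> List[str]:
    """Repeatedly cut text at the boundary reported by eos."""
    chunks: List[str] = []
    while True:
        cut = eos(text)
        if not cut:
            break
        chunks.append(text[:cut])
        text = text[cut:]
    if text:
        chunks.append(text)
    return chunks


message = '<speed ratio="1.05"/> Hello there. How are you?'
naive_chunks = split_all(message, naive_eos)      # tag is cut at "1."
safe_chunks = split_all(message, tag_aware_eos)   # tag stays in one chunk
```

With the naive matcher the first chunk ends in the middle of the attribute value, which matches the truncation described above. The tag-aware variant keeps the whole tag together with the sentence that follows it.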
Interesting... I am currently using pipecat-ai==0.0.90 and this has not been an issue for me. I did have to modify my text accumulator, though, because by default it removes SSML tags for transcription purposes, but not before the text is sent to the SSML provider (Cartesia in my case). Could there be a config issue in the code on your side?
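For context, the kind of accumulator change described here can be sketched as follows. This is only an illustration, not pipecat's implementation: the text sent to TTS keeps its tags, while a separate copy for the transcript has them stripped. The `for_transcript` helper and the regex are hypothetical.

```python
import re

# Matches opening, closing, and self-closing tags such as
# <speed ratio="1.05"/>, <spell>, or </spell>.
_TAG = re.compile(r"</?[A-Za-z][^<>]*?/?>")


def for_transcript(text: str) -> str:
    """Drop SSML-style tags and collapse the whitespace they leave behind."""
    stripped = _TAG.sub("", text)
    return re.sub(r"\s+", " ", stripped).strip()


# The TTS provider receives the tagged text unchanged...
tts_text = '<speed ratio="1.05"/> Your code is <spell>A1B2</spell>.'
# ...while the transcript gets the tag-free version.
transcript_text = for_transcript(tts_text)
```

The point is that stripping happens on a copy used for transcription, so it cannot explain tags being cut before they reach the provider.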
We've recently added some very powerful text processing capabilities to Pipecat and updated the docs to cover what's possible (https://docs.pipecat.ai/guides/learn/text-to-speech#text-processing-and-filtering).
A more targeted answer: you can now easily handle speed tags for Cartesia. Check out the docs: https://docs.pipecat.ai/server/services/tts/cartesia#speed-tag-speed:-float-%3E-str:
To use this, you need to update to pipecat-ai 0.0.96.
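If it helps, here is a small sketch for checking whether the installed package meets that minimum before relying on the speed tag support. It uses only the standard library. The helper names are hypothetical, and the parser is deliberately simple: it stops at the first non-numeric component, so a pre-release such as `0.0.96.dev1` is treated as `0.0.96`.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional, Tuple

# Minimum version named in this thread for the Cartesia speed tag support.
MIN_VERSION = (0, 0, 96)


def parse_version(v: str) -> Tuple[int, ...]:
    """Turn '0.0.96' into (0, 0, 96); stop at the first non-numeric part."""
    parts = []
    for piece in v.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def installed_pipecat() -> Optional[Tuple[int, ...]]:
    """Return the installed pipecat-ai version, or None if not installed."""
    try:
        return parse_version(version("pipecat-ai"))
    except PackageNotFoundError:
        return None


def meets_minimum(v: Optional[Tuple[int, ...]]) -> bool:
    """True when a version is present and at least MIN_VERSION."""
    return v is not None and v >= MIN_VERSION
```

Calling `meets_minimum(installed_pipecat())` in your environment tells you whether an upgrade is needed first.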