Feature Request: AzureTTSService should emit word boundaries via synthesis_word_boundary
## Problem Statement

Currently, Pipecat's `AzureTTSService` does not emit word boundaries, even though the Azure Speech SDK supports this via the `synthesis_word_boundary` event.
## Proposed Solution

- Subscribe to the `synthesis_word_boundary` event in `AzureTTSService.start()`
- Handle each word boundary event and emit a `TTSTextFrame` with word-level text and the corresponding audio offset (timing)
- Use `evt.text` for the word and `evt.audio_offset` (divide by 10,000 for ms) for synchronization
Example implementation:

```python
def _handle_word_boundary(self, evt):
    # evt.text contains the word
    # evt.audio_offset contains timing (in ticks; divide by 10,000 for ms)
    pass


async def start(self, frame: StartFrame):
    await super().start(frame)
    # ... existing setup code ...
    self._speech_synthesizer.synthesis_word_boundary.connect(self._handle_word_boundary)
```
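For reference, the underlying SDK event can be exercised standalone, outside Pipecat. A minimal sketch, assuming the `azure-cognitiveservices-speech` package and placeholder credentials:

```python
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(subscription="YOUR_KEY", region="YOUR_REGION")
# audio_config=None keeps the synthesized audio in memory instead of playing it.
synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=None)


def on_word_boundary(evt):
    # evt.audio_offset is in ticks (100 ns units), so divide by 10,000 for ms.
    print(f"word={evt.text!r} offset_ms={evt.audio_offset / 10_000:.1f}")


synthesizer.synthesis_word_boundary.connect(on_word_boundary)
synthesizer.speak_text_async("Hello from Azure word boundaries").get()
```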
## Alternative Solutions
No response
## Additional Context
References:
## Would you be willing to help implement this feature?
- [x] Yes, I'd like to contribute
- [ ] No, I'm just suggesting
@remisharrock if you're willing to contribute, that would be really great!
You would need to change the base class to `InterruptibleWordTTSService` and handle the word calculation. There are examples of how this is done in Cartesia, ElevenLabs, and Rime. Please tag me on the PR if you get time to work on this.
Dear @markbackman, I investigated this issue with my copilot partner as you can see.
I noticed you initially suggested using `InterruptibleWordTTSService` as the base class. However, after studying the Azure SDK architecture and the existing implementations, it looks like `WordTTSService` is a better fit, but I'd like your opinion on that.
What I understood with my copilot is that, unlike Cartesia, ElevenLabs, and Rime, which require explicit WebSocket connection management, the Azure Speech SDK:

- Handles all WebSocket connections internally via `SpeechSynthesizer`
- Provides high-level callbacks (`synthesis_word_boundary`, `synthesizing`, `synthesis_completed`)
- Does not expose or require manual WebSocket handling
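Everything stays at the callback level. A rough sketch of what I mean (placeholder credentials; nothing here touches a socket directly):

```python
import azure.cognitiveservices.speech as speechsdk

config = speechsdk.SpeechConfig(subscription="YOUR_KEY", region="YOUR_REGION")
synthesizer = speechsdk.SpeechSynthesizer(speech_config=config, audio_config=None)

# All connection management happens inside SpeechSynthesizer; we only register
# callbacks on its event signals.
synthesizer.synthesizing.connect(lambda evt: print(f"audio chunk: {len(evt.result.audio_data)} bytes"))
synthesizer.synthesis_word_boundary.connect(lambda evt: print(f"word boundary: {evt.text}"))
synthesizer.synthesis_completed.connect(lambda evt: print("synthesis completed"))

synthesizer.speak_text_async("Hello").get()
```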
Why `InterruptibleWordTTSService` doesn't fit:

`InterruptibleWordTTSService` requires implementing abstract methods like:

- `_connect_websocket()`
- `_disconnect_websocket()`
- `_receive_messages()`
- `_connect()`
- `_disconnect()`
These methods don't make sense for Azure because we never directly interact with the WebSocket - it's completely abstracted away by the SDK.
So the attempted implementation:

- Uses `WordTTSService`, which provides word timestamp support without requiring WebSocket abstractions
- Azure SDK callbacks are synchronous, so it calls `put_nowait()` directly to add timestamps (see the sketch just below this list); do you also think that's good practice?
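Roughly the pattern I mean, reduced to a self-contained sketch; the class and attribute names are placeholders rather than Pipecat's actual internals, and whether it's safe to call `put_nowait()` from the SDK's callback thread is part of what I'm asking:

```python
import asyncio


class WordBoundaryBridge:
    """Bridges Azure's synchronous word-boundary callbacks into asyncio."""

    def __init__(self):
        self._timestamps = asyncio.Queue()  # holds (word, offset_ms) tuples

    def on_word_boundary(self, evt):
        # Called synchronously by the Azure SDK; put_nowait() never blocks.
        # evt.audio_offset is in 100 ns ticks, so / 10_000 gives milliseconds.
        self._timestamps.put_nowait((evt.text, evt.audio_offset / 10_000))

    async def consume(self):
        # On the Pipecat side, an async task would drain the queue and turn
        # each entry into a word/timestamp frame.
        while True:
            word, offset_ms = await self._timestamps.get()
            print(f"{word} @ {offset_ms:.1f} ms")
```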
Let me know if you think `InterruptibleWordTTSService` is still necessary for other reasons, or if this approach works for the main repository.
All of that is explained in my fork and in this pull request: https://github.com/remisharrock/pipecat/pull/2
Something else:
While working on the Azure word boundaries feature, I noticed something interesting about how the timestamps flow through the system. I'd like your perspective on this.
That raised questions for me about how timestamps propagate to clients. I see that `WordTTSService` implementations (Azure, Cartesia, ElevenLabs) generate word timestamps via `frame.pts`, but when I looked at the `RTVIObserver`, it seems these timestamps aren't being sent to clients in the `bot-tts-text` messages.
My use case: I'm trying to build karaoke-style synchronized subtitles where words highlight precisely as they're spoken. I also have UI elements that need to animate in sync with specific words in the audio (for example, highlighting parts of a diagram as the bot explains them, or updating visual indicators exactly when certain words are pronounced).
A few questions:

- Is this intentional? Are word timestamps meant to stay server-side only, or is there a use case for clients to receive them?
- How are Cartesia and ElevenLabs word boundaries currently being used? I'm curious whether they have a different mechanism for exposing timestamps to clients, or whether others have solved this kind of synchronization challenge differently.
- For the RTVI protocol: would it make sense to optionally include timestamps in `bot-tts-text` messages for these kinds of synchronized UI use cases? Or is there already a pattern I'm missing for how clients should access this timing information?
I drafted a potential solution in https://github.com/remisharrock/pipecat/pull/3 that adds an optional timestamp field to maintain backward compatibility, but I wanted to understand the design intent first before proposing it.
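For illustration only, here is the rough shape I had in mind; this isn't the actual RTVI schema or necessarily what the draft PR does, and the `timestamp_ms` field name is just a placeholder:

```python
# Hypothetical message shape; the optional timing field is my own invention.
bot_tts_text_message = {
    "label": "rtvi-ai",
    "type": "bot-tts-text",
    "data": {
        "text": "hello",
        # Optional addition: when the word is spoken, relative to the start of
        # the bot's audio. Clients that don't know the field would ignore it.
        "timestamp_ms": 1234.5,
    },
}
```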
What are your thoughts on this? Am I approaching this the right way, or is there a better pattern already in place that I should follow?
@remisharrock apologies for the long delay. This is a really great writeup. Yes, you're right. Since Azure is managing the connection internally, I agree that `WordTTSService` is the right subclass.
Also, as of 0.0.96, we've released a new client-side event called `bot-output` that allows you to build a karaoke-style UI. You can see an example of that working here:
https://github.com/pipecat-ai/pipecat-examples/tree/main/code-helper
I haven't looked at your PR in detail, but it seems like you're on the right track and it would make sense to contribute this to Pipecat. If you're up for it, please open a PR and I can try to get some time in the next week to take a look.
Again, thanks so much for putting time into this 🙇
@markbackman thank you for your kind reply. Yes, I saw 0.0.96 and was indeed very interested in `bot-output`! Let me investigate and get back to you when I have a little bit of time.