Feature Request: AzureTTSService should emit word boundaries via synthesis_word_boundary
## Problem Statement

Currently, Pipecat's `AzureTTSService` does not emit word boundaries, even though the Azure Speech SDK supports this via the `synthesis_word_boundary` event.
## Proposed Solution

- Subscribe to the `synthesis_word_boundary` event in `AzureTTSService.start()`
- Handle each word boundary event and emit a `TTSTextFrame` with word-level text and the corresponding audio offset (timing)
- Use `evt.text` for the word and `evt.audio_offset` (divide by 10,000 for ms) for synchronization
Example implementation:

```python
def _handle_word_boundary(self, evt):
    # evt.text contains the word
    # evt.audio_offset contains timing (in ticks; divide by 10,000 for ms)
    pass


async def start(self, frame: StartFrame):
    await super().start(frame)
    # ... existing setup code ...
    self._speech_synthesizer.synthesis_word_boundary.connect(self._handle_word_boundary)
```
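For reference, the underlying SDK event can be exercised standalone, outside Pipecat. A minimal sketch, assuming the `azure-cognitiveservices-speech` package and placeholder credentials:

```python
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(subscription="YOUR_KEY", region="YOUR_REGION")
# audio_config=None keeps the synthesized audio in memory instead of playing it.
synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=None)


def on_word_boundary(evt):
    # evt.audio_offset is in ticks (100 ns units), so divide by 10,000 for ms.
    print(f"word={evt.text!r} offset_ms={evt.audio_offset / 10_000:.1f}")


synthesizer.synthesis_word_boundary.connect(on_word_boundary)
synthesizer.speak_text_async("Hello from Azure word boundaries").get()
```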
## Alternative Solutions
No response
## Additional Context
References:
## Would you be willing to help implement this feature?
- [x] Yes, I'd like to contribute
- [ ] No, I'm just suggesting
@remisharrock if you're willing to contribute, that would be really great!
You would need to change the base class to `InterruptibleWordTTSService` and handle the word calculation. There are examples of how this is done in Cartesia, ElevenLabs, and Rime. Please tag me on the PR if you get time to work on this.
Dear @markbackman, I investigated this issue with my copilot partner as you can see.
I noticed you initially suggested using `InterruptibleWordTTSService` as the base class. However, after studying the Azure SDK architecture and the existing implementations, it looks like `WordTTSService` is a better fit, but I'd like your opinion on that.
What I understood with my copilot is that, unlike Cartesia, ElevenLabs, and Rime, which require explicit WebSocket connection management, the Azure Speech SDK:

- Handles all WebSocket connections internally via `SpeechSynthesizer`
- Provides high-level callbacks (`synthesis_word_boundary`, `synthesizing`, `synthesis_completed`)
- Does not expose or require manual WebSocket handling
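Everything stays at the callback level. A rough sketch of what I mean (placeholder credentials; nothing here touches a socket directly):

```python
import azure.cognitiveservices.speech as speechsdk

config = speechsdk.SpeechConfig(subscription="YOUR_KEY", region="YOUR_REGION")
synthesizer = speechsdk.SpeechSynthesizer(speech_config=config, audio_config=None)

# All connection management happens inside SpeechSynthesizer; we only register
# callbacks on its event signals.
synthesizer.synthesizing.connect(lambda evt: print(f"audio chunk: {len(evt.result.audio_data)} bytes"))
synthesizer.synthesis_word_boundary.connect(lambda evt: print(f"word boundary: {evt.text}"))
synthesizer.synthesis_completed.connect(lambda evt: print("synthesis completed"))

synthesizer.speak_text_async("Hello").get()
```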
Why `InterruptibleWordTTSService` doesn't fit:

`InterruptibleWordTTSService` requires implementing abstract methods like:

- `_connect_websocket()`
- `_disconnect_websocket()`
- `_receive_messages()`
- `_connect()`
- `_disconnect()`
These methods don't make sense for Azure because we never directly interact with the WebSocket - it's completely abstracted away by the SDK.
So the attempted implementation:

- Uses `WordTTSService`, which provides word timestamp support without requiring WebSocket abstractions
- Azure SDK callbacks are synchronous, so it calls `put_nowait()` directly to add timestamps (see the sketch just below this list); do you also think that's good practice?
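Roughly the pattern I mean, reduced to a self-contained sketch; the class and attribute names are placeholders rather than Pipecat's actual internals, and whether it's safe to call `put_nowait()` from the SDK's callback thread is part of what I'm asking:

```python
import asyncio


class WordBoundaryBridge:
    """Bridges Azure's synchronous word-boundary callbacks into asyncio."""

    def __init__(self):
        self._timestamps = asyncio.Queue()  # holds (word, offset_ms) tuples

    def on_word_boundary(self, evt):
        # Called synchronously by the Azure SDK; put_nowait() never blocks.
        # evt.audio_offset is in 100 ns ticks, so / 10_000 gives milliseconds.
        self._timestamps.put_nowait((evt.text, evt.audio_offset / 10_000))

    async def consume(self):
        # On the Pipecat side, an async task would drain the queue and turn
        # each entry into a word/timestamp frame.
        while True:
            word, offset_ms = await self._timestamps.get()
            print(f"{word} @ {offset_ms:.1f} ms")
```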
Let me know if you think `InterruptibleWordTTSService` is still necessary for other reasons, or if this approach works for the main repository.
All of that is explained in my fork and in this pull request: https://github.com/remisharrock/pipecat/pull/2
Something else:
While working on the Azure word boundaries feature, I noticed something interesting about how the timestamps flow through the system. I'd like your perspective on this.
That raised questions for me about how timestamps propagate to clients. I see that `WordTTSService` implementations (Azure, Cartesia, ElevenLabs) generate word timestamps via `frame.pts`, but when I looked at the `RTVIObserver`, it seems these timestamps aren't being sent to clients in the `bot-tts-text` messages.
My use case: I'm trying to build karaoke-style synchronized subtitles where words highlight precisely as they're spoken. I also have UI elements that need to animate in sync with specific words in the audio (for example, highlighting parts of a diagram as the bot explains them, or updating visual indicators exactly when certain words are pronounced).
A few questions:

- Is this intentional? Are word timestamps meant to stay server-side only, or is there a use case for clients to receive them?
- How are Cartesia and ElevenLabs word boundaries currently being used? I'm curious whether they have a different mechanism for exposing timestamps to clients, or whether others have solved this kind of synchronization challenge differently.
- For the RTVI protocol: would it make sense to optionally include timestamps in `bot-tts-text` messages for these kinds of synchronized UI use cases? Or is there already a pattern I'm missing for how clients should access this timing information?
I drafted a potential solution in https://github.com/remisharrock/pipecat/pull/3 that adds an optional timestamp field to maintain backward compatibility, but I wanted to understand the design intent first before proposing it.
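For illustration only, here is the rough shape I had in mind; this isn't the actual RTVI schema or necessarily what the draft PR does, and the `timestamp_ms` field name is just a placeholder:

```python
# Hypothetical message shape; the optional timing field is my own invention.
bot_tts_text_message = {
    "label": "rtvi-ai",
    "type": "bot-tts-text",
    "data": {
        "text": "hello",
        # Optional addition: when the word is spoken, relative to the start of
        # the bot's audio. Clients that don't know the field would ignore it.
        "timestamp_ms": 1234.5,
    },
}
```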
What are your thoughts on this? Am I approaching this the right way, or is there a better pattern already in place that I should follow?
@remisharrock apologies for the long delay. This is a really great writeup. Yes, you're right. Since Azure is managing the connection internally, I agree that `WordTTSService` is the right subclass.
Also, as of 0.0.96, we've released a new client-side event called `bot-output` that allows you to build a karaoke-style UI. You can see an example of that working here:
https://github.com/pipecat-ai/pipecat-examples/tree/main/code-helper
I haven't looked at your PR in detail, but it seems like you're on the right track and it would make sense to contribute this to Pipecat. If you're up for it, please open a PR and I can try to get some time in the next week to take a look.
Again, thanks so much for putting time into this 🙇
@markbackman thank you for your kind reply. Yes, I saw 0.0.96 and was indeed very interested in `bot-output`! Let me investigate and get back to you when I have a little bit of time.