Streaming implementation for Google TTS (Chirp3 HD)
Problem Statement
I am part of the GCP sales engineering team and would like to request integration of the streaming module into the Google TTS (Chirp 3 HD) code.
Right now, latency in batch mode is far too high for any reasonable voice-to-voice conversation.
Proposed Solution
The relevant message type is StreamingSynthesizeRequest, and the API details are here.
TTS latency in streaming mode (~200ms) is much better than in batch mode (~800ms).
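
To make the proposal concrete, here is a minimal sketch of how a streaming call could look with the google-cloud-texttospeech Python client. It assumes a client version recent enough to expose streaming_synthesize. The helper names build_requests and stream_audio and the voice name en-US-Chirp3-HD-Charon are illustrative choices, not part of Pipecat.

```python
from typing import Iterable, Iterator


def build_requests(tts, text_chunks: Iterable[str], voice_name: str,
                   language_code: str) -> Iterator:
    """Yield the config request first, then one input request per text chunk.

    `tts` is the `google.cloud.texttospeech` module, passed in so the
    ordering logic can be exercised without the library installed.
    """
    config = tts.StreamingSynthesizeConfig(
        voice=tts.VoiceSelectionParams(name=voice_name,
                                       language_code=language_code)
    )
    # The first message on the stream carries only the configuration.
    yield tts.StreamingSynthesizeRequest(streaming_config=config)
    # Every following message carries a piece of plain text.
    for chunk in text_chunks:
        yield tts.StreamingSynthesizeRequest(
            input=tts.StreamingSynthesisInput(text=chunk)
        )


def stream_audio(text_chunks: Iterable[str],
                 voice_name: str = "en-US-Chirp3-HD-Charon",  # assumed voice
                 language_code: str = "en-US") -> Iterator[bytes]:
    """Yield audio bytes as they arrive. Needs credentials and network."""
    from google.cloud import texttospeech as tts  # imported lazily

    client = tts.TextToSpeechClient()
    responses = client.streaming_synthesize(
        build_requests(tts, text_chunks, voice_name, language_code)
    )
    for response in responses:
        yield response.audio_content
```

Because text_chunks can be a generator fed by the LLM, audio for the first sentence can start playing while later text is still being produced, which is where the latency gain would come from.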
Alternative Solutions
No response
Additional Context
I have already had a similar update made in LiveKit, and it works there. I would love to create a PR for this and contribute.
Would you be willing to help implement this feature?
- [x] Yes, I'd like to contribute
- [ ] No, I'm just suggesting
@markbackman @aconchillo - FYR please
Hello, I wanted to try this out and implemented the change. It is very fast indeed!
- I noticed a slight quality degradation (everything else unchanged). Do you have an idea what could cause this? The sample rate is 24000.
- StreamingAudioConfig should support speaking_rate, but it says the parameter is unknown. Maybe Pipecat uses an older version? Thank you!
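
One way to check the "older version" theory is to look at the installed client version and probe whether the message class accepts the field. The sketch below is a rough diagnostic, not a definitive test. The helper names are hypothetical, and it assumes the client rejects unknown fields at construction time with ValueError or TypeError.

```python
from importlib import metadata


def installed_version(package: str):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def accepts_fields(message_cls, **fields) -> bool:
    """Return True if `message_cls(**fields)` can be constructed.

    Assumption: unknown fields are rejected at construction time
    with ValueError or TypeError.
    """
    try:
        message_cls(**fields)
    except (ValueError, TypeError):
        return False
    return True


# Possible usage (requires the library to be installed):
# print(installed_version("google-cloud-texttospeech"))
# from google.cloud import texttospeech
# print(accepts_fields(texttospeech.StreamingAudioConfig, speaking_rate=1.0))
```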
Ah, I see: StreamingSynthesisInput doesn't support SSML, so we should keep the non-streaming version for SSML and older voices.
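
The split described above could be expressed as a small routing check. This is only a sketch under stated assumptions: it treats any input starting with a `<speak` tag as SSML, and it assumes Chirp 3 HD voice names contain "Chirp3-HD" (for example en-US-Chirp3-HD-Charon). Both function names are hypothetical.

```python
def is_ssml(text: str) -> bool:
    """Heuristic: treat input that starts with a <speak> tag as SSML."""
    return text.lstrip().lower().startswith("<speak")


def use_streaming(text: str, voice_name: str) -> bool:
    """Stream only plain text on Chirp 3 HD voices.

    Assumption: Chirp 3 HD voice names contain "Chirp3-HD".
    """
    return "chirp3-hd" in voice_name.lower() and not is_ssml(text)
```

Anything that fails the check (SSML input or an older voice) would fall back to the existing non-streaming path.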
@aristid one option is to keep both a streaming version and a non-streaming version. We have other examples of this. For example, CartesiaTTSService is the WebSocket API service and CartesiaHttpTTSService is the HTTP API service. We could follow a similar pattern for Google: GoogleTTSService for streaming and GoogleHttpTTSService for non-streaming (or some other concise but descriptive name).
I have it working now. Should I make a pull request? GoogleStreamingTTSService, to keep the HTTP version working? It really is much faster, which makes standard Gemini a feasible alternative to Gemini Live. The quality is a bit worse, though; I hope @manishkjs1 can say something about that.
> Should I make a pull request?

Yes, please! 🙌
> GoogleStreamingTTSService, to keep the HTTP version working? It really is much faster, which makes standard Gemini a feasible alternative to Gemini Live. The quality is a bit worse, though; I hope @manishkjs1 can say something about that.

Our convention has been:
- f"{service_name}TTSService" for the primary implementation, e.g. GoogleTTSService
- f"{service_name}{description}TTSService" for the second implementation. To date, the description has been Http, so GoogleHttpTTSService
We can denote the change in the changelog.
What about STT with Chirp 3?