
[Feature request] Incremental TTS

Open Runtrons opened this issue 1 year ago • 3 comments

Hello, I'm currently developing a voice assistant aimed at providing seamless, real-time interactions for various applications. I've run into a common yet significant challenge: the trade-off between responsiveness and quality (as I am sure you have dealt with too), particularly in the context of large language models.

In traditional setups, the entire input must be processed before the TTS system can generate output, leading to latency that makes the difference between a usable and an unusable assistant. To address this, I propose the integration of Incremental Text-to-Speech (also known as Streaming TTS).

Incremental TTS allows the TTS system to start vocalizing content without requiring the entire input text upfront. This process significantly enhances the response speed, making the interaction feel more natural and akin to human conversation.
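To make the distinction concrete, here is a rough conceptual sketch. The names `tts_blocking`, `tts_incremental`, and `play` are hypothetical placeholders, not part of the Coqui API: a blocking interface returns audio only after the full text is synthesized, while an incremental interface yields audio chunks as soon as they are ready.

```python
# Hypothetical interfaces to illustrate the difference; not actual Coqui APIs.

def blocking_usage(tts_blocking, play, text):
    # Blocking TTS: nothing is audible until the whole text has been synthesized.
    audio = tts_blocking(text)      # waits for the full utterance
    play(audio)                     # playback starts only now

def incremental_usage(tts_incremental, play, text):
    # Incremental (streaming) TTS: audio chunks are yielded as they are produced,
    # so playback can begin after the first chunk instead of the last one.
    for chunk in tts_incremental(text):
        play(chunk)
```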

The advantage of Incremental TTS is that it can drastically reduce waiting time for the user. I believe that integrating Incremental TTS would be a significant step forward in improving the overall performance and user satisfaction of Coqui TTS. I look forward to discussing this further and exploring how we could implement it.

Runtrons avatar Dec 13 '23 00:12 Runtrons

Before processing, the program already performs sentence splitting. It would probably be possible to pass a callback that returns each individual sentence's audio as soon as that sentence has been processed; see the sketch below.
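A minimal sketch of that idea, assuming the `TTS.api.TTS` interface and a naive regex sentence splitter. The `on_sentence_audio` callback, the splitter, and the model name are illustrative choices, not existing Coqui features:

```python
import re

from TTS.api import TTS

def tts_by_sentence(text, on_sentence_audio,
                    model_name="tts_models/en/ljspeech/tacotron2-DDC"):
    """Synthesize text sentence by sentence, invoking a callback per sentence.

    `on_sentence_audio` receives (sentence, wav) as soon as each sentence has
    been synthesized, instead of waiting for the whole text to finish.
    """
    tts = TTS(model_name=model_name)
    # Naive sentence splitter for illustration; Coqui uses its own splitter internally.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    for sentence in sentences:
        wav = tts.tts(text=sentence)   # list of float samples
        on_sentence_audio(sentence, wav)

# Example usage: report how much audio is ready after each sentence.
# tts_by_sentence("First sentence. Second sentence.",
#                 lambda s, wav: print(s, len(wav)))
```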

FlorianEagox avatar Dec 13 '23 03:12 FlorianEagox

XTTS can already do streaming. You can find an example of how to use it in the xtts-streaming-server repository. It consists of a server, and there is also an example that shows how to run inference against the server: https://github.com/coqui-ai/xtts-streaming-server/blob/main/test/test_streaming.py
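For reference, XTTS also exposes streaming inference directly via `Xtts.inference_stream`. A rough sketch, assuming a local XTTS checkpoint (the checkpoint paths and the reference wav are placeholders; see the XTTS documentation for the exact options):

```python
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

# Load a local XTTS checkpoint (paths are placeholders).
config = XttsConfig()
config.load_json("/path/to/xtts/config.json")
model = Xtts.init_from_config(config)
model.load_checkpoint(config, checkpoint_dir="/path/to/xtts/")
model.cuda()

# Compute speaker conditioning from a reference clip (placeholder path).
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(
    audio_path=["/path/to/reference.wav"]
)

# inference_stream yields audio chunks as they are generated, so playback
# can start before the full sentence has been synthesized.
chunks = model.inference_stream(
    "It took me quite a long time to develop a voice.",
    "en",
    gpt_cond_latent,
    speaker_embedding,
)
for i, chunk in enumerate(chunks):
    print(f"chunk {i}: {chunk.shape[-1]} samples")
```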

Edresson avatar Dec 13 '23 14:12 Edresson

Thanks for the reply. This is definitely helpful. The thing I am looking for most is 'text_splitting', but with the ability to stream my text into the model. In other words, I want the input text to start being spoken while it is still being generated. I did not see this in the inference example. Which part of the repo deals with text_splitting, and how does it work? Is the model able to retain context across the splits?
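One way to get that behavior without any library changes is to buffer the incoming token stream yourself and hand each completed sentence to the synthesizer. A minimal sketch; `token_stream` and `synthesize` are hypothetical stand-ins for your LLM output and whichever TTS call you use (for example the per-sentence helper or `inference_stream` above):

```python
import re

# Matches one complete sentence (ending in . ! or ?) at the start of the buffer,
# plus any trailing whitespace.
SENTENCE_END = re.compile(r"([^.!?]*[.!?])\s*", re.S)

def speak_streaming_text(token_stream, synthesize):
    """Speak text as it arrives: flush the buffer to TTS at every sentence boundary."""
    buffer = ""
    for token in token_stream:          # e.g. tokens from an LLM as they arrive
        buffer += token
        while True:
            match = SENTENCE_END.match(buffer)
            if not match:
                break                   # no complete sentence yet, keep buffering
            synthesize(match.group(1).strip())   # speak the completed sentence
            buffer = buffer[match.end():]
    if buffer.strip():                  # speak whatever trails after the last boundary
        synthesize(buffer.strip())
```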

Runtrons avatar Dec 13 '23 21:12 Runtrons

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. You might also look at our discussion channels.

stale[bot] avatar Jan 14 '24 09:01 stale[bot]