RealtimeTTS
Unclear output with the CoquiEngine when using short input feed
Hi! For my use case I am feeding the engine the sentence word by word. Using the SystemEngine I got a somewhat coherent sentence (the words were clear but the sentence was too fast), but with the CoquiEngine the words become very unclear and I experience pauses. I tried raising buffer_threshold_seconds to 7, but with no apparent improvement. Any suggestions on how I can improve the output? When feeding the engine complete sentences I got pretty good results. I am also using voice cloning, but the phenomenon persists with the default voice too. Thank you!
Sounds like you are feeding word by word into a playing stream, something RealtimeTTS currently can't handle well. It can't safely determine sentence boundaries in this case, so it breaks the synthesis up into tiny fragments as they come in, which leads to performance and quality issues.
My suggestion would be to encapsulate the incoming words in a generator. That way RealtimeTTS knows there may be more words coming soon, so it does not synthesize immediately but waits for sentence boundaries:
```python
import queue
import threading


class BufferStream:
    def __init__(self):
        self.items = queue.Queue()
        self._stop_event = threading.Event()

    def add(self, item: str) -> None:
        self.items.put(item)

    def stop(self):
        self._stop_event.set()

    def gen(self):
        while not self._stop_event.is_set():
            try:
                yield self.items.get(timeout=0.1)
            except queue.Empty:
                continue
```
Add words to an instance of BufferStream and feed the generator returned by its gen() method to RealtimeTTS.
(Sorry, I know it's more complicated than it should be; I need to somehow integrate this into RealtimeTTS.)
So would it be

```python
class BufferStream: ...

buffer = BufferStream()
engine = CoquiEngine()
stream = TextToAudioStream(engine)
stream.feed(buffer.gen(content))
```
More like this (haven't tested but should work):
```python
import time

buffer = BufferStream()
engine = CoquiEngine()
stream = TextToAudioStream(engine)

buffer.add("Hello ")
buffer.add("World.")

stream.feed(buffer.gen())
stream.play_async()

buffer.add("More ")
buffer.add("words ")
buffer.add("to play.")

while stream.is_playing():
    time.sleep(0.1)
```
Thanks, it does work! But I'm having an issue with Coqui where the OpenAI API stream sometimes splits words:

t he house been fore closed last year.

instead of

"The house been foreclosed last year"
What can we do to only feed full words into the stream?
The OpenAI text stream (and that of most other LLMs) comes in tokens, which can be only parts of words. RealtimeTTS can work with that: it recognizes word boundaries because complete words coming from the LLM carry a blank space, while split word parts do not. So if you just feed the tokens to RealtimeTTS as they come in, you're good. (That's also my private use case for the library: I feed LLM streams either directly as a stream, or through a buffer like the one shown above if I want to do processing first.)
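To make that concrete, here is a tiny sketch (the token strings are made up for illustration): in an OpenAI-style stream, a token that starts a new word carries its leading space, while a token that continues a word does not, so plain concatenation reconstructs the text and the spaces tell RealtimeTTS where word boundaries are.

```python
# Hypothetical tokens as an OpenAI-style stream might deliver them:
# " sun" starts a word (leading space), "s" continues one (no space).
tokens = ["The", " sun", " set", "s", ",", " illuminating", " the", " sky", "."]

# Concatenating the tokens verbatim restores the original text,
# because the spaces are part of the tokens themselves.
text = "".join(tokens)
print(text)  # The sun sets, illuminating the sky.
```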
Yeah, that would make more sense, but for some reason my output is splitting words. I am using oobabooga text-generation-webui with its OpenAI-like API and I get a big scramble.
Here is an example of my output.
```
The sun sets , ill umin ating the sky with vibr ant h ues before darkness prev ails . Output generated in 1.42 seconds (14.08 tokens/s, 20 tokens, context 15, seed 43474783)
```
And here is the backend API code integrating with RealtimeTTS:
```python
class BufferStream:
    def __init__(self):
        self.items = queue.Queue()
        self._stop_event = threading.Event()

    def add(self, item: str) -> None:
        self.items.put(item)

    def stop(self):
        self._stop_event.set()

    def gen(self):
        while not self._stop_event.is_set():
            try:
                yield self.items.get(timeout=0.1)
            except queue.Empty:
                continue


bufferz = BufferStream()
engine = CoquiEngine()
stream = TextToAudioStream(engine)


def start_audio():
    print('start audio fund')
    # stream.play_async(muted=True, minimum_sentence_length=3, minimum_first_fragment_length=3, log_synthesized_text=True)
    stream.play_async(muted=True)
    print('played')
    # while stream.is_playing():
    #     time.sleep(0.1)


@app.post('/v1/chat/completions_tts', response_model=ChatCompletionResponse, dependencies=check_key)
async def openai_chat_completions_tts(request: Request, request_data: ChatCompletionRequest, background_tasks: BackgroundTasks):
    path = request.url.path
    is_legacy = "/generate" in path

    if request_data.stream:
        responszzze = []
        response_queue = asyncio.Queue()
        first_feed_done = False

        async def generator():
            nonlocal first_feed_done
            async with streaming_semaphore:
                response = OAIcompletions.stream_chat_completions(to_dict(request_data), is_legacy=is_legacy)
                for resp in response:
                    disconnected = await request.is_disconnected()
                    if disconnected:
                        break
                    if resp['choices'][0]['message']['role'] == "assistant":
                        content = resp['choices'][0]['message']['content']
                        print(content)
                        bufferz.add(content)
                        # stream.feed(content, request_data.tts_id)
                        if not first_feed_done:
                            stream.feed(bufferz.gen(), request_data.tts_id)
                            loop = asyncio.get_running_loop()
                            await loop.run_in_executor(None, start_audio)
                            first_feed_done = True
                            print('Tried to fire play')
                    yield {"data": json.dumps(resp)}

        return EventSourceResponse(generator())  # SSE streaming
```
I think it has something to do with how BufferStream is handling commas and periods.
I don't know ooba that well, but if its output really loses the word spaces somehow, I think you should focus on finding out why that happens rather than trying to put the words back together afterwards.
It does keep the spaces. I was able to get parts output clearly using BufferStream; the problem is how it queues things. I think it has a problem with commas and periods.
I added

```python
    def clear(self):
        while not self.items.empty():
            self.items.get()
            self.items.task_done()
```

and I run bufferz.clear() before each API call,
but my text still gets scrambled up with previous responses when using BufferStream class
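One likely cause (a sketch, untested against the full app): the module-level `bufferz` and `stream` are shared across requests, so tokens left over from one response can leak into the next. Creating a fresh buffer per request avoids that; the `gen()` below is also adjusted so that after `stop()` it keeps draining until the queue is empty instead of exiting immediately.

```python
import queue
import threading


class BufferStream:
    def __init__(self):
        self.items = queue.Queue()
        self._stop_event = threading.Event()

    def add(self, item: str) -> None:
        self.items.put(item)

    def stop(self):
        self._stop_event.set()

    def gen(self):
        # Keep yielding until we are both stopped AND drained, so tokens
        # queued before stop() are still played out.
        while not (self._stop_event.is_set() and self.items.empty()):
            try:
                yield self.items.get(timeout=0.1)
            except queue.Empty:
                continue


def handle_request(tokens):
    buffer = BufferStream()   # fresh buffer per request: no stale tokens
    for token in tokens:
        buffer.add(token)
    buffer.stop()             # signal: no more tokens for this request
    return list(buffer.gen()) # drain only what this request produced


print(handle_request(["Hello ", "World."]))  # ['Hello ', 'World.']
```

In the real endpoint you would construct the BufferStream inside `openai_chat_completions_tts` and pass its `gen()` to `stream.feed()` there, rather than reusing one global instance.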
```python
class BufferStream:
    def __init__(self):
        self.items = queue.Queue()
        self._stop_event = threading.Event()

    def add(self, item: str) -> None:
        # Filter out or modify unwanted content
        if self.should_filter(item):
            return  # Skip adding this item
        self.items.put(item)

    def should_filter(self, item: str) -> bool:
        # Define your filtering logic here
        # For example, skip if item is just a comma, period, or quotes
        return item in {',', '.', '"', "'"}

    def stop(self):
        self._stop_event.set()

    def gen(self):
        while not self._stop_event.is_set():
            try:
                yield self.items.get(timeout=0.1)
            except queue.Empty:
                continue

    def clear(self):
        while not self.items.empty():
            self.items.get()
            self.items.task_done()
```
I added the above to handle commas, periods, and quotes. It works OK, but I don't think this would be the best solution for production.
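As an alternative to dropping punctuation tokens outright (which removes the sentence boundaries the TTS needs for prosody), here is a sketch (the names are mine, not part of RealtimeTTS) that glues a stand-alone punctuation token onto the previously buffered chunk instead:

```python
PUNCTUATION = {',', '.', '!', '?', '"', "'"}


def coalesce(tokens):
    """Merge stand-alone punctuation tokens into the preceding chunk
    instead of discarding them, so sentence boundaries survive."""
    out = []
    for tok in tokens:
        if tok in PUNCTUATION and out:
            out[-1] += tok   # "sets" + "," -> "sets,"
        else:
            out.append(tok)
    return out


print(coalesce(["The ", "sun ", "sets", ",", " slowly", "."]))
# ['The ', 'sun ', 'sets,', ' slowly.']
```

In a streaming setting the same check could live inside `BufferStream.add()` by remembering the last queued chunk, rather than batching tokens up front.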