RealtimeTTS

Unclear output with the CoquiEngine when using short input feed

Open yuvalBGU1 opened this issue 1 year ago • 10 comments

Hi, for my use case I am feeding the engine the sentence word by word. Using the SystemEngine I got a somewhat coherent sentence (the words were clear, but the sentence was too fast), but with the CoquiEngine the words became very unclear and I experienced pauses. I tried raising buffer_threshold_seconds=7, but with no apparent improvement. Any suggestions on how I can improve the output? When feeding the engine complete sentences I got pretty good results. I am also using voice cloning, but the phenomenon persists with the default voice too. Thank you!

yuvalBGU1 avatar Dec 31 '23 07:12 yuvalBGU1

Sounds like you are feeding word by word into a playing stream, something RealtimeTTS currently can't handle well. It can't safely determine sentence boundaries in that case, so it breaks the synthesis up into tiny fragments as they come in, which leads to performance and quality issues.

My suggestion would be to encapsulate the incoming words in a generator. That lets RealtimeTTS know that more words may arrive soon, so it does not synthesize immediately but waits for sentence boundaries:


import queue
import threading

class BufferStream:
    """Queue-backed generator that lets RealtimeTTS wait for more input."""

    def __init__(self):
        self.items = queue.Queue()
        self._stop_event = threading.Event()

    def add(self, item: str) -> None:
        self.items.put(item)

    def stop(self):
        # Signal gen() to finish once the text stream is complete.
        self._stop_event.set()

    def gen(self):
        # Yield items as they arrive; poll so stop() can end the stream.
        while not self._stop_event.is_set():
            try:
                yield self.items.get(timeout=0.1)
            except queue.Empty:
                continue

Add words to an instance of BufferStream and feed the gen method to RealtimeTTS.

(Sorry, I know it's more complicated than it should be; I need to somehow integrate this into RealtimeTTS.)

KoljaB avatar Dec 31 '23 11:12 KoljaB

So would it be

class BufferStream: ...

buffer = BufferStream()
engine = CoquiEngine()
stream = TextToAudioStream(engine)

stream.feed(buffer.gen(content))

mercuryyy avatar Jan 01 '24 07:01 mercuryyy

More like this (haven't tested, but it should work):

import time

buffer = BufferStream()
engine = CoquiEngine()
stream = TextToAudioStream(engine)

buffer.add("Hello ")
buffer.add("World.")

stream.feed(buffer.gen())
stream.play_async()

buffer.add("More ")
buffer.add("words ")
buffer.add("to play.")

while stream.is_playing():
    time.sleep(0.1)

KoljaB avatar Jan 01 '24 11:01 KoljaB

Thanks, it does work, but I'm having an issue with Coqui where the OpenAI API stream will sometimes split words:

t he house been fore closed last year.

"The house been foreclosed last year"

What can we do to only feed full words into the stream?

mercuryyy avatar Jan 01 '24 14:01 mercuryyy

The OpenAI text stream (like most LLM streams) comes in as tokens, which can be only parts of words. RealtimeTTS can work with that: it recognizes word boundaries because whole words arriving from the LLM carry a blank space, while the parts of a split word do not. So if you just feed the tokens to RealtimeTTS as they come in, you're good. (That's also my own use case for the library: I feed LLM streams either directly as a stream, or through a buffer like the one shown above when I want to do processing first.)
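To illustrate the point about spacing, here is a hypothetical token stream (not actual API output): the space belongs to the token, and split words carry no space between their parts, so plain concatenation restores the original text.

```python
# Hypothetical LLM token stream: whole words arrive with a leading space,
# while a split word ("fore" + "closed") has no space between its parts.
tokens = ["The", " house", " was", " fore", "closed", " last", " year", "."]

# Plain concatenation reassembles the text, which is why feeding raw
# tokens into RealtimeTTS works: the spacing information is preserved.
text = "".join(tokens)
print(text)  # -> The house was foreclosed last year.
```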

KoljaB avatar Jan 01 '24 15:01 KoljaB

Yeah, that would make more sense, but for some reason my output is splitting words. I am using oobabooga text-generation-webui with its OpenAI-like API, and I get a big scramble.

Here is an example of my output.

The sun sets , ill umin ating the sky with vibr ant h ues before darkness prev ails . Output generated in 1.42 seconds (14.08 tokens/s, 20 tokens, context 15, seed 43474783)

And here is the backend API code integrating with RealtimeTTS:

class BufferStream:
    def __init__(self):
        self.items = queue.Queue()
        self._stop_event = threading.Event()

    def add(self, item: str) -> None:
        self.items.put(item)

    def stop(self):
        self._stop_event.set()

    def gen(self):
        while not self._stop_event.is_set():
            try:
                yield self.items.get(timeout=0.1)
            except queue.Empty:
                continue

bufferz = BufferStream()
engine = CoquiEngine()
stream = TextToAudioStream(engine)

def start_audio():
    print('start audio fund')
    # stream.play_async(muted=True, minimum_sentence_length = 3, minimum_first_fragment_length = 3,log_synthesized_text=True)
    stream.play_async(muted=True)
    print('played')
    # while stream.is_playing():
    #    time.sleep(0.1)

@app.post('/v1/chat/completions_tts', response_model=ChatCompletionResponse, dependencies=check_key)
async def openai_chat_completions_tts(request: Request, request_data: ChatCompletionRequest, background_tasks: BackgroundTasks):
    path = request.url.path
    is_legacy = "/generate" in path

    if request_data.stream:
        responszzze = []
        response_queue = asyncio.Queue()
        first_feed_done = False
        async def generator():
            nonlocal first_feed_done
            async with streaming_semaphore:
                response = OAIcompletions.stream_chat_completions(to_dict(request_data), is_legacy=is_legacy)
                for resp in response:
                    disconnected = await request.is_disconnected()
                    if disconnected:
                        break
                    if resp['choices'][0]['message']['role'] == "assistant":
                        content = resp['choices'][0]['message']['content']
                        print(content)
                        bufferz.add(content)
                        # stream.feed(content,request_data.tts_id)
                        if not first_feed_done:
                            stream.feed(bufferz.gen(),request_data.tts_id)
                            loop = asyncio.get_running_loop()
                            await loop.run_in_executor(None, start_audio)
                            first_feed_done = True
                            print('Tried to fire play')
                    yield {"data": json.dumps(resp)}

        return EventSourceResponse(generator())  # SSE streaming

mercuryyy avatar Jan 01 '24 16:01 mercuryyy

I think it has something to do with how BufferStream is handling commas and periods.

mercuryyy avatar Jan 01 '24 16:01 mercuryyy

I don't know ooba that well, but if its output really loses the word spaces somehow, I think you should focus on finding out why that happens, not on trying to put the words back together again afterwards.

KoljaB avatar Jan 01 '24 16:01 KoljaB

It does keep the spaces; I was able to get parts output clearly using BufferStream. The problem is how it queues items: I think it has a problem with commas and periods.

I added

def clear(self):
    while not self.items.empty():
        self.items.get()
        self.items.task_done()

and I run

bufferz.clear()

before each API call, but my text still gets scrambled with previous responses when using the BufferStream class.
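One likely cause of the scrambling (a sketch, assuming the generator from the previous request is still being consumed by the playing stream): clearing the shared queue doesn't stop the old gen() loop. An alternative is to retire the old buffer with stop() and create a fresh BufferStream per request.

```python
import queue
import threading

class BufferStream:
    # Same class as shown earlier in the thread.
    def __init__(self):
        self.items = queue.Queue()
        self._stop_event = threading.Event()

    def add(self, item: str) -> None:
        self.items.put(item)

    def stop(self):
        self._stop_event.set()

    def gen(self):
        while not self._stop_event.is_set():
            try:
                yield self.items.get(timeout=0.1)
            except queue.Empty:
                continue

# Instead of clearing the shared buffer, retire it and start a new one per
# request: stop() makes the old generator exit, so stale items can never
# leak into the next response.
old = BufferStream()
old.add("stale text")
old.stop()                      # old.gen() now exits without yielding

fresh = BufferStream()          # new buffer for the new request
fresh.add("new text")
```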

mercuryyy avatar Jan 01 '24 16:01 mercuryyy

class BufferStream:
    def __init__(self):
        self.items = queue.Queue()
        self._stop_event = threading.Event()

    def add(self, item: str) -> None:
        # Filter out or modify unwanted content
        if self.should_filter(item):
            return  # Skip adding this item
        self.items.put(item)

    def should_filter(self, item: str) -> bool:
        # Define your filtering logic here
        # For example, skip if item is just a comma, period, or quotes
        if item in {',', '.', '"', "'"}:
            return True
        return False

    def stop(self):
        self._stop_event.set()

    def gen(self):
        while not self._stop_event.is_set():
            try:
                yield self.items.get(timeout=0.1)
            except queue.Empty:
                continue

    def clear(self):
        while not self.items.empty():
            self.items.get()
            self.items.task_done()

I added the above to handle commas, periods, and quotes. It works OK, but I don't think this would be the best solution for production.
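A possibly more production-friendly alternative (an untested sketch, with a hypothetical class name): instead of dropping punctuation-only tokens, hold the most recent token back and merge the punctuation into it, so sentence boundaries still reach RealtimeTTS.

```python
import queue

class MergingBuffer:
    """Sketch: glue punctuation-only tokens onto the previous token
    instead of discarding them, so '.', ',' etc. still reach the TTS
    stream and sentence boundaries are preserved."""

    PUNCT = {',', '.', '"', "'", '!', '?'}

    def __init__(self):
        self.items = queue.Queue()
        self._pending = None  # most recent token, held back for merging

    def add(self, item: str) -> None:
        if self._pending is not None and item.strip() in self.PUNCT:
            self._pending += item       # merge "," onto the previous token
            return
        if self._pending is not None:
            self.items.put(self._pending)
        self._pending = item

    def flush(self) -> None:
        # Call once the LLM stream ends so the held-back token is emitted.
        if self._pending is not None:
            self.items.put(self._pending)
            self._pending = None
```

With tokens "Hello", ",", " world", "." this queues "Hello," and " world." rather than silently dropping the punctuation.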

mercuryyy avatar Jan 01 '24 16:01 mercuryyy