
I have a TTS implementation that's rather slow

Open jesulo opened this issue 4 months ago • 11 comments

I have a TTS implementation that's rather slow, so the voice comes out noisy and choppy. Can I create a buffer? Or how can I modify put_audio_frame so the queue doesn't fill up with chunks of silence? Regards.

jesulo avatar Aug 11 '25 19:08 jesulo

From the perspective of real-time digital humans, if your TTS synthesis is slow, shouldn't you address the TTS problem itself? Instead of adding buffering, perhaps you could cache the TTS output as it is generated and only stream it to the digital human model for rendering once generation is complete.
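
As an illustration of that idea, here is a minimal sketch (not LiveTalking's actual code; speak_buffered and tts_synthesize are hypothetical names, and the 16 kHz sample rate / 320-sample chunk size are assumptions) of caching the whole utterance before handing it to put_audio_frame:

import numpy as np

SAMPLE_RATE = 16000          # assumed sample rate of the pipeline
CHUNK = 320                  # assumed 20 ms per frame at 16 kHz

def speak_buffered(real, text, tts_synthesize):
    # `real` is assumed to expose put_audio_frame(chunk, eventpoint=None);
    # `tts_synthesize(text)` is a placeholder for the slow TTS call and is
    # assumed to return a float32 mono waveform at SAMPLE_RATE.
    audio = tts_synthesize(text)                      # do the slow work up front
    pad = (-len(audio)) % CHUNK                       # pad to a whole number of frames
    if pad:
        audio = np.concatenate([audio, np.zeros(pad, dtype=np.float32)])
    for i in range(0, len(audio), CHUNK):             # only now feed the avatar
        real.put_audio_frame(audio[i:i + CHUNK])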

heyyyyou avatar Aug 12 '25 07:08 heyyyyou

From the perspective of real-time digital humans, if your TTS synthesis is slow, shouldn't you address the TTS problem itself? Instead of adding buffering, perhaps you could cache the TTS output as it is generated and only stream it to the digital human model for rendering once generation is complete.

The TTS runs at about a 1.7x real-time factor, so it generates almost double the audio needed, but because the system consumes the chunks in less than 10 ms and inserts chunks of silence when none are available, playback is still affected.

jesulo avatar Aug 12 '25 09:08 jesulo

@heyyyyou Any ideas on how to modify LipASR to wait while a TTS stream is running, instead of carrying on with chunks of silence? Regards.

jesulo avatar Aug 12 '25 20:08 jesulo

@heyyyyou Any ideas on how to modify LipASR to wait while a TTS stream is running, instead of carrying on with chunks of silence? Regards.

Through unified state management and timing control, you would try to solve both the ASR waiting problem and the mismatch in audio generation speed while TTS is running, with timing compensation performed at the ASR processing level rather than in the later playback stage. That is roughly how I understand it. Sorry, I have not run into this problem myself and do not have a concrete solution.

heyyyyou avatar Aug 13 '25 08:08 heyyyyou

You can't solve the silence problem by blocking and waiting, because ASR must keep consuming frames continuously to keep the whole real-time processing pipeline smooth. Blocking in LipASR's run_step() method to wait for real audio chunks will freeze the entire audio processing pipeline and destroy the continuity of avatar rendering. In lipasr.py:31-38, are you blocking the ASR main processing loop by trying to skip silent frames (type=1) and waiting only for real audio (type=0)? That can cause several issues:

Blocking downstream queues: when get_audio_frame() in baseasr.py:56-70 is blocked, the entire audio processing chain stalls.

Destroying rendering synchronization: the main rendering loop in lipreal.py:241-242 relies on continuous ASR output; blocking it can make avatar rendering choppy.

Queue backlog: your log shows sleep qsize=19, indicating that the queue management mechanism is triggering buffer control because ASR is blocked.
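
For context, the non-blocking behaviour described here follows a common pattern that can be sketched roughly like this (a sketch only, not the real baseasr.py code; self.queue and self.chunk are assumed attributes):

import queue
import numpy as np

def get_audio_frame(self):
    # Sketch of the pattern described above, not the actual baseasr.py code:
    # if no TTS audio is queued, return a silent frame (type=1) instead of
    # blocking, so downstream consumers never stall.
    try:
        frame, eventpoint = self.queue.get(block=True, timeout=0.01)
        audio_type = 0                                   # real speech
    except queue.Empty:
        frame = np.zeros(self.chunk, dtype=np.float32)   # filler silence
        audio_type = 1                                   # silent frame
        eventpoint = None
    return frame, audio_type, eventpoint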

I believe selective processing should be used instead of blocking. In the run_step() method of lipasr.py (lipasr.py:31-64), adopt a "consistently consume, selectively process" strategy. A few thoughts:

Keep the queue flowing: always consume batch_size*2 audio frames to ensure a continuous flow into the output_queue.

Smart feature extraction: only calculate the mel spectrogram when real speech (type=0) is detected.

State-aware processing: detect the TTS running state via self.parent.speaking and skip feature extraction while speaking.

Perhaps you could try using Cursor, or ask Claude for help understanding the code and working out a solution.

heyyyyou avatar Aug 14 '25 02:08 heyyyyou

I did it this way, but it still doesn't work. I consulted AI assistants, but they don't understand the flow and weren't helpful.

def run_step(self):
    ############################################## extract audio feature ##############################################
    # get a frame of audio
    frames_to_process_for_model_count = 0
    target_frames_for_model = self.batch_size * 2

    while frames_to_process_for_model_count < target_frames_for_model:
        frame, audio_type, eventpoint = self.get_audio_frame()
        
        if audio_type != 1:
            self.frames.append(frame)
            self.output_queue.put((frame, audio_type, eventpoint))
            frames_to_process_for_model_count += 1
    # context not enough, do not run network.
    if len(self.frames) <= self.stride_left_size + self.stride_right_size:
        return
     
    inputs = np.concatenate(self.frames) # [N * chunk]
    mel = audio.melspectrogram(inputs)
    #print(mel.shape[0],mel.shape,len(mel[0]),len(self.frames))
    # cut off stride

jesulo avatar Aug 14 '25 11:08 jesulo

The while loop only increments the counter when audio_type != 1, so on consecutive silent frames the counter never advances and the loop spins indefinitely, violating the fundamental design of the LiveTalking audio processing pipeline.

In the system design, get_audio_frame() generates silent frames (type=1) when the queue is empty to maintain processing continuity. Try the following:

Always consume a fixed number of frames: Maintain a frame consumption cadence of batch_size * 2

Selective processing: Only add real audio frames to self.frames for feature extraction

Keep the queue flowing: All frames must be placed in output_queue

def run_step(self):
    has_real_audio = False
    # Always consume a fixed number of frames, regardless of type
    for _ in range(self.batch_size * 2):
        frame, audio_type, eventpoint = self.get_audio_frame()
        self.output_queue.put((frame, audio_type, eventpoint))
        
        # Only non-silent frames and non-TTS playback are added to the processing queue
        if audio_type != 1 and not (self.parent and self.parent.speaking):
            self.frames.append(frame)
            has_real_audio = True
    
    # Only perform feature extraction when there is real audio
    if has_real_audio and len(self.frames) > self.stride_left_size + self.stride_right_size:
        inputs = np.concatenate(self.frames)
        mel = audio.melspectrogram(inputs)
        # Continue with the original mel processing logic...

Also, I think it might be helpful to open this project in Cursor, let it analyze the project in depth, and ask it questions.

heyyyyou avatar Aug 15 '25 02:08 heyyyyou

I'll try that method. Do you have a contact email?

jesulo avatar Aug 15 '25 10:08 jesulo

has_real_audio = False
# Always consume a fixed number of frames, regardless of type
for _ in range(self.batch_size * 2):
    frame, audio_type, eventpoint = self.get_audio_frame()
    self.output_queue.put((frame, audio_type, eventpoint))

    # Only non-silent frames and non-TTS playback are added to the processing queue
    if audio_type != 1 and not (self.parent and self.parent.speaking):
        self.frames.append(frame)
        has_real_audio = True

# Only perform feature extraction when there is real audio
if has_real_audio and len(self.frames) > self.stride_left_size + self.stride_right_size:
    inputs = np.concatenate(self.frames)
    mel = audio.melspectrogram(inputs)

With this code, the avatar doesn't speak and doesn't produce lip movements. I've already tried Cursor, but even it doesn't know how to create a buffer or write code so that the audio plays continuously without the whole system freezing; the code Cursor generated didn't work either.
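
One thing that might still be worth trying for the original buffering question is a small pre-buffer in front of put_audio_frame, sketched below (PrebufferedFeeder, MIN_BUFFER_SECS, and the 16 kHz / 320-sample chunk size are illustrative assumptions, not LiveTalking APIs): hold back a little audio before starting playback so that short TTS stalls are absorbed by the backlog instead of showing up as inserted silence.

from collections import deque

SAMPLE_RATE = 16000              # assumed
CHUNK = 320                      # assumed 20 ms frames at 16 kHz
MIN_BUFFER_SECS = 0.5            # illustrative amount of audio to hold back

class PrebufferedFeeder:
    # Hypothetical helper, not part of LiveTalking: hold TTS chunks until a
    # minimum backlog has accumulated, then release them to put_audio_frame.
    def __init__(self, real):
        self.real = real                      # object exposing put_audio_frame
        self.pending = deque()
        self.started = False

    def feed(self, chunk):
        self.pending.append(chunk)
        buffered_secs = len(self.pending) * CHUNK / SAMPLE_RATE
        if not self.started and buffered_secs >= MIN_BUFFER_SECS:
            self.started = True               # enough headroom, start draining
        if self.started:
            while self.pending:
                self.real.put_audio_frame(self.pending.popleft())

    def flush(self):
        # push whatever remains at the end of an utterance
        while self.pending:
            self.real.put_audio_frame(self.pending.popleft())
        self.started = False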

jesulo avatar Aug 16 '25 16:08 jesulo

Do you know DeepWiki? Maybe it can help you. I don't have any other ideas for you.

heyyyyou avatar Aug 17 '25 08:08 heyyyyou

I am able to generate NRT (near real-time) output. The TTS produces a 7x real-time factor plus streaming, but the threads inside the app are still slowing things down.

I am testing here -> https://akawhipsrv.azurewebsites.net/index.html

https://github.com/lipku/LiveTalking/issues/527#issuecomment-3226544921

karayakar avatar Aug 28 '25 00:08 karayakar