MiniCPM-V Inquiry About Latency of streaming generate with audio in MiniCPMo Web Demo

Thank you for your excellent work on MiniCPMo and its integration with ChatTTS! I have been exploring the web demo and am very impressed by its capabilities.

I would like to ask about the latency of ChatTTS in the web demo. I noticed that in the streaming_generate function, setting generate_audio=False results in ~0.16 sec latency per chunk received. However, enabling generate_audio=True increases the latency significantly to ~0.6-1.0 sec. Is there a reason for this slow performance, and are there any optimizations to improve it?

Thank you for your time, and I look forward to your response!

Jan 23 '25 03:01 tienanh28122000

Hi! Thank you for using MiniCPM-o 2.6. I checked the code on Hugging Face. If I understand your question correctly, generate_audio=True means audio is generated, and generate_audio=False means it is not. The difference arises because the TTS has to generate the first audio chunk from the first text chunk.

Maybe this line could help: https://huggingface.co/openbmb/MiniCPM-o-2_6/blob/4a25f999c53b8a51b5ef4ccca45c8a5e59d06a7e/modeling_minicpmo.py#L1216

Jan 23 '25 08:01 bokesyo

hey @bokesyo, thanks for always being helpful. can you suggest some ways to bring down the latency when generate_audio=True? i'm building a real-time speech-to-speech app with webrtc and the 0.6-1.0 sec latency with generate_audio=True is too slow for my needs. every response contains roughly 12800 audio samples, which is 533.33 ms at 24 kHz, so if the generation delay is longer than that, it causes jitter.

any tips to make it faster? maybe a different tts model or some parameter tweaks? or are there bottlenecks in the implementation i should know about?

really need to get this working with lower latency for my use case.

Feb 25 '25 10:02 biraj-outspeed

Hello biraj! Thank you for your feedback, which is very important.

Audio generation latency involves two concepts:

Initial latency: the delay between the end of the user's question and the first audio output chunk. It is around 1.5s.
Real-time factor: once audio generation has started, how much time it takes to generate 1s of output audio. It is around 0.6s on A100 and 4090 (which also means it takes 0.3s to generate 0.5s of output audio). As long as generating 1s of output audio takes less than 1s, the audio output will never pause.
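
To make the two numbers concrete, here is a minimal arithmetic sketch in Python. It only restates the figures from this thread (12800 samples per chunk at 24 kHz, a real-time factor of about 0.6); the function names are illustrative and are not part of the MiniCPM-o code.

```python
SAMPLE_RATE_HZ = 24_000  # output sample rate mentioned in this thread


def chunk_duration_s(num_samples: int, sample_rate_hz: int = SAMPLE_RATE_HZ) -> float:
    """Playback duration of one audio chunk, in seconds."""
    return num_samples / sample_rate_hz


def real_time_factor(generation_time_s: float, audio_duration_s: float) -> float:
    """Seconds of compute spent per second of generated audio."""
    return generation_time_s / audio_duration_s


def keeps_up(generation_time_s: float, audio_duration_s: float) -> bool:
    """True if the chunk is generated faster than it takes to play back."""
    return real_time_factor(generation_time_s, audio_duration_s) < 1.0


if __name__ == "__main__":
    duration = chunk_duration_s(12_800)  # about 0.533 s
    for gen_time in (0.3, 0.6, 1.0):
        rtf = real_time_factor(gen_time, duration)
        print(f"gen={gen_time:.2f}s rtf={rtf:.2f} keeps_up={keeps_up(gen_time, duration)}")
```

With these numbers, a chunk that takes 0.3s to generate stays well under real time, while the reported 0.6-1.0s per chunk gives a real-time factor above 1, which is consistent with the jitter described above.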

So, do you mean that every audio output chunk (533.33ms of audio) takes 0.6-1.0s to generate? If you are using a 4090 or A100, it should take only about 0.3s to generate 0.5s of output audio. Could you tell me which device you are using, so I can investigate what is happening? Thank you!
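
To report comparable numbers, one option is to time each chunk as it arrives and compare it with the chunk's playback duration. The sketch below is a hedged example: `fake_stream` is a hypothetical stand-in for whatever iterator yields the audio chunks in your app, not the actual MiniCPM-o API, and it assumes each chunk is a sequence of samples at 24 kHz.

```python
import time
from typing import Iterable, Iterator, List, Sequence, Tuple


def timed_chunks(
    chunks: Iterable[Sequence[float]], sample_rate_hz: int = 24_000
) -> Iterator[Tuple[float, float]]:
    """Yield (generation_time_s, audio_duration_s) for every chunk."""
    it = iter(chunks)
    while True:
        start = time.perf_counter()
        try:
            samples = next(it)  # blocks until the next chunk is produced
        except StopIteration:
            return
        elapsed = time.perf_counter() - start
        yield elapsed, len(samples) / sample_rate_hz


def fake_stream(
    n_chunks: int = 3, samples_per_chunk: int = 12_800, delay_s: float = 0.01
) -> Iterator[List[float]]:
    """Hypothetical stand-in for the real audio stream (assumption)."""
    for _ in range(n_chunks):
        time.sleep(delay_s)  # simulates generation time
        yield [0.0] * samples_per_chunk


if __name__ == "__main__":
    for i, (gen_s, dur_s) in enumerate(timed_chunks(fake_stream())):
        print(f"chunk {i}: generated in {gen_s:.3f}s for {dur_s:.3f}s of audio "
              f"(rtf={gen_s / dur_s:.2f})")
```

Replacing `fake_stream()` with the real chunk iterator would show whether every chunk is slow or only the first one, which helps separate initial latency from the real-time factor.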

Feb 26 '25 04:02 bokesyo