optimize audio generation latency for real-time WebRTC applications

Open · biraj-outspeed opened this issue 10 months ago • 7 comments

I went through this issue but couldn't find many answers.

Are there ways to bring down the latency when generate_audio=True? I'm building a real-time speech-to-speech app with WebRTC, and the 0.6-1.0 s latency with generate_audio=True is too slow for my needs. Every response contains roughly 12,800 audio samples (533.33 ms at 24 kHz), so if the generation latency exceeds that, playback jitters.

Any tips to make it faster? Maybe a different TTS model or some parameter tweaks? Or are there bottlenecks in the implementation I should know about?

I really need to get this working with lower latency for my use case.
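
To make the constraint concrete, here is a quick sanity check using only the numbers above; the 0.8 s value is just an illustrative point inside the reported 0.6-1.0 s range:

```python
# Sanity check of the real-time budget: 12,800 samples per chunk at 24 kHz.
SAMPLE_RATE_HZ = 24_000
CHUNK_SAMPLES = 12_800

chunk_duration_s = CHUNK_SAMPLES / SAMPLE_RATE_HZ   # ~0.533 s of audio per chunk
generation_latency_s = 0.8                          # illustrative value from the 0.6-1.0 s range

# If generating a chunk takes longer than the chunk lasts on playback,
# the WebRTC playout buffer drains and the listener hears gaps/jitter.
realtime_factor = generation_latency_s / chunk_duration_s
print(f"chunk = {chunk_duration_s * 1000:.1f} ms, RTF = {realtime_factor:.2f}",
      "-> jitter" if realtime_factor > 1.0 else "-> keeps up")
```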

biraj-outspeed avatar Feb 25 '25 10:02 biraj-outspeed

Hello biraj! Thank you for your feedback, which is very important.

The audio generation latency has two components:

  • Initial latency: the delay between the end of the user's question and the first audio output chunk. It is around 1.5 s.
  • Real-time factor: once audio has started generating, how much time it takes to produce 1 s of output audio. It is around 0.6 s on an A100 or a 4090 (i.e., about 0.3 s to generate 0.5 s of output audio). As long as generating 1 s of output audio takes less than 1 s, the audio output never has to pause.

So, do you mean that every audio output chunk (533.33 ms of audio) takes 0.6-1.0 s to generate? If you are using a 4090 or an A100, it should only take about 0.3 s to generate 0.5 s of output audio. Could you tell me which device you are using, so we can investigate what is happening? Thank you!
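
One way to check which case you are hitting is to time the generation call directly. A minimal sketch, assuming 24 kHz output and using `generate_audio_chunk` as a stand-in for whatever call in your pipeline produces the next chunk of samples (it is not the exact MiniCPM-o API):

```python
import time

SAMPLE_RATE_HZ = 24_000  # assumed output sample rate

def measure_realtime_factor(generate_audio_chunk):
    """Report seconds of compute spent per second of generated audio.

    `generate_audio_chunk` is a placeholder: any callable that returns the
    next chunk of output samples (e.g. a numpy array of PCM samples).
    """
    start = time.perf_counter()
    samples = generate_audio_chunk()
    elapsed = time.perf_counter() - start

    audio_seconds = len(samples) / SAMPLE_RATE_HZ
    rtf = elapsed / audio_seconds
    print(f"{elapsed:.3f} s of compute for {audio_seconds:.3f} s of audio (RTF = {rtf:.2f})")
    return rtf
```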

bokesyo avatar Feb 26 '25 04:02 bokesyo

Can I ask about your use case? Our understanding is that Omni's output only needs to be generated faster than it is spoken to meet most needs. Put simply, if generating one second of speech takes less than one second (including overheads such as the web layer), playback stays smooth. Please share more data or details so that we can help you solve the problem.

tc-mb avatar Feb 26 '25 06:02 tc-mb

@tc-mb @bokesyo Yes, we see it takes 0.3s to generate 0.5s of audio on H100 (not A100).

Any ideas on how we can reduce the latency between user input finish and first audio output?

janak2 avatar Feb 26 '25 07:02 janak2

The latency you are asking about can be reduced through acceleration, but our current latency is around 2.5-3 s, which should be similar to other models. Why do you need it to respond faster, and how fast would you like it to be?

tc-mb avatar Feb 26 '25 07:02 tc-mb

@tc-mb GPT-4o voice has a TTFB of 300ms and Moshi has a TTFB of 600ms. 2-3s for TTFB is too high for a natural conversation. Can you please explain what you mean by acceleration?

janak2 avatar Feb 26 '25 19:02 janak2

We have actually compared similar products, and they are basically at 2-3 seconds. About your numbers: are the times you quote measured from testing those products? In our setup, the silence window used to decide that the user has finished speaking is set to 0.8-1 second; making it much shorter would easily be confused with a normal pause in the user's speech. Our open-source release should be close to the product. The advantage of this is that users can deploy and use it directly, without depending on additional wrappers. If you want to splice its modules onto other models yourself, you can refer to our code and take it apart.

tc-mb avatar Feb 27 '25 03:02 tc-mb

> @tc-mb @bokesyo Yes, we see it takes 0.3s to generate 0.5s of audio on H100 (not A100).
>
> Any ideas on how we can reduce the latency between user input finish and first audio output?

@janak2 Yes, here are some suggestions:

  1. Check the datatype: running the TTS in bf16 may speed it up. Did you use tts.float()? You can wrap the model with torch.autocast to see if it is faster (see the sketch after this list).
  2. Check this issue: https://github.com/OpenBMB/MiniCPM-o/issues/845 (you may need a translator). The author made an improvement that halves the time to first response: we implemented a merge between two audio chunks, so the first audio chunk is not returned until the second audio chunk has finished generating, which is not a good practice. They changed that logic so the initial response time is cut in half.
  3. We use a VAD module with a 500 ms threshold; you can reduce it to 200 ms to further lower the time to first response. In this case, the model may sometimes respond before the user has finished asking.
  4. Have the TTS decode fewer audio tokens when it returns the first audio chunk. Currently we use 25 tokens (~500 ms of audio), but you can reduce this to 12. In that case, the first audio output may have lower quality, but it will arrive faster.
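
For suggestion 1, a minimal sketch of the dtype/autocast check, assuming the TTS sub-module is reachable as `model.tts` and exposes a `generate` call (both names are placeholders; adapt them to your setup):

```python
import torch

# If the TTS was cast with tts.float() (fp32), it is usually slower than bf16.
tts = model.tts.to(torch.bfloat16)  # `model.tts` is a placeholder attribute

with torch.inference_mode():
    # autocast runs the matmuls/convolutions inside the block in bf16 on CUDA,
    # which is typically faster on A100/H100/4090-class GPUs.
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        audio_chunk = tts.generate(...)  # placeholder for the actual TTS call
```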

bokesyo avatar Feb 27 '25 08:02 bokesyo