How can we get the position of text in the generated audio?

Open maifeeulasad opened this issue 3 weeks ago • 2 comments

It's really cool that we can now generate audio in real time with microsoft/VibeVoice-Realtime-0.5B. I was thinking about integrating it into my application, and I ran into a critical UX requirement: it would be great if we could highlight the text that corresponds to the audio currently playing.

Does VibeVoice support this?

Dec 08 '25 16:12 maifeeulasad

Thank you for your interest. Currently, the model cannot provide alignment information between generated speech and text.

Dec 09 '25 01:12 wenhui0924

Okay!

If you are willing to work on this, I would be happy to help. Please let me know.

Good day!

Dec 09 '25 03:12 maifeeulasad
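
For anyone who needs highlighting before the model exposes alignment, a rough approximation can be computed outside the model. The sketch below is not a VibeVoice feature and does not call any VibeVoice API; it assumes you already know the total duration of the generated audio and simply spreads that duration across the words in proportion to their character counts. The function names `estimate_word_timings` and `word_index_at` are hypothetical, and the timings are only estimates: pauses, punctuation, and speaking-rate changes are ignored.

```python
from bisect import bisect_right


def estimate_word_timings(text, audio_duration_s):
    """Estimate (word, start_s, end_s) for each word in `text`.

    Assumption: speaking time is proportional to character count.
    This is a heuristic, not alignment data from the model.
    """
    words = text.split()
    total_chars = sum(len(w) for w in words)
    if not words or total_chars == 0 or audio_duration_s <= 0:
        return []

    timings = []
    elapsed_chars = 0
    for word in words:
        start = audio_duration_s * elapsed_chars / total_chars
        elapsed_chars += len(word)
        end = audio_duration_s * elapsed_chars / total_chars
        timings.append((word, start, end))
    return timings


def word_index_at(timings, playback_time_s):
    """Return the index of the word to highlight at `playback_time_s`."""
    if not timings:
        return None
    starts = [start for _, start, _ in timings]
    index = bisect_right(starts, playback_time_s) - 1
    # Clamp so times before the start or after the end still map to a word.
    return min(max(index, 0), len(timings) - 1)


if __name__ == "__main__":
    timings = estimate_word_timings("hello big world", 13.0)
    current = word_index_at(timings, 6.0)
    print(timings[current][0])
```

In a streaming setup the total duration is not known until generation finishes, so this heuristic fits best when applied per sentence or per chunk once that chunk's audio length is available. If accurate timestamps matter, running a separate forced-alignment tool over the finished audio and text would likely give better results than this estimate.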