
Is there a way to speed up the generation process?

Open xxnuo opened this issue 4 months ago • 14 comments

I generated some podcasts using the 7B model, and the results were indeed very impressive. However, the generation is extremely slow. Is there any way to speed it up? I noticed that GPU utilization is not very high, and I'm only getting around 2.4 it/s.

Speaker 1: Hello everyone, and welcome to the VibeVoice podcast channel. I'm your host, Linda.
Speaker 2: And I'm David. It's a pleasure to join Linda today to discuss a major piece of tech news.
Speaker 1: Yes, today's topic concerns every Android user. Google recently announced a major policy change: starting next year, it plans to block the installation of unverified Android apps via "sideloading". David, can you briefly explain what sideloading is?
Speaker 2: Of course. Simply put, sideloading means downloading and installing apps from sources outside the official Google Play store, for example by downloading an APK file directly from a website. This has long been one of the main features distinguishing Android from Apple's iOS, giving users more freedom.
Speaker 1: Right, but now Google seems to be tightening that "freedom". What exactly does the new policy involve?
Speaker 2: According to Google's announcement, starting in 2026, users of certified Android devices will only be able to sideload apps from officially verified developers. In other words, apps from anonymous, unverified developers will no longer be installable via sideloading.
Speaker 1: That sounds like quite a change. Is there a timeline for rolling out the policy?
Speaker 2: There is. Google plans to begin testing this October and to open the new verification platform to all developers in March 2026. By September 2026, the policy will roll out first in countries such as Brazil, Indonesia, Singapore, and Thailand, with a global rollout planned for 2027.
Speaker 1: So why is Google doing this? What's the official rationale?
Speaker 2: Google emphasizes that the move is primarily about improving the security of the Android ecosystem. Their data shows that apps sideloaded from the internet carry more than 50 times as much malware as apps from the Google Play store. By verifying developer identities, Google hopes to better protect users from malware and online scams.
Speaker 1: From a security standpoint, that does sound reasonable. After all, nobody wants their phone attacked by malware. But the policy also seems to have sparked quite a bit of controversy among developers and users, right?
Speaker 2: Indeed. The biggest worry is that Android will become increasingly closed, ever more like Apple's "walled garden". Many users and developers chose Android precisely for its openness. Some see the move as Google tightening its control over app distribution channels, especially against the backdrop of the antitrust lawsuits it is facing.
Speaker 1: I see. On one hand it's about security; on the other it's criticized as a grab for control. So what concrete effects will the new policy have on ordinary users and developers?
Speaker 2: For ordinary users, especially less technical ones, it undoubtedly adds a layer of protection and lowers the risk of accidentally installing malware. But for "power users" who like to experiment with apps and rely on sideloading for specific functionality (such as certain ad blockers or open-source software), their freedom of choice will certainly be curtailed.
Speaker 1: And what about developers, particularly individual developers or small teams?
Speaker 2: That's the other focus of the controversy. Although Google says the verification process is mainly about confirming a developer's identity and will not review app content, developers will still need to submit personal information such as their name, address, business details, and even government-issued ID. For privacy-conscious independent developers or open-source projects, that could be a significant hurdle. People also worry that Google may gradually raise the bar for verification in the future.
Speaker 1: Indeed, this touches on a complex balance between security, freedom, and control. Google's policy change could well be a turning point for the Android ecosystem.
Speaker 2: Exactly. Finding the right balance between protecting the vast majority of users and preserving Android's openness will be a huge challenge for Google, and it will profoundly shape the competition between the Android and iOS ecosystems.
Speaker 1: All right, that's all for today's discussion of Google blocking unverified apps. Thanks, David, for the excellent analysis.
Speaker 2: Thank you, Linda.
Speaker 1: And thank you all for listening. Please subscribe to the VibeVoice channel for more in-depth analysis. See you next time!

https://github.com/user-attachments/files/22002447/audio.5.wav

🎙️ Generating podcast with 2 speakers
📊 Parameters: CFG Scale=1.3, Inference Steps=10
🎭 Speakers: zh-Xinran_woman, zh-Bowen_man
📝 Formatted script with 21 turns

🔄 Processing with VibeVoice (streaming mode)...
⏱️ Generation completed in 903.81 seconds
🎵 Final audio duration: 263.47 seconds
📊 Total chunks: 1976
✨ Generation successful! Complete audio is ready in the 'Complete Audio' tab.
💡 Not satisfied? You can regenerate or adjust the CFG scale for different results.
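
To put those numbers in perspective: the real-time factor (RTF, seconds of audio produced per second of wall-clock time) implied by the log above is well below 1. A quick back-of-the-envelope check in Python:

```python
# Real-time factor (RTF): seconds of audio produced per second of wall-clock
# time. RTF >= 1 means real-time generation; this run is well below that.
gen_time_s = 903.81   # "Generation completed in 903.81 seconds"
audio_s = 263.47      # "Final audio duration: 263.47 seconds"

rtf = audio_s / gen_time_s
print(f"RTF: {rtf:.2f}x")                                        # -> RTF: 0.29x
print(f"Compute per audio second: {gen_time_s / audio_s:.1f}s")  # -> 3.4s
```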
Speaker 1: Hello everyone, and welcome to the VibeVoice podcast channel. I'm your host, Linda, and today I want to share some very interesting and authentic Chinese expressions with you.
Speaker 1: In Chinese, when you want to say something is super easy, just a simple task, you can use the phrase "小菜一碟". It literally means "a small dish of food", but it's used to mean "a piece of cake". For example, if you want to say, "Adding and subtracting three-digit numbers is a piece of cake for me", you can say:
Speaker 1: 三位数的加减法对我来说小菜一碟.
Speaker 1: The next phrase we're going to learn is "你开玩笑吧". It's a very common way to express disbelief, like "Are you kidding me?" or "You must be joking". For instance, when you hear an unbelievable piece of news, such as a friend spending five thousand on a single piece of clothing, you can say:
Speaker 1: 你开玩笑吧, 你花五千块钱买了一件衣服.
Speaker 1: Next, let's learn a phrase for when you suddenly understand something, like a "lightbulb moment". In Chinese, you can say "恍然大悟". It means you suddenly "see the light". For example, when you finally grasp a difficult math concept that has confused you for days, you can say:
Speaker 1: 我困惑这个公式好几天了, 但现在我恍然大悟, 终于明白了.
Speaker 1: For our last one, when you want to say something is super easy, you can use a very vivid phrase: "闭着眼睛都能做". It literally means "can do it with one's eyes closed". For example, if you want to say, "He can use this software with his eyes closed", you can say:
Speaker 1: 这个软件他闭着眼都能用.
Speaker 1: Well, that’s all the time we have for today. Thank you for listening. Please subscribe to VibeVoice, where we share all the interesting things in this world with you.

https://github.com/user-attachments/files/22002458/audio.4.wav

🎙️ Generating podcast with 2 speakers
📊 Parameters: CFG Scale=1.3, Inference Steps=10
🎭 Speakers: zh-Xinran_woman, zh-Bowen_man
📝 Formatted script with 10 turns

🔄 Processing with VibeVoice (streaming mode)...
⏱️ Generation completed in 361.46 seconds
🎵 Final audio duration: 115.33 seconds
📊 Total chunks: 865
✨ Generation successful! Complete audio is ready in the 'Complete Audio' tab.
💡 Not satisfied? You can regenerate or adjust the CFG scale for different results.

xxnuo avatar Aug 27 '25 06:08 xxnuo

Thanks for your attention. The bottleneck in generation is the LLM forward pass, so LLM optimization techniques, such as vLLM, can also be applied to VibeVoice.

pengzhiliang avatar Aug 28 '25 08:08 pengzhiliang

float16, SDPA (if, like me, you don't have FlashAttention) ... and most importantly, the 7B model needs to be quantized.

patientx avatar Aug 28 '25 17:08 patientx
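
For concreteness, these load-time switches map onto an HF-style `from_pretrained` call roughly as sketched below. The import path, class name, and checkpoint id here are assumptions for illustration; check the repo's demo scripts for the exact names.

```python
import torch
# Illustrative import: the exact module/class name may differ between
# VibeVoice versions; check the repo's demo scripts.
from vibevoice.modular.modeling_vibevoice_inference import (
    VibeVoiceForConditionalGenerationInference,
)

model = VibeVoiceForConditionalGenerationInference.from_pretrained(
    "microsoft/VibeVoice-7B",     # checkpoint id is an assumption
    torch_dtype=torch.float16,    # half-precision weights: halves memory vs fp32
    attn_implementation="sdpa",   # or "flash_attention_2" if it is installed
    device_map="cuda",
)
model.eval()
```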

It's been some days, but I still can't find a quantized version of the 7B model. I find the 7B results superior, but the generation time is killing the mood. I get almost 2 s/it in ComfyUI with a 4070 Ti (12 GB VRAM); the 1.5B is far faster at almost 6 it/s. Even if a 7B GGUF were made, the node still couldn't use it.

kukalikuk avatar Aug 30 '25 09:08 kukalikuk

I guess there isn't enough interest in this. It's also interesting that even with my RX 6800, which is far less powerful than your GPU, I'm getting similar speeds. If it weren't for the VRAM problem, we would see the real speeds.

patientx avatar Aug 30 '25 09:08 patientx

@patientx Yes, I have 64 GB of VRAM; it's not the bottleneck here.

xxnuo avatar Aug 30 '25 13:08 xxnuo

> It's been some days, but I still can't find a quantized version of the 7B model. […]

I added bnb_nf4 on-the-fly quantization for ComfyUI. 7B inference takes 9.2 GB of VRAM in total instead of 18 GB for bf16, but nf4 is 2× slower than bf16. Quality is still very good. https://github.com/Mozer/VibeVoice-ComfyUI

Mozer avatar Aug 30 '25 13:08 Mozer
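
For context, on-the-fly NF4 loading with bitsandbytes is typically configured like the sketch below in HF Transformers; the VibeVoice class and checkpoint id are again illustrative assumptions, and the linked ComfyUI node may differ in detail.

```python
import torch
from transformers import BitsAndBytesConfig
# Illustrative import, as in the sketch above; verify against the repo.
from vibevoice.modular.modeling_vibevoice_inference import (
    VibeVoiceForConditionalGenerationInference,
)

# 4-bit NF4 weights with bf16 compute: weights shrink roughly 4x (hence the
# reported 9.2 GB vs 18 GB), but every matmul pays a dequantization step,
# which is why nf4 runs slower than plain bf16.
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,  # also quantizes the quantization constants
)

model = VibeVoiceForConditionalGenerationInference.from_pretrained(
    "microsoft/VibeVoice-7B",        # checkpoint id is an assumption
    quantization_config=bnb,
    device_map="cuda",
)
```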

> I added bnb_nf4 on-the-fly quantization for ComfyUI. 7B inference takes 9.2 GB of VRAM in total instead of 18 GB for bf16. […]

Can you make a pull request here too?

FurkanGozukara avatar Aug 31 '25 08:08 FurkanGozukara

I got the same speeds testing the 7B on a B200 as I did on a 3090 (just under 2× real-time speed).

KevinAHM avatar Sep 01 '25 07:09 KevinAHM

> I added bnb_nf4 on-the-fly quantization for ComfyUI. 7B inference takes 9.2 GB of VRAM in total instead of 18 GB for bf16. […]

Wow, thanks! I now get around 3 it/s with bnb_nf4, better than bf16, which was overloading my VRAM 👍🏻👍🏻👍🏻 Is this also implemented in the main branch? For 1.5B, bnb_nf4 indeed makes generation slower, as you said. I also found that 10 steps works better for my particular language (not English/Chinese).

kukalikuk avatar Sep 01 '25 08:09 kukalikuk

Yeah, this makes absolutely no sense to me, but...

RTX 3080 Ti (12 GB VRAM ... cost me ~$900 about 4-5 years ago): 16 it/s on the 1.5B model

RTX Pro 6000 Blackwell (96 GB VRAM ... ~$10k, released 4 or 5 months ago): 8 it/s on the 1.5B model

nickheyer avatar Oct 05 '25 05:10 nickheyer

I replaced the HF Transformers LLM engine with the exllamav3 engine, which gave a 3× speed-up for the LLM part. Overall speed for VibeVoice-7B on my 3090 is now 9 it/s, and with fewer diffusion steps (5) it now generates at real-time speed with streaming. 7b-exl-4bit + no-llm-bf16 (non-LLM parts in bf16) takes 9.5 GB of VRAM. Quality is still nice. Now I am also adding wav2lip support to this.

https://github.com/mozer/comfyUI-vibevoice-exl3

Mozer avatar Oct 05 '25 05:10 Mozer
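
A closing note on the diffusion-step knob mentioned above: it needs no engine swap at all, since the number of DDPM steps scales the diffusion head's compute roughly linearly. A minimal sketch, assuming a model loaded as in the earlier examples and the step-setting helper used by the repo's demo scripts (verify the name against your version):

```python
# Assumes `model` was loaded as in the earlier sketches. Fewer DDPM steps per
# generated frame => proportionally less diffusion compute per audio chunk.
# Helper name follows the repo's demo scripts; verify it exists in your version.
model.set_ddpm_inference_steps(num_steps=5)  # thread: 5 sounds fine for en/zh; 10 was safer for other languages
```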