training on a new language

Open mohammed-bahumaish opened this issue 4 weeks ago • 29 comments

Hi @YaoyaoChang, is there already a script for pretraining the streaming model on a new language? If not, what would be required to make one?

mohammed-bahumaish avatar Dec 04 '25 16:12 mohammed-bahumaish

We do not plan to allow fine-tuning of the streaming model, due to potential DeepFake risks. Which new languages would you like?

YaoyaoChang avatar Dec 04 '25 16:12 YaoyaoChang

Hindi.

utkarshshukla2912 avatar Dec 04 '25 17:12 utkarshshukla2912

We do not plan to allow fine-tuning of the streaming model, due to potential DeepFake risks. Which new languages would you like?

Arabic

mohammed-bahumaish avatar Dec 04 '25 17:12 mohammed-bahumaish

Spanish

giandiego avatar Dec 04 '25 18:12 giandiego

Polish.

I tried every model, and they are awesome. Please train Polish models or give us a way to do that; it would be awesome for storytelling.

ziom6270 avatar Dec 04 '25 18:12 ziom6270

Spanish

Sheldonimo avatar Dec 04 '25 21:12 Sheldonimo

Thanks for your feedback! We did some initial tests using English speakers, and it appears that the model can produce German, Spanish, Portuguese, Japanese, and Arabic to some extent. However, we haven’t conducted enough training or thorough evaluation for these languages, so we can’t guarantee the quality or stability of the results at this stage.

The actual performance may vary, and we recommend testing your own use cases to see how well it works for you. You can also test other languages and provide us with some feedback, because we don’t understand many of them and aren’t sure how well they actually perform. We also plan to provide more multilingual speaker embeddings in the future to improve cross-lingual performance.

Thanks again for the suggestion and for trying the model!

wenhui0924 avatar Dec 05 '25 03:12 wenhui0924

Urdu language please

haseebsultankhan avatar Dec 05 '25 09:12 haseebsultankhan

@wenhui0924

I conducted several tests on the base text, which was generated with consideration for the specifics of the Polish language. In the attachments, I am including a report consisting of partial logs from the generation process, the source text, and the generated files.

As for my reflection on the model, I did not notice the occurrence of so-called “hallucinations,” and the generation is very fast. Regarding audio specifications, the best performance is achieved with the settings cfg1.5/inference5 or cfg1.7/inference8. The model has a distinctive English accent, reminiscent of a person with Polish roots but raised in an English-speaking country—something like a child emigrating permanently to the USA at the age of seven and then returning to Poland after 30 years (never using the Polish language again in the meantime).

I suspect that if the model is further trained with a Polish audio dataset (flac/wav-320kbps) along with detailed transcription, it will be able to pronounce all the nuances of the language with 100% accuracy. At present, I must admit that complex letters and letter combinations such as rz/sz/cz/ś/ć/ź/ń/ó/ą/ę/ż remain challenging.

Personally, I use the XTTSv2 2.0.3 model, but I must say that VibeVoice-Realtime-0.5B is very promising, especially considering the absence of hallucinations. The prosody and the spacing between commas and periods are at a very high level.

I look forward to your possible corrections and improvements :)

1p_vibevoice.txt 1p_vibevoice_generated_cfg1.1_inference3.wav 1p_vibevoice_generated_cfg1.5_inference5.wav 1p_vibevoice_generated_cfg1.7_inference8.wav 1p_vibevoice_generated_cfg3_inference5.wav 1p_vibevoice_generated_cfg3_inference20.wav 2p_vibevoice.txt 2p_vibevoice_generated_cfg1.5_inference5.wav raports.txt
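
For anyone comparing the attached samples, here is a minimal sketch that reads the settings back out of the file names. It assumes the naming pattern seen in the attachments above (`_cfg<value>_inference<value>.wav`) and that the two numbers are the CFG scale and the number of inference steps; the helper name `parse_settings` is hypothetical and not part of VibeVoice.

```python
import re

# Pattern inferred from the attachment names in this comment (an assumption,
# not an official VibeVoice naming convention).
PATTERN = re.compile(r"_cfg(?P<cfg>\d+(?:\.\d+)?)_inference(?P<steps>\d+)\.wav$")


def parse_settings(filename):
    """Return (cfg_scale, inference_steps), or None if the name does not match."""
    match = PATTERN.search(filename)
    if match is None:
        return None
    return float(match.group("cfg")), int(match.group("steps"))


attachments = [
    "1p_vibevoice.txt",
    "1p_vibevoice_generated_cfg1.1_inference3.wav",
    "1p_vibevoice_generated_cfg1.5_inference5.wav",
    "1p_vibevoice_generated_cfg1.7_inference8.wav",
    "1p_vibevoice_generated_cfg3_inference5.wav",
    "1p_vibevoice_generated_cfg3_inference20.wav",
]

# Map each generated file to its settings; non-audio files map to None.
settings = {name: parse_settings(name) for name in attachments}

for name, value in settings.items():
    print(name, value)
```

This only groups the samples by setting so they are easier to compare by ear; it does not call the model.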

ziom6270 avatar Dec 05 '25 10:12 ziom6270

German please

dingausmwald avatar Dec 06 '25 02:12 dingausmwald

Korean please

onedge avatar Dec 06 '25 02:12 onedge