training on a new language
Hi @YaoyaoChang, is there already a script for pretraining the streaming model on a new language? If not, what's required to make one?
We do not plan to allow fine-tuning of the streaming model, due to potential DeepFake risks. Which new languages would you like?
Hindi.
Arabic
Spanish
Polish.
I tried every model; they're awesome. Please train Polish models or share the way to do that. It would be awesome for storytelling.
Spanish
Thanks for your feedback! We did some initial tests using English speakers, and it appears that the model can produce German, Spanish, Portuguese, Japanese, and Arabic to some extent. However, we haven’t conducted enough training or thorough evaluation for these languages, so we can’t guarantee the quality or stability of the results at this stage.
The actual performance may vary, and we recommend testing your own use cases to see how well it works for you. You can also test other languages and provide us with some feedback, because we don’t understand many of them and aren’t sure how well they actually perform. We also plan to provide more multilingual speaker embeddings in the future to improve cross-lingual performance.
Thanks again for the suggestion and for trying the model!
Urdu language please
@wenhui0924
I conducted several tests on a base text written to cover the specifics of the Polish language. In the attachments, I am including a report consisting of partial logs from the generation process, the source text, and the generated files.
As for my reflection on the model, I did not notice the occurrence of so-called “hallucinations,” and the generation is very fast. Regarding audio specifications, the best performance is achieved with the settings cfg1.5/inference5 or cfg1.7/inference8. The model has a distinctive English accent, reminiscent of a person with Polish roots but raised in an English-speaking country—something like a child emigrating permanently to the USA at the age of seven and then returning to Poland after 30 years (never using the Polish language again in the meantime).
I suspect that if the model is further trained with a Polish audio dataset (flac/wav-320kbps) along with detailed transcription, it will be able to pronounce all the nuances of the language with 100% accuracy. At present, I must admit that such complex combinations of letters like rz/sz/cz/ś/ć/ź/ń/ó/ą/ę/ż remain challenging.
Personally, I use the XTTSv2 2.0.3 model, but I must say that VibeVoice-Realtime-0.5B is very promising, especially considering the absence of hallucinations. The prosody and the pausing at commas and periods are at a very high level.
I look forward to your possible corrections and improvements :)
1p_vibevoice.txt 1p_vibevoice_generated_cfg1.1_inference3.wav 1p_vibevoice_generated_cfg1.5_inference5.wav 1p_vibevoice_generated_cfg1.7_inference8.wav 1p_vibevoice_generated_cfg3_inference5.wav 1p_vibevoice_generated_cfg3_inference20.wav 2p_vibevoice.txt 2p_vibevoice_generated_cfg1.5_inference5.wav raports.txt
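For anyone who wants to reproduce the sweep above, here is a minimal sketch of iterating over the tested CFG-scale/inference-step pairs. Note that `generate_audio` below is a hypothetical stand-in, not the actual VibeVoice inference API; substitute the real call from the official repo.

```python
# Sweep the (cfg_scale, inference_steps) pairs from the report above.
# NOTE: `generate_audio` is a hypothetical placeholder, NOT the real
# VibeVoice API -- replace its body with the actual inference call.

def generate_audio(text: str, cfg_scale: float, inference_steps: int) -> str:
    """Placeholder: pretend to synthesize speech and return the output filename."""
    return f"1p_vibevoice_generated_cfg{cfg_scale:g}_inference{inference_steps}.wav"

# Pairs tested in the report; cfg 1.5/steps 5 and cfg 1.7/steps 8 performed best.
settings = [(1.1, 3), (1.5, 5), (1.7, 8), (3, 5), (3, 20)]

text = "Przykładowy tekst po polsku."  # sample Polish source text
outputs = [generate_audio(text, cfg, steps) for cfg, steps in settings]
for name in outputs:
    print(name)
```

This simply enumerates the five configurations from the attached files, so the output filenames match the attachments listed above.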
German please
Korean please