Polish language thread
In some generations, the first two or three words were a completely incomprehensible hallucination, as if the model were warming up and steering toward Polish. It also read Japanese names surprisingly well, but it read the word "jujitsu", for example, plainly, without the Polonized pronunciation typical in Polish. In some longer generations, the quality deteriorates after about 3 minutes and then returns to normal.
An excellent model; I see huge potential in it. It is definitely more optimized than the 1.5B or Large version. These are voice cloning models, though, and they can be highly unstable and require significant effort in post-production. Generally, the male voice sounds better and behaves more stably, although it is good to have the female voice for comparison. I suspect the model needs to be trained on a larger audio dataset with accurate transcriptions that cover these specific phenomena. Still, the results are genuinely satisfactory. In terms of PC performance, VRAM usage stays constant at about 3.5 GB.
I am attaching all my tests for listening, along with the script and commentary.
@YaoyaoChang Thank you very much for acting so quickly and providing the model for testing. I can't wait for the next sample. Perhaps a small script could be made available for retraining on my own dataset; that would be very helpful for testing the characters specific to the Polish language.
[01] -MAN- -WOMAN- -COMMENTARY- -LOG_DETAILS- -TEXT_SCRIPT-
[02] -MAN- -WOMAN- -COMMENTARY- -LOG_DETAILS- -TEXT_SCRIPT-
[03] -MAN- -WOMAN- -LOG_DETAILS- -TEXT_SCRIPT-
[04] -MAN- -WOMAN- -LOG_DETAILS- -TEXT_SCRIPT-
Let's see what other listeners say in the comments and learn from them.
In response to the suggestion from user "sd983527" in thread #115, I am currently researching available Polish audio datasets. I would like to know more about exactly what kind of dataset @YaoyaoChang and @TeamVibeVoice need.
A few questions that would help facilitate collaboration:
- What total duration of audio does the dataset need?
- What length should a single audio file in the dataset have? (minimum, maximum, division into sentences/complex sentences?)
- Is an exact transcription required? (If so, would the .csv format be appropriate?)
- What audio quality should the dataset have (WAV, FLAC, 16-bit, 24-bit, 22/48 kHz, 320 kbps MP3)?
- Are stereo audio tracks acceptable, or should the dataset be normalized to mono?
- Should the recordings come from a single speaker, or can the dataset be multi-speaker, e.g., several female voices in one dataset? (The same question applies to male voices.)
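While waiting for answers, here is a minimal sketch of how I could report what a candidate dataset actually contains (total duration, channel count, bit depth, sample rate), so the numbers can be compared against whatever the team specifies. It uses only Python's standard `wave` module, so it assumes uncompressed PCM WAV files; FLAC or MP3 would need another library. The function names `describe_wav` and `summarize_dataset` are my own and are not part of any VibeVoice tooling.

```python
import wave
from pathlib import Path


def describe_wav(path):
    """Return basic format facts for one PCM WAV file."""
    with wave.open(str(path), "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),          # 1 = mono, 2 = stereo
            "bit_depth": wav.getsampwidth() * 8,     # bytes per sample -> bits
            "sample_rate": rate,                     # in Hz
            "seconds": frames / rate,                # duration of this file
        }


def summarize_dataset(folder):
    """Return total duration in seconds and the set of formats found."""
    total_seconds = 0.0
    formats = set()
    for path in sorted(Path(folder).rglob("*.wav")):
        info = describe_wav(path)
        total_seconds += info["seconds"]
        formats.add((info["channels"], info["bit_depth"], info["sample_rate"]))
    return total_seconds, formats
```

If `formats` comes back with more than one entry, the dataset mixes formats (for example mono and stereo files) and would probably need normalizing before use, depending on the answers to the questions above.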
Let's check the streaming app for fun