Nikita Grebenyuk
Yes, you can run it on CPU (RAM): just add a `.cpu()` call to the lines in the script that raise an error.
> Thanks, will upload the pre-trained hifi-gan model as well as the configuration file soon. Could you share the link?
The VITS authors don't answer here. You should probably test it yourself, but I think that if you use your own letter set and some letters can replace the comma, dot, etc., why...
Try adding short phrases to your dataset. If the model is trained to say some phonemes only in connection with others, it can't pronounce single words well.
> If it is caused by data-hunger, then how much data needed for each speaker if I make a multi-speaker instance? About 2 hours is the minimum for a good result.
The discriminator model isn't available, so fine-tuning isn't possible.
"All models in the ablation study were trained up to 300k steps" (from the paper).
> So the available checkpoint in the GitHub repository is also trained up to 300k steps? It must be. The training set is 12.5k samples, so it's ~1500 iterations over the whole...
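The ~1500 figure can be sanity-checked with quick arithmetic. This is only a sketch: the step count and dataset size come from the comment above, while the batch size of 64 is an assumption (it is not stated here), and `passes_over_dataset` is a hypothetical helper name. A different effective batch size, for example from multi-GPU training, scales the result proportionally.

```python
def passes_over_dataset(steps: int, batch_size: int, dataset_size: int) -> float:
    """Return how many full passes over the data a given step count implies."""
    return steps * batch_size / dataset_size


# 300k steps and 12.5k training samples are taken from the discussion;
# batch_size=64 is an assumption made for illustration only.
epochs = passes_over_dataset(steps=300_000, batch_size=64, dataset_size=12_500)
print(f"about {epochs:.0f} passes over the dataset")
```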
Wav files that are too big are omitted: https://github.com/jaywalnut310/vits/blob/main/data_utils.py#L302 You can try changing this by increasing the numbers in the boundaries (for example, `[32,300,450,600,750,1000,1200,1400,1600]`) here: https://github.com/jaywalnut310/vits/blob/main/train.py#L70 (but note that too big files may...
A nicer way is to move it into the config and set it up as you want: https://github.com/jaywalnut310/vits/pull/119/commits/490b60abe3978650d4341c0c64dc49ab76287d58
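To see why widening the boundaries keeps longer files, here is a simplified stand-in for length bucketing. It is not the repository's code: `bucket_index` is a hypothetical helper, the rule that a sample needs `boundaries[i] < length <= boundaries[i + 1]` to be kept is an assumption about how the sampler behaves, and `DEFAULT_BOUNDARIES` is an assumed default. Only the wider list comes from the comment above.

```python
from bisect import bisect_left
from typing import Optional, Sequence


def bucket_index(length: int, boundaries: Sequence[int]) -> Optional[int]:
    """Return i such that boundaries[i] < length <= boundaries[i + 1], else None."""
    idx = bisect_left(boundaries, length) - 1
    if 0 <= idx < len(boundaries) - 1:
        return idx
    return None  # outside every bucket, so the sample would be dropped


# Assumed default, for illustration only.
DEFAULT_BOUNDARIES = [32, 300, 400, 500, 600, 700, 800, 900, 1000]
# Wider boundaries suggested in the comment above.
WIDER_BOUNDARIES = [32, 300, 450, 600, 750, 1000, 1200, 1400, 1600]

# Hypothetical sample lengths, in whatever unit the sampler measures.
lengths = [20, 350, 1000, 1300, 1700]

kept_default = [n for n in lengths if bucket_index(n, DEFAULT_BOUNDARIES) is not None]
kept_wider = [n for n in lengths if bucket_index(n, WIDER_BOUNDARIES) is not None]

print("kept with default boundaries:", kept_default)
print("kept with wider boundaries:  ", kept_wider)
```

Under these assumptions, the sample of length 1300 is dropped with the default list but kept with the wider one, while anything above the last boundary is still dropped.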