Steve Korshakov
Yes, my first version was duration-only. This worked well; I added coarse pitch later to improve prosody.
Yes, you need MFA, but you don't need alignments for the full dataset; you can run it on just the files from your samples.
Yes, you can do this: just zero out the required prompts (do not provide any) and you will get random voices, which you can later use as prompts.
MFA is a proxy for text-phoneme pairs. Since the GPT takes text and generates phonemes and durations, you will get everything you need and can pack it into a similar pt file.
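For illustration only, here is a minimal sketch of turning aligner-style intervals into phoneme/duration pairs. It assumes the alignment is already available as contiguous `(phoneme, start_sec, end_sec)` tuples; the function name `intervals_to_durations` and the sample rate and hop length are hypothetical, not the values this model actually uses.

```python
from typing import List, Tuple

# Hypothetical defaults: the real frame rate is not stated in this thread.
SAMPLE_RATE = 16000
HOP_LENGTH = 160


def intervals_to_durations(
    intervals: List[Tuple[str, float, float]],
    sample_rate: int = SAMPLE_RATE,
    hop_length: int = HOP_LENGTH,
) -> List[Tuple[str, int]]:
    """Convert contiguous (phoneme, start_sec, end_sec) intervals
    into (phoneme, n_frames) pairs."""
    if not intervals:
        return []
    frames_per_second = sample_rate / hop_length
    # Round the boundaries, not the individual lengths, so that the
    # durations always sum to the total number of frames.
    prev_frame = round(intervals[0][1] * frames_per_second)
    result = []
    for phoneme, _start, end in intervals:
        end_frame = round(end * frames_per_second)
        result.append((phoneme, max(end_frame - prev_frame, 0)))
        prev_frame = end_frame
    return result


# An empty label stands for silence in this sketch.
example = [("HH", 0.0, 0.1), ("AH", 0.1, 0.25), ("", 0.25, 0.5)]
print(intervals_to_durations(example))
```

The resulting pairs could then be written out with something like `torch.save`, but the exact layout of the pt file depends on the training code and is not shown here.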
> Yes, you need MFA, but you don't need alignment for full dataset, you can just run on files from your samples. Yes
Right now it is trained on LibriTTS-R, which is quite small, about 1k hours at most. I am in the process of preparing a 3 TB dataset that will be used for...
> This model produces good voice quality and prosody for such a small amount of data if we train this model on a good amount of multi-lingual dataset, we will...
In the latest version you can try them by providing custom model names; please ping back if something works. IIRC most of these models are chat-like, not completion-like.
Same problem here, using the cloud version.