Steve Korshakov comments

169 comments by Steve Korshakov

Difference between Phoneme and Text tokenizer

Yes, my first version was duration-only. This worked well; I added coarse pitch later to improve prosody.

Colab for Synthesis

Yes, you need MFA, but you don't need to align the full dataset; you can run it on just the files from your samples.

Colab for Synthesis

Yes, you can do this: just leave out the prompts (do not provide any) and you will get random voices, which you can later use as prompts.

Colab for Synthesis

MFA is a proxy for getting text-phoneme pairs. Since the GPT takes text and generates phonemes and durations, you will get everything you need from it and can pack that into a similar .pt file.
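To make "phonemes and durations" concrete, here is a minimal sketch of turning aligned phoneme intervals into per-phoneme frame counts. It is an illustration only, not the project's actual code: the function name `intervals_to_durations`, the `(phoneme, start_sec, end_sec)` input format, and the `frames_per_second` value are all assumptions, and the layout of the real .pt file is not specified here.

```python
from typing import List, Tuple

# Hypothetical input format: (phoneme, start_sec, end_sec) triples,
# for example as read from an aligner's output.
Interval = Tuple[str, float, float]


def intervals_to_durations(
    intervals: List[Interval], frames_per_second: float
) -> Tuple[List[str], List[int]]:
    """Convert time intervals into integer frame durations.

    Start and end times are rounded to frame indices separately, so
    contiguous intervals never lose or gain frames in total.
    """
    phonemes: List[str] = []
    durations: List[int] = []
    for phoneme, start, end in intervals:
        start_frame = round(start * frames_per_second)
        end_frame = round(end * frames_per_second)
        phonemes.append(phoneme)
        # Guard against negative durations from malformed input.
        durations.append(max(0, end_frame - start_frame))
    return phonemes, durations


if __name__ == "__main__":
    # Assumed frame rate of 100 frames per second, purely for illustration.
    example = [("HH", 0.0, 0.08), ("AH", 0.08, 0.21), ("L", 0.21, 0.30)]
    print(intervals_to_durations(example, 100.0))
```

The resulting two lists are the kind of data that could then be packed into a tensor file, in whatever layout the synthesis code expects.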

Colab for Synthesis

> Yes, you need MFA, but you don't need alignment for full dataset, you can just run on files from your samples.

Yes

How many hours speech data is used?

Right now it is trained on LibriTTS-R, which is quite small: about 1k hours at most. I am in the process of preparing a 3 TB dataset that would be used for...

How many hours speech data is used?

> This model produces good voice quality and prosody for such a small amount of data if we train this model on a good amount of multi-lingual dataset, we will...

Any plan to add latest coding models like WizardCoder-Python, Phind-Codellama?

In the latest version you can try them by providing custom model names. Please ping back if something works. IIRC most of these models are chat-like, not completion-like.

[Bug]: SSH key file not accessible

Same problem here, using the cloud version.