
Use this model for Voice conversion

Open rishikksh20 opened this issue 1 year ago • 11 comments

Hi @adelacvg

Can we use this kind of model for speech-to-speech (voice conversion)?

rishikksh20 avatar Dec 16 '23 12:12 rishikksh20

Yes, and it's exactly what I'm working on. You can have a look here for a rough idea of the approach. The main idea is to utilize Referencenet to enhance the zero-shot capability.

adelacvg avatar Dec 18 '23 02:12 adelacvg

I checked the v4 branch and it looks good to me. Have you trained the model? If so, how is the quality?

rishikksh20 avatar Dec 18 '23 06:12 rishikksh20

I also checked your v3 branch, and the samples sound good. Have you trained that model on any English dataset?

rishikksh20 avatar Dec 18 '23 06:12 rishikksh20

I would like to train v3 and v4 on a large English dataset. Could you guide me a little? Can HuBERT or XLS-R be used to extract the semantic vectors, or is contentvec required?

rishikksh20 avatar Dec 18 '23 06:12 rishikksh20

I do not recommend training with v3 because it still uses inefficient modules like FiLM for timbre addition. As for training with v4, all I can say is that the training is very, very slow, but it's worth it. Using a small batch size and a relatively large learning rate may be a cost-effective approach. The longer the training time, the better the results. Although contentvec may not be perfect, I think it's sufficient. Other semantic features might lead to timbre leakage, although I haven't conducted extensive experiments to validate this.

adelacvg avatar Dec 19 '23 04:12 adelacvg

I trained using the same dataset as v2, which is a mixed dataset containing both Chinese and English.

adelacvg avatar Dec 19 '23 04:12 adelacvg

OK, then I will try to train only v4. Is that repo's implementation complete, or does something remain? If it is complete, have you run any training on it?

rishikksh20 avatar Dec 20 '23 05:12 rishikksh20

The current code is trainable, and I have obtained some promising results. Note that convergence is slow: with a batch size of 32, it takes about 500k steps to yield satisfactory results. I have implemented the code for modules like cfg and offset noise, but for the sake of training stability I have left them out for now. These functionalities can be added through fine-tuning after convergence.

adelacvg avatar Dec 20 '23 08:12 adelacvg
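
For readers unfamiliar with the two techniques named above, here is a minimal sketch of what they usually mean in diffusion models. It assumes "cfg" refers to classifier-free guidance, which the thread does not spell out. The function names, arguments, and the plain-list representation are hypothetical and chosen for illustration; they are not taken from the repository.

```python
import random


def cfg_combine(cond, uncond, guidance_scale):
    """Classifier-free guidance at sampling time (assumed meaning of "cfg").

    cond and uncond are the model's predictions with and without the
    conditioning input. A scale of 0 returns the unconditional prediction,
    1 returns the conditional one, and larger values extrapolate past it.
    """
    return [u + guidance_scale * (c - u) for c, u in zip(cond, uncond)]


def offset_noise(channels, frames, strength, rng):
    """Gaussian noise plus a per-channel offset shared across all frames.

    This is the common form of offset noise; strength is a hypothetical
    hyperparameter controlling the size of the shared offset.
    """
    noise = []
    for _ in range(channels):
        shift = strength * rng.gauss(0.0, 1.0)  # one offset per channel
        noise.append([rng.gauss(0.0, 1.0) + shift for _ in range(frames)])
    return noise


if __name__ == "__main__":
    rng = random.Random(0)
    print(cfg_combine([1.0, 2.0], [0.0, 1.0], guidance_scale=2.0))
    print(len(offset_noise(channels=4, frames=8, strength=0.1, rng=rng)))
```

Classifier-free guidance also needs the conditioning to be dropped at random during training so the model learns an unconditional prediction, which may be one reason it is easier to introduce in a fine-tuning stage than from the start.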

What was the dataset size, and how many GPUs did you train the model on? I am planning to train this model on the Multilingual LibriSpeech dataset, which has 50k hours of data. Before that, I will do a trial run on a smaller dataset of around 3k to 5k hours to check the parameters and training stability.

rishikksh20 avatar Dec 20 '23 08:12 rishikksh20

I only used 300 hours of data, and the training was done exclusively on two GeForce RTX 3090 GPUs.

adelacvg avatar Dec 20 '23 10:12 adelacvg
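
To put the numbers from this exchange side by side (batch size 32, about 500k steps, 300 hours of data), the rough calculation below estimates how many passes over the dataset that run corresponds to. The average clip length is not stated anywhere in the thread, so `avg_clip_seconds` is purely an assumption, and the thread does not say whether the batch size of 32 is per GPU or total; this sketch treats it as the total.

```python
def epochs_seen(batch_size, steps, dataset_hours, avg_clip_seconds):
    """Approximate number of passes over the dataset during training."""
    clips_seen = batch_size * steps
    clips_in_dataset = dataset_hours * 3600 / avg_clip_seconds
    return clips_seen / clips_in_dataset


if __name__ == "__main__":
    # avg_clip_seconds=5.0 is an assumed value, not a figure from the thread.
    print(round(epochs_seen(32, 500_000, 300, 5.0), 1))
```

Under the same assumptions, a 3k to 5k hour dataset would be covered far fewer times in 500k steps, so the step count reported here may not transfer directly to larger datasets.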

Hi @adelacvg, is the implementation of this end-to-end TTS repo complete? I have tested NS2VC v4 on a 500-hour Hindi dataset with Whisper features and it's working great. I have a few findings on that repo, which I will share in its issues.

rishikksh20 avatar Jan 17 '24 05:01 rishikksh20