ttts
Use this model for Voice conversion
Hi @adelacvg
Can we use this kind of model for speech-to-speech (voice conversion)?
Yes, and it's exactly what I'm working on. You can have a look here for a rough idea of the approach. The main idea is to utilize Referencenet to enhance the zero-shot capability.
I checked the v4 branch and it looks good to me. Have you trained the model? If yes, how's the quality?
I also checked your v3 branch and the samples sound good. Have you trained that model on any English dataset?
I would like to train v3 and v4 on a large English dataset. Would you guide me a little bit? Can HuBERT or XLS-R be used to extract the semantic vectors, or is contentvec the only option?
I do not recommend training with v3 because it still uses inefficient modules like FiLM for timbre addition. As for training with v4, all I can say is that the training is very, very slow, but it's worth it. Using a small batch size and a relatively large learning rate may be a cost-effective approach. The longer the training time, the better the results. Although contentvec may not be perfect, I think it's sufficient. Other semantic features might lead to timbre leakage, although I haven't conducted extensive experiments to validate this.
I trained using the same dataset as v2, which is a mixed dataset containing both Chinese and English.
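For anyone following along, the cost-effective setup described above (small batch size, relatively large learning rate, long training) can be sketched roughly as below. This is only an illustration with a toy stand-in model and assumed hyperparameter values (32 / 3e-4 are taken from the thread and common practice), not the repo's actual training code:

```python
import torch

# Toy stand-in for the v4 network; the real model is much larger.
model = torch.nn.Linear(256, 80)

# Small batch size with a relatively large learning rate, then
# simply training for a long time, as suggested above.
batch_size = 32              # ~500k steps to converge at this size
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

for step in range(10):       # stand-in for a very long training loop
    x = torch.randn(batch_size, 256)        # e.g. semantic features
    target = torch.randn(batch_size, 80)    # e.g. mel-spectrogram frames
    loss = torch.nn.functional.mse_loss(model(x), target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```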
Ok then I will try to train v4 only, but is that repo completely implemented or does something remain? If it's complete, have you run any kind of training on it?
The current code is trainable, and I have obtained some promising results. It's worth noting that convergence is slow: with a batch size of 32, it takes about 500k steps to yield satisfactory results. I have implemented the code for modules like cfg and offset noise, but for the sake of training stability I haven't enabled them for now. These functionalities can be added through fine-tuning after convergence.
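For readers unfamiliar with the two modules mentioned (cfg and offset noise), here is a minimal sketch of how these techniques are commonly applied in diffusion training. The function names, shapes, and the 0.1 strengths are illustrative assumptions, not the repo's actual implementation:

```python
import torch

def add_offset_noise(x0, offset_strength=0.1):
    """Offset noise: add a per-sample constant shift to the Gaussian
    noise so the model can also learn global level/offset changes."""
    noise = torch.randn_like(x0)
    # one scalar offset per sample, broadcast over the remaining dims
    offset = torch.randn(x0.shape[0], *([1] * (x0.dim() - 1)))
    return noise + offset_strength * offset

def drop_condition_for_cfg(cond, drop_prob=0.1):
    """Classifier-free guidance (cfg) training: randomly zero out the
    conditioning so one model learns both conditional and
    unconditional denoising."""
    keep = (torch.rand(cond.shape[0], *([1] * (cond.dim() - 1)))
            >= drop_prob).float()
    return cond * keep

x0 = torch.randn(4, 80, 100)    # e.g. a batch of mel-spectrograms
cond = torch.randn(4, 256)      # e.g. reference/timbre embeddings

noise = add_offset_noise(x0)
cond_in = drop_condition_for_cfg(cond, drop_prob=0.1)
```

Both are drop-in changes to a standard diffusion training step, which is why they can be added later via fine-tuning once the base model has converged.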
What dataset size, and on how many GPUs did you train the model? Actually, I am planning to train this model on the Multilingual LibriSpeech dataset, which has 50k hours of data. But before that I will do a demo training on a small dataset of around 3k to 5k hours to check the parameters and training stability.
I only used 300 hours of data, and the training was done exclusively on two GeForce RTX 3090 GPUs.
Hi @adelacvg, is the implementation of this end-to-end TTS repo complete? I have tested NS2VC v4 on 500 hrs of a Hindi dataset with Whisper features and it's working great. I have a few findings on that repo which I will share in that repo's issues.