
VoiceCraft with Parler-TTS's 10K hours speech data

Open rishikksh20 opened this issue 10 months ago • 3 comments

Hi @jasonppy Have you looked at Hugging Face's Data-Speech, a 10k-hour set of clean, curated TTS data: https://huggingface.co/datasets/parler-tts/mls-eng-10k-tags_tagged_10k_generated . I think training the 830M model on this data will produce excellent and robust samples. I am planning to do some multi-lingual training on a large dataset. I have fine-tuned the 330M model on 1k hours of multi-lingual data, and the good news is that it worked well and also preserves accents when we use multi-lingual lines for TTS.

rishikksh20 avatar Apr 21 '24 09:04 rishikksh20

@jasonppy Have you tried using Vocos for the decoding task rather than the EnCodec decoder? For one, it upsamples the output to 24 kHz, which leads to clearer, crisper, better voice quality.

rishikksh20 avatar Apr 30 '24 08:04 rishikksh20

Another suggestion that might improve audio quality is to replace EnCodec entirely with DAC, similar to Parler-TTS (https://github.com/huggingface/parler-tts/blob/main/parler_tts/dac_wrapper/configuration_dac.py). It yields 44.1 kHz audio at 8 kbps bandwidth.

rishikksh20 avatar Apr 30 '24 09:04 rishikksh20

Hi @rishikksh20, I really like your ideas. I strongly believe this model has great potential, but sadly the output audio quality is quite poor, being limited to only 16 kHz. Even with high-quality input audio, you have to accept that the output will be quite noisy. Were you able to do any training or testing on what you proposed?

If you don't mind, could you share a notebook (if you have one) with the code you used for the fine-tuning you mentioned? I'd like to try fine-tuning too.

Thank you very much.

Sweetapocalyps3 avatar May 28 '24 09:05 Sweetapocalyps3