more training details of the TTS enhanced models

Open zjlww opened this issue 1 year ago • 7 comments

Hi, thank you for open-sourcing your excellent work. ❤️

I would like to compare with VoiceCraft as a baseline for my research. I have observed that you have released three TTS enhanced models. I am curious about the training datasets used for all these models. Can I utilize them to evaluate zero-shot TTS models?

zjlww, Apr 23 '24 12:04

Thanks! The 830M TTS enhanced and 330M TTS enhanced (to be uploaded) models are trained on GigaSpeech + LibriLight. I recommend using the 830M TTS enhanced model for evaluation.

jasonppy, Apr 23 '24 13:04

Hi @jasonppy -- I'm curious, if you can spare the details, how exactly did you train the TTS enhanced model compared to the base model? Is it a separate training script? Separate loss? Or simply separate data?

Thanks a lot.

rlenain, Apr 26 '24 08:04

The TTS enhanced models are trained without the first rearrangement step introduced in the paper (i.e., no masking).

jasonppy, Apr 26 '24 15:04
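To make the difference concrete, here is a minimal sketch of what skipping the masking step could look like on a flat token list. It assumes the first rearrangement step moves each masked span to the end of the sequence and leaves a placeholder behind; the function names, the `<M0>`-style placeholders, and the span format are hypothetical and are not taken from the VoiceCraft repo.

```python
from typing import List, Tuple, Union

Token = Union[int, str]


def causal_mask_rearrange(tokens: List[int], spans: List[Tuple[int, int]]) -> List[Token]:
    """Move each [start, end) span to the end of the sequence.

    A placeholder such as "<M0>" marks where the span used to be and also
    introduces the moved span at the end. Spans are assumed not to overlap.
    """
    prefix: List[Token] = []
    suffix: List[Token] = []
    cursor = 0
    for i, (start, end) in enumerate(sorted(spans)):
        mask = f"<M{i}>"  # hypothetical placeholder naming
        prefix.extend(tokens[cursor:start])  # unmasked tokens before the span
        prefix.append(mask)                  # placeholder at the original position
        suffix.append(mask)                  # same placeholder introduces the span
        suffix.extend(tokens[start:end])     # the span itself, moved to the end
        cursor = end
    prefix.extend(tokens[cursor:])           # remaining unmasked tokens
    return prefix + suffix


def no_mask_rearrange(tokens: List[int]) -> List[Token]:
    """The "no masking" case: the sequence stays in its original order."""
    return list(tokens)


example = [1, 2, 3, 4, 5, 6]
masked = causal_mask_rearrange(example, [(2, 4)])
unmasked = no_mask_rearrange(example)
```

With no masked spans the model only ever sees plain left-to-right sequences, which matches the TTS setting where everything after the prompt is generated in order.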

Thanks !

rlenain, Apr 30 '24 08:04

Sorry, actually there is something that I don't understand: is the TTS enhanced model trained from scratch as such, or simply finetuned with that specific objective (i.e. no masking) from the base 830m model? Is there a specific script / recipe that exists in the repo to train/finetune like you trained the TTS enhanced model?

Thanks a lot!

rlenain, May 01 '24 09:05

They are finetuned from the giga830M/giga330M checkpoints, which were trained with causal masking. The scripts have not been uploaded to the repo yet.

jasonppy, May 01 '24 15:05

I tested the TTS enhanced models, both the 330M and the 830M. Sometimes the output keeps repeating for too long, or the model can't pronounce short words. Maybe we can set some rules to decide when to stop predicting, or add ASR post-processing to check whether the pronunciation is correct. Attachments: image, test_sample.zip

Approximetal, Jun 06 '24 10:06
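The two ideas in the comment above could be prototyped roughly as follows. This is only a sketch under assumptions: `has_trailing_repetition` is a hypothetical stop rule that flags a short pattern repeating at the end of the generated token sequence, and `word_error_rate` compares the target text with a transcript that is assumed to come from some external ASR system (not included here). Neither function is part of VoiceCraft, and the thresholds are illustrative.

```python
from typing import List, Sequence


def has_trailing_repetition(tokens: Sequence[int], max_period: int = 4, min_repeats: int = 3) -> bool:
    """Return True if the sequence ends with a short pattern (1..max_period
    tokens long) repeated at least min_repeats times in a row."""
    tokens = list(tokens)
    for period in range(1, max_period + 1):
        window = period * min_repeats
        if len(tokens) < window:
            continue
        tail = tokens[-window:]
        unit = tail[:period]
        if all(tail[i:i + period] == unit for i in range(0, window, period)):
            return True
    return False


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref: List[str] = reference.lower().split()
    hyp: List[str] = hypothesis.lower().split()
    if not ref:
        return 0.0 if not hyp else 1.0
    prev = list(range(len(hyp) + 1))  # distances for an empty reference prefix
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def needs_regeneration(target_text: str, asr_transcript: str, max_wer: float = 0.2) -> bool:
    """Flag a sample whose ASR transcript drifts too far from the target text."""
    return word_error_rate(target_text, asr_transcript) > max_wer
```

In practice codec tokens rarely repeat exactly, so a check like this would probably need a looser similarity measure, and the WER threshold would have to be tuned on real samples. A flagged sample could simply be regenerated with a different seed.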