Audiovisual-Synthesis
Data required for Training
To train a model from scratch, it needs about 30 minutes of the target speaker’s speech data and around 10k iterations to converge.

Does this have to be a single 30-minute audio file, or can it be multiple shorter audio files?
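
If the data does come as several clips, their combined length can be checked against the 30-minute figure. A minimal sketch, assuming uncompressed WAV clips in a single folder; the function name `total_wav_minutes` and the folder name `speaker_clips` are hypothetical and not part of this repository:

```python
import wave
from pathlib import Path


def total_wav_minutes(directory):
    """Sum the durations of all .wav files directly inside `directory`, in minutes."""
    total_seconds = 0.0
    for path in sorted(Path(directory).glob("*.wav")):
        with wave.open(str(path), "rb") as clip:
            # Duration of one clip = number of frames / frames per second.
            total_seconds += clip.getnframes() / clip.getframerate()
    return total_seconds / 60.0
```

For example, `total_wav_minutes("speaker_clips") >= 30` would indicate that the clips together reach the stated amount. This only measures duration; whether the training code accepts multiple files is still the open question here.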