ppg-vc
Training strategy
Hey, I am back again with another question :P
Can I interpret the two-stage training scheme as:
- Training the CTC-Attention phoneme recognizer, the speaker encoder, and the vocoder. These three can be trained separately, each on its own.
- Training seq2seqMoL, which needs the outputs of the CTC-Attention phoneme recognizer and the speaker encoder. Each training instance looks like (A's sentence_1, B's sentence_x, B's sentence_1), and the MSE is computed between the model's output for B's sentence_1 and the ground-truth B's sentence_1.
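
To make the second point concrete, here is roughly how I picture one training step, as a sketch only. The names `TrainingInstance`, `stage2_loss`, `ppg_extractor`, `speaker_encoder`, and `seq2seq` are placeholders I made up, not functions from this repo, and I am assuming the first-step modules are already trained and frozen, and that the predicted and target frame sequences have the same length.

```python
from dataclasses import dataclass
from typing import Callable, List

Frames = List[List[float]]  # a sequence of feature frames, e.g. mel frames


@dataclass
class TrainingInstance:
    # Hypothetical container for the triple described above.
    source_utt: Frames     # A's sentence_1, fed to the phoneme recognizer
    reference_utt: Frames  # B's sentence_x, fed to the speaker encoder
    target_mel: Frames     # B's sentence_1, the ground truth


def mse(pred: Frames, target: Frames) -> float:
    """Mean squared error over all frames and feature dimensions."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same number of frames")
    total, count = 0.0, 0
    for p_frame, t_frame in zip(pred, target):
        for p, t in zip(p_frame, t_frame):
            total += (p - t) ** 2
            count += 1
    return total / count


def stage2_loss(
    instance: TrainingInstance,
    ppg_extractor: Callable,    # assumed pretrained and frozen
    speaker_encoder: Callable,  # assumed pretrained and frozen
    seq2seq: Callable,          # the only module being trained here
) -> float:
    ppg = ppg_extractor(instance.source_utt)       # content features
    spk = speaker_encoder(instance.reference_utt)  # speaker embedding
    pred = seq2seq(ppg, spk)                       # predicted frames
    return mse(pred, instance.target_mel)
```

In this picture only `seq2seq` would receive gradient updates, while the other two modules just provide its inputs.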
Please correct me if I am wrong.
Thanks!