Kaizhi Qian
You could just skip the shorter utterances.
The output of the f0 predictor is a 257-dimensional logit vector rather than a one-hot vector, so you need to use a cross-entropy loss, as indicated in the paper.
The target is the quantized ground-truth f0, following https://arxiv.org/abs/2004.07370
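To make the two replies above concrete, here is a minimal sketch of how a normalized f0 contour could be quantized into 257 classes (one reserved for unvoiced frames plus 256 voiced bins) to serve as the cross-entropy target. The function name, the normalization to (0, 1], and the bin layout are assumptions for illustration; check the repo's own preprocessing for the exact scheme.

```python
import numpy as np

def quantize_f0(f0, num_bins=256):
    """Quantize a normalized f0 contour into num_bins + 1 classes.

    Class 0 is reserved for unvoiced frames (f0 <= 0); voiced frames,
    assumed normalized into (0, 1], map to classes 1..num_bins. This
    yields the 257-way target for the cross-entropy loss.
    """
    f0 = np.asarray(f0, dtype=np.float64)
    voiced = f0 > 0
    idx = np.zeros(f0.shape, dtype=np.int64)
    # clip voiced values to (0, 1], then scale into bins 1..num_bins
    idx[voiced] = np.ceil(
        np.clip(f0[voiced], 1e-8, 1.0) * num_bins
    ).astype(np.int64)
    return idx

# toy contour: two unvoiced frames, then rising normalized f0
targets = quantize_f0([0.0, -1.0, 0.25, 0.5, 1.0])
print(targets.tolist())  # -> [0, 0, 64, 128, 256]
```

The predictor's 257-dim logits at each frame are then compared against these integer class indices with a standard cross-entropy loss.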
The posted solution is to use the vocoder from AutoVC. The two projects share the same vocoder, so it is not included in this repo.
First of all, you need to install the appropriate version of r9y9's WaveNet vocoder, which is a large and delicate repo in its own right. We did not include it in our...
no, the pretrained model only works for speakers in the training set
@leijue222 yes you can, but you need to re-train the model.
You can make it generalize to unseen speakers by training it the same way as AutoVC.
@skol101 it means training with generalized speaker embeddings instead of one-hot embeddings
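The difference between the two conditioning schemes can be sketched as follows. A one-hot setup learns a per-speaker lookup table, so it only covers the training speakers; a generalized setup conditions on a continuous embedding computed from the utterance itself, so any speaker can be handled at inference. The speaker encoder here is a hypothetical stand-in (mean-pooled features through a random projection); a real system would use a trained speaker verification encoder such as the d-vector model AutoVC uses.

```python
import numpy as np

rng = np.random.default_rng(0)
num_speakers, dim = 4, 8

# One-hot conditioning: a learned table indexed by speaker id.
# Only defined for the num_speakers training speakers.
table = rng.normal(size=(num_speakers, dim))

def one_hot_embedding(speaker_id):
    return table[speaker_id]  # IndexError for unseen speaker ids

# Generalized conditioning: map any utterance's features to a
# fixed-size embedding. Stand-in encoder for illustration only.
proj = rng.normal(size=(80, dim))

def speaker_embedding(mel):  # mel: (T, 80) frame features
    return mel.mean(axis=0) @ proj  # works for any speaker

mel_unseen = rng.normal(size=(120, 80))
emb = speaker_embedding(mel_unseen)
print(emb.shape)  # -> (8,)
```

Re-training with the generalized embedding in place of the one-hot table is what lets the model transfer to speakers outside the training set.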