Kaizhi Qian
The hyperparameters are listed in the paper. I did not use any tricks such as a learning-rate schedule. As I said, it is very easy to train.
If your data is very different from VCTK, you will probably need to re-train the F0-converter.
Yes. In that case, you probably need to tweak other parts of the model as well.
mfcc_stats.pkl contains the mean and standard deviation of the MFCCs. spk2emb_82.pkl is a mapping from speaker name to one-hot embedding.
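For illustration, a minimal sketch of loading and inspecting the two files; the internal structure (key names, value types) is an assumption here and should be checked after loading, since the source only states what the files contain, not their exact layout.

```python
# Illustrative sketch only; the exact keys/structure of the pickles are assumptions.
import pickle

with open('mfcc_stats.pkl', 'rb') as f:
    mfcc_stats = pickle.load(f)   # assumed: MFCC mean/std (and possibly the DCT matrix)

with open('spk2emb_82.pkl', 'rb') as f:
    spk2emb = pickle.load(f)      # assumed: {speaker_name: one-hot embedding}

print(type(mfcc_stats), type(spk2emb))
# e.g. spk2emb['p225'] would give the one-hot vector for that VCTK speaker (illustrative key)
```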
`dctmx = scipy.fftpack.dct(np.eye(80), type=2, axis=1, norm='ortho')`

You can just use the `dctmx` stored in `mfcc_stats` if you use the same spectrogram specifications.
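As a hedged sketch (not the repo's actual preprocessing code), this is how such a DCT matrix could be applied to an 80-bin mel-spectrogram to get the cepstrum and normalize it with the stored statistics; the key names `mean` and `std` inside mfcc_stats.pkl are assumptions.

```python
# Illustrative only: the key names ('mean', 'std') are assumptions.
import numpy as np
import scipy.fftpack

dctmx = scipy.fftpack.dct(np.eye(80), type=2, axis=1, norm='ortho')

def mel_to_cepstrum(mel_spec, stats):
    """mel_spec: (T, 80) log-mel spectrogram; stats: dict loaded from mfcc_stats.pkl."""
    cep = mel_spec @ dctmx                        # DCT along the frequency axis
    return (cep - stats['mean']) / stats['std']   # normalize with the stored statistics
```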
What matters is the model's required inputs. The model requires the source cepstrum, the source cepstrum lengths, masks made from the source cepstrum lengths, and the target speaker embedding. That's what the dictionary provides.
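A hedged sketch of assembling such a dictionary; the key names, shapes, and padding scheme below are hypothetical, not the repo's actual interface.

```python
# Hypothetical example of packing the four required inputs into a dict.
import numpy as np

def make_batch(cepstrum, spk_emb, max_len):
    """cepstrum: (T, n_mfcc) source cepstrum; spk_emb: target speaker one-hot."""
    length = cepstrum.shape[0]
    padded = np.zeros((max_len, cepstrum.shape[1]), dtype=np.float32)
    padded[:length] = cepstrum
    mask = np.arange(max_len) >= length        # True where frames are padding
    return {
        'cep_src': padded,    # source cepstrum (padded)
        'len_src': length,    # source cepstrum length
        'mask_src': mask,     # mask made from the length
        'emb_trg': spk_emb,   # target speaker embedding
    }
```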
You are welcome. For the basics, please refer to the paper. https://arxiv.org/abs/2106.08519
The rhythm code provides the alignment information. The decoder just uses this information automatically to align the content code and/or pitch code.
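As an assumption-laden toy illustration (the actual mechanism is what the paper and code specify), one can picture the rhythm code as fixing the output time axis, with the other codes brought to that length and stacked before decoding:

```python
# Toy illustration only; this is NOT the repo's decoder, just the idea that the
# rhythm code's time axis dictates how the other codes are stretched into alignment.
import numpy as np

def align_to_rhythm(code, t_rhythm):
    """Nearest-neighbor resampling of a (T, d) code to the rhythm code's length."""
    idx = np.linspace(0, code.shape[0] - 1, t_rhythm).round().astype(int)
    return code[idx]

rhythm_code = np.random.randn(120, 8)    # provides the alignment (time length 120)
content_code = np.random.randn(40, 16)   # content code at a coarser rate
pitch_code = np.random.randn(60, 4)      # pitch code at yet another rate

decoder_input = np.concatenate(
    [rhythm_code, align_to_rhythm(content_code, 120), align_to_rhythm(pitch_code, 120)],
    axis=1,                               # stack along the feature axis
)
print(decoder_input.shape)                # (120, 28)
```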
You can understand it by reading the code for pitch conversion.
With the code released on GitHub and the extensive discussion and description in the paper, one should be able to reproduce it with very little effort.