Kaizhi Qian

The hyperparams are listed in the paper. I did not use any tricks such as an LR schedule. As I said, it is very easy to train.

If your data is very different from VCTK, you probably need to re-train the F0 converter.

Yes. In that case, you probably need to tweak other parts of the model as well.

mfcc_stats.pkl contains the mean and std of the MFCCs. spk2emb_82.pkl is a mapping from speaker name to one-hot embedding.
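For concreteness, a minimal sketch of loading the two files; the exact layout of mfcc_stats.pkl (which entries hold the mean and std) is an assumption based on the description above, so check the repo's data loader.

```python
import pickle

# Assumed layout: per-dimension statistics of the MFCC features.
with open('mfcc_stats.pkl', 'rb') as f:
    mfcc_stats = pickle.load(f)

# Maps speaker name to a one-hot embedding, e.g. spk2emb['p225'] for VCTK.
with open('spk2emb_82.pkl', 'rb') as f:
    spk2emb = pickle.load(f)
```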

```python
import numpy as np
import scipy.fftpack

dctmx = scipy.fftpack.dct(np.eye(80), type=2, axis=1, norm='ortho')
```

You can just use the dctmx stored in mfcc_stats if you use the same spectrogram specifications.
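To show how that matrix is typically used, here is a minimal sketch, assuming an 80-bin log-mel spectrogram of shape (T, 80) and that mean and std come from mfcc_stats.pkl; the function name is illustrative, not from the repo.

```python
import numpy as np
import scipy.fftpack

dctmx = scipy.fftpack.dct(np.eye(80), type=2, axis=1, norm='ortho')

def melspec_to_cepstrum(melspec, mean, std):
    # Right-multiplying by dctmx applies an orthonormal DCT-II along
    # the mel axis of each frame, yielding cepstral coefficients.
    cep = melspec @ dctmx
    # Normalize with the statistics from mfcc_stats.pkl (assumed layout).
    return (cep - mean) / std
```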

What matters is the model's required inputs. The model requires the source cepstrum, the source cepstrum lengths, masks made from the source cepstrum lengths, and the target speaker embedding. That's what the dictionary provides.
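As an illustration, a hedged sketch of such a dictionary; the key names and padding scheme are hypothetical, not the repo's actual field names.

```python
import numpy as np

def make_inputs(cep, spk_emb, max_len):
    # cep: (length, n_cep) source cepstrum; spk_emb: target speaker embedding.
    length = cep.shape[0]
    mask = np.zeros(max_len, dtype=bool)
    mask[length:] = True  # True marks padded frames
    padded = np.pad(cep, ((0, max_len - length), (0, 0)))
    return {
        'cep_src': padded,   # source cepstrum, zero-padded to max_len
        'len_src': length,   # source cepstrum length
        'mask_src': mask,    # mask made from the length
        'emb_tgt': spk_emb,  # target speaker embedding
    }
```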

You are welcome. For the basics, please refer to the paper: https://arxiv.org/abs/2106.08519

The rhythm code provides the alignment information. The decoder simply uses this information to align the content code and/or pitch code automatically.
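One way to picture this is the sketch below; it is a guess at the mechanism rather than the repo's actual decoder, and it assumes all codes have already been brought to a common frame rate so that the rhythm code anchors the time axis.

```python
import torch

def decode(rhythm, content, pitch, spk_emb, decoder):
    # rhythm/content/pitch: (B, T, d_*) per-frame codes;
    # spk_emb: (B, d_spk) utterance-level speaker embedding.
    T = rhythm.size(1)
    spk = spk_emb.unsqueeze(1).expand(-1, T, -1)
    # Frame-wise concatenation: the decoder's layers learn to use the
    # rhythm code's temporal structure to align the other codes.
    x = torch.cat([rhythm, content, pitch, spk], dim=-1)
    return decoder(x)  # e.g. (B, T, n_mels)
```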

You can understand it by reading the code for pitch conversion.

With the code released on GitHub and the extensive discussions and descriptions in the paper, one should be able to reproduce it with very little effort.