Kaizhi Qian
The hyperparameters are listed in the paper. I did not use any tricks such as a learning-rate schedule. As I said, it is very easy to train.
If your data is very different from VCTK, you will probably need to re-train the F0-converter.
Yes. In that case, you probably need to tweak other parts of the model as well.
mfcc_stats.pkl contains the mean and standard deviation of the MFCCs. spk2emb_82.pkl is a mapping from speaker name to one-hot embedding.
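For illustration, a minimal sketch of loading and inspecting the two files; the internal structure (key names, value types) is an assumption here and should be checked after loading, since the source only states what the files contain, not their exact layout.

```python
# Illustrative sketch only; the exact keys/structure of the pickles are assumptions.
import pickle

with open('mfcc_stats.pkl', 'rb') as f:
    mfcc_stats = pickle.load(f)   # assumed: MFCC mean/std (and possibly the DCT matrix)

with open('spk2emb_82.pkl', 'rb') as f:
    spk2emb = pickle.load(f)      # assumed: {speaker_name: one-hot embedding}

print(type(mfcc_stats), type(spk2emb))
# e.g. spk2emb['p225'] would give the one-hot vector for that VCTK speaker (illustrative key)
```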
`dctmx = scipy.fftpack.dct(np.eye(80), type=2, axis=1, norm='ortho')`

You can just use the `dctmx` stored in `mfcc_stats` if you use the same spectrogram specifications.
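As a hedged sketch (not the repo's actual preprocessing code), this is how such a DCT matrix could be applied to an 80-bin mel-spectrogram to get the cepstrum and normalize it with the stored statistics; the key names `mean` and `std` inside mfcc_stats.pkl are assumptions.

```python
# Illustrative only: the key names ('mean', 'std') are assumptions.
import numpy as np
import scipy.fftpack

dctmx = scipy.fftpack.dct(np.eye(80), type=2, axis=1, norm='ortho')

def mel_to_cepstrum(mel_spec, stats):
    """mel_spec: (T, 80) log-mel spectrogram; stats: dict loaded from mfcc_stats.pkl."""
    cep = mel_spec @ dctmx                        # DCT along the frequency axis
    return (cep - stats['mean']) / stats['std']   # normalize with the stored statistics
```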
What matters is the model's required inputs. The model requires the source cepstrum, the source cepstrum lengths, masks made from the source cepstrum lengths, and the target speaker embedding. That's what the dictionary provides.
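A hedged sketch of assembling such a dictionary; the key names, shapes, and padding scheme below are hypothetical, not the repo's actual interface.

```python
# Hypothetical example of packing the four required inputs into a dict.
import numpy as np

def make_batch(cepstrum, spk_emb, max_len):
    """cepstrum: (T, n_mfcc) source cepstrum; spk_emb: target speaker one-hot."""
    length = cepstrum.shape[0]
    padded = np.zeros((max_len, cepstrum.shape[1]), dtype=np.float32)
    padded[:length] = cepstrum
    mask = np.arange(max_len) >= length        # True where frames are padding
    return {
        'cep_src': padded,    # source cepstrum (padded)
        'len_src': length,    # source cepstrum length
        'mask_src': mask,     # mask made from the length
        'emb_trg': spk_emb,   # target speaker embedding
    }
```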
You are welcome. For the basics, please refer to the paper. https://arxiv.org/abs/2106.08519
The rhythm code provides the alignment information. The decoder just uses this information automatically to align the content code and/or pitch code.
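As an assumption-laden toy illustration (the actual mechanism is what the paper and code specify), one can picture the rhythm code as fixing the output time axis, with the other codes brought to that length and stacked before decoding:

```python
# Toy illustration only; this is NOT the repo's decoder, just the idea that the
# rhythm code's time axis dictates how the other codes are stretched into alignment.
import numpy as np

def align_to_rhythm(code, t_rhythm):
    """Nearest-neighbor resampling of a (T, d) code to the rhythm code's length."""
    idx = np.linspace(0, code.shape[0] - 1, t_rhythm).round().astype(int)
    return code[idx]

rhythm_code = np.random.randn(120, 8)    # provides the alignment (time length 120)
content_code = np.random.randn(40, 16)   # content code at a coarser rate
pitch_code = np.random.randn(60, 4)      # pitch code at yet another rate

decoder_input = np.concatenate(
    [rhythm_code, align_to_rhythm(content_code, 120), align_to_rhythm(pitch_code, 120)],
    axis=1,                               # stack along the feature axis
)
print(decoder_input.shape)                # (120, 28)
```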
You can understand it by reading the code for pitch conversion.
With the code released on GitHub and the extensive discussion and description in the paper, one should be able to reproduce it with very little effort.