from onmt_modules.misc import sequence_mask
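For context, here is a minimal self-contained sketch of what an OpenNMT-style sequence_mask computes. The signature is an assumption based on onmt's misc module, not copied from this repo:

```python
import torch

def sequence_mask(lengths, max_len=None):
    """Boolean mask that is True for valid (non-padded) positions.

    Assumed OpenNMT-style behavior: lengths is a 1-D tensor of
    per-sequence lengths; the output has shape (batch, max_len).
    """
    if max_len is None:
        max_len = int(lengths.max())
    positions = torch.arange(max_len, device=lengths.device)
    return positions.unsqueeze(0) < lengths.unsqueeze(1)

# Example: batch of lengths 3 and 1, padded to length 4
print(sequence_mask(torch.tensor([3, 1]), max_len=4))
# tensor([[ True,  True,  True, False],
#         [ True, False, False, False]])
```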
num_mels: 80
fmin: 90
fmax: 7600
fft_size: 1024
hop_size: 256
min_level_db: -100
ref_level_db: 16
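For context, here is a sketch of how these parameters would be used to extract a mel spectrogram. It is librosa-based and written in the style of r9y9's preprocessing, not code from this repo; the 16 kHz sample rate is an assumption. Normalization and clipping to [0, 1] are covered further below.

```python
import librosa
import numpy as np

# Hyperparameters from above
num_mels, fmin, fmax = 80, 90, 7600
fft_size, hop_size = 1024, 256
ref_level_db = 16

def melspectrogram_db(wav, sr=16000):
    # Linear-frequency STFT magnitude
    d = np.abs(librosa.stft(wav, n_fft=fft_size, hop_length=hop_size))
    # Mel filterbank limited to [fmin, fmax]
    mel_basis = librosa.filters.mel(sr=sr, n_fft=fft_size, n_mels=num_mels,
                                    fmin=fmin, fmax=fmax)
    mel = np.dot(mel_basis, d)
    # Amplitude to dB, referenced to ref_level_db
    return 20 * np.log10(np.maximum(1e-5, mel)) - ref_level_db
```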
The embedding in metadata.pkl should be a vector of length 256. The N you got might be the number of speakers.
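To check what you have, a quick snippet for inspecting the embedding stored in metadata.pkl. The list-of-entries layout assumed here follows AutoVC's demo metadata, with the embedding as the second field:

```python
import pickle

with open('metadata.pkl', 'rb') as f:
    metadata = pickle.load(f)

# Assumed entry layout: [speaker_name, speaker_embedding, ...]
for entry in metadata:
    print(entry[0], entry[1].shape)  # expect (256,) per speaker
```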
You can average all the d-vectors without normalization.
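For example, a minimal sketch of building the per-speaker embedding by averaging per-utterance d-vectors. The random vectors are stand-ins to keep the sketch runnable; real ones come from the pretrained speaker encoder, each of length 256:

```python
import numpy as np

# Stand-in: per-utterance d-vectors from the speaker encoder
utterance_dvecs = [np.random.randn(256) for _ in range(10)]

# Average without normalizing the individual d-vectors
speaker_embedding = np.mean(utterance_dvecs, axis=0)
assert speaker_embedding.shape == (256,)
```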
The details are described in the paper.
You are right. In this case, you have to retrain the model using your speaker embeddings.
Clip to [0,1]
@xw1324832579 You can use one-hot embeddings if you are not doing zero-shot conversion. Retraining takes less than 12 hours on a single GPU.
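As an illustration, a minimal sketch of one-hot speaker embeddings. The speaker count is an assumption; in this setting the embedding dimension equals the number of speakers rather than 256:

```python
import numpy as np

num_speakers = 20  # assumed size of your closed speaker set

def one_hot_embedding(speaker_idx):
    # One-hot vector identifying the speaker; only valid for
    # speakers seen at training time (no zero-shot conversion)
    emb = np.zeros(num_speakers, dtype=np.float32)
    emb[speaker_idx] = 1.0
    return emb

print(one_hot_embedding(3))
```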
They don't have to be the same.
@liveroomand Looks fine. You can refer to r9y9's wavenet vocoder for more details on spectrogram normalization and clipping.
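For reference, a sketch of the dB normalization and [0, 1] clipping in the style of r9y9's wavenet_vocoder preprocessing, using min_level_db from the hyperparameters above. This is an interpretation, not code copied from that repo:

```python
import numpy as np

min_level_db = -100

def normalize(spec_db):
    # Map [min_level_db, 0] dB to [0, 1] and clip out-of-range values
    return np.clip((spec_db - min_level_db) / -min_level_db, 0, 1)

def denormalize(spec):
    # Inverse mapping back to dB, for the vocoder input side
    return np.clip(spec, 0, 1) * -min_level_db + min_level_db
```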