from onmt_modules.misc import sequence_mask
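For context, here is a minimal self-contained sketch of what an OpenNMT-style sequence_mask computes. The signature is an assumption based on onmt's misc module, not copied from this repo:

```python
import torch

def sequence_mask(lengths, max_len=None):
    """Boolean mask that is True for valid (non-padded) positions.

    Assumed OpenNMT-style behavior: lengths is a 1-D tensor of
    per-sequence lengths; the output has shape (batch, max_len).
    """
    if max_len is None:
        max_len = int(lengths.max())
    positions = torch.arange(max_len, device=lengths.device)
    return positions.unsqueeze(0) < lengths.unsqueeze(1)

# Example: batch of lengths 3 and 1, padded to length 4
print(sequence_mask(torch.tensor([3, 1]), max_len=4))
# tensor([[ True,  True,  True, False],
#         [ True, False, False, False]])
```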
num_mels: 80
fmin: 90
fmax: 7600
fft_size: 1024
hop_size: 256
min_level_db: -100
ref_level_db: 16
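For context, here is a sketch of how these parameters would be used to extract a mel spectrogram. It is librosa-based and written in the style of r9y9's preprocessing, not code from this repo; the 16 kHz sample rate is an assumption. Normalization and clipping to [0, 1] are covered further below.

```python
import librosa
import numpy as np

# Hyperparameters from above
num_mels, fmin, fmax = 80, 90, 7600
fft_size, hop_size = 1024, 256
ref_level_db = 16

def melspectrogram_db(wav, sr=16000):
    # Linear-frequency STFT magnitude
    d = np.abs(librosa.stft(wav, n_fft=fft_size, hop_length=hop_size))
    # Mel filterbank limited to [fmin, fmax]
    mel_basis = librosa.filters.mel(sr=sr, n_fft=fft_size, n_mels=num_mels,
                                    fmin=fmin, fmax=fmax)
    mel = np.dot(mel_basis, d)
    # Amplitude to dB, referenced to ref_level_db
    return 20 * np.log10(np.maximum(1e-5, mel)) - ref_level_db
```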
The embedding in metadata.pkl should be a vector of length 256. The N you got might be the number of speakers.
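To check what you have, a quick snippet for inspecting the embedding stored in metadata.pkl. The list-of-entries layout assumed here follows AutoVC's demo metadata, with the embedding as the second field:

```python
import pickle

with open('metadata.pkl', 'rb') as f:
    metadata = pickle.load(f)

# Assumed entry layout: [speaker_name, speaker_embedding, ...]
for entry in metadata:
    print(entry[0], entry[1].shape)  # expect (256,) per speaker
```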
You can average all the d-vectors without normalization.
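For example, a minimal sketch of building the per-speaker embedding by averaging per-utterance d-vectors. The random vectors are stand-ins to keep the sketch runnable; real ones come from the pretrained speaker encoder, each of length 256:

```python
import numpy as np

# Stand-in: per-utterance d-vectors from the speaker encoder
utterance_dvecs = [np.random.randn(256) for _ in range(10)]

# Average without normalizing the individual d-vectors
speaker_embedding = np.mean(utterance_dvecs, axis=0)
assert speaker_embedding.shape == (256,)
```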
The details are described in the paper.
You are right. In this case, you have to retrain the model using your speaker embeddings.
Clip to [0,1]
@xw1324832579 You can use one-hot embeddings if you are not doing zero-shot conversion. Retraining takes less than 12 hours on a single GPU.
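As an illustration, a minimal sketch of one-hot speaker embeddings. The speaker count is an assumption; in this setting the embedding dimension equals the number of speakers rather than 256:

```python
import numpy as np

num_speakers = 20  # assumed size of your closed speaker set

def one_hot_embedding(speaker_idx):
    # One-hot vector identifying the speaker; only valid for
    # speakers seen at training time (no zero-shot conversion)
    emb = np.zeros(num_speakers, dtype=np.float32)
    emb[speaker_idx] = 1.0
    return emb

print(one_hot_embedding(3))
```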
They don't have to be the same.
@liveroomand Looks fine. You can refer to r9y9's wavenet vocoder for more details on spectrogram normalization and clipping.
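For reference, a sketch of the dB normalization and [0, 1] clipping in the style of r9y9's wavenet_vocoder preprocessing, using min_level_db from the hyperparameters above. This is an interpretation, not code copied from that repo:

```python
import numpy as np

min_level_db = -100

def normalize(spec_db):
    # Map [min_level_db, 0] dB to [0, 1] and clip out-of-range values
    return np.clip((spec_db - min_level_db) / -min_level_db, 0, 1)

def denormalize(spec):
    # Inverse mapping back to dB, for the vocoder input side
    return np.clip(spec, 0, 1) * -min_level_db + min_level_db
```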