Disong Wang
Hi, normalizing f0 aims to remove speaker characteristics. During the preprocessing phase, f0 is not normalized, but during training and inference it is normalized, as shown here: https://github.com/Wendison/VQMIVC/blob/851b4f5ca5bb60c11fea6a618affeb4979b17cf3/dataset.py#L53 https://github.com/Wendison/VQMIVC/blob/851b4f5ca5bb60c11fea6a618affeb4979b17cf3/convert_example.py#L57
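For reference, the idea behind the linked normalization can be sketched as per-utterance mean/variance normalization over voiced frames only (a hedged sketch, not a verbatim copy of the repo's code; variable names are illustrative):

```python
import numpy as np

def normalize_f0(f0: np.ndarray) -> np.ndarray:
    """Normalize f0 per utterance over voiced frames (f0 > 0),
    leaving unvoiced frames at zero, so speaker-dependent pitch
    level and range are removed."""
    out = f0.astype(np.float64).copy()
    voiced = out > 0
    if voiced.any():
        mean = out[voiced].mean()
        std = out[voiced].std()
        if std > 0:
            out[voiced] = (out[voiced] - mean) / std
    return out

f0 = np.array([0.0, 180.0, 200.0, 220.0, 0.0])  # toy contour, 0 = unvoiced
norm = normalize_f0(f0)
```

After normalization the voiced frames have zero mean and unit variance, while unvoiced frames stay at zero.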
The perplexity should increase during training, as higher perplexity indicates that the vectors in the VQ codebook are distinguishable and can be used to represent different acoustic units. I...
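To make the metric concrete: codebook perplexity is usually computed as the exponential of the entropy of the average codeword-usage distribution, so it ranges from 1 (one codeword used) up to the codebook size (all codewords used equally). A minimal sketch (not the repo's exact code):

```python
import numpy as np

def codebook_perplexity(encodings: np.ndarray) -> float:
    """encodings: one-hot matrix of shape (num_frames, codebook_size)
    indicating which codeword each frame was quantized to.
    Perplexity = exp(H(avg_probs))."""
    avg_probs = encodings.mean(axis=0)
    entropy = -np.sum(avg_probs * np.log(avg_probs + 1e-10))
    return float(np.exp(entropy))

# Codebook collapse: every frame maps to one codeword -> perplexity ~ 1
collapsed = np.zeros((8, 4)); collapsed[:, 0] = 1.0
# Frames spread evenly over 4 codewords -> perplexity ~ 4
even = np.eye(4).repeat(2, axis=0)
```

Rising perplexity therefore means the model is actually spreading frames across more codebook entries rather than collapsing onto a few.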
Hi, I think the lld losses are normal; you could train for more epochs and listen to the converted samples to verify whether your training is successful.
1. I don't remember the exact value of each loss, but I think your losses look normal compared to those shown in https://github.com/Wendison/VQMIVC/issues/15#issue-1025051309 2. Based on my experience,...
Hi, based on my experience, using the same mel statistics for the vocoder and the VC model leads to better voice quality, so for your questions: 1) I think that training a...
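The point about shared mel statistics can be sketched as follows: both the VC model and the vocoder should apply the same mean/variance normalization, so the VC model's denormalized output lands in the distribution the vocoder was trained on. A minimal illustration (the random "mels" and shapes here are placeholders; in practice the mean/std would be computed once over the training set and loaded by both models):

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for log-mel features of the training set: (frames, 80 mel bins)
train_mels = rng.normal(loc=-4.0, scale=2.0, size=(1000, 80))

# Shared statistics, computed once and reused by BOTH models
mel_mean = train_mels.mean(axis=0)
mel_std = train_mels.std(axis=0)

def normalize(mel: np.ndarray) -> np.ndarray:
    return (mel - mel_mean) / mel_std

def denormalize(mel: np.ndarray) -> np.ndarray:
    return mel * mel_std + mel_mean

restored = denormalize(normalize(train_mels[:10]))
```

If the two models used different statistics, `denormalize` on the vocoder side would not invert `normalize` on the VC side, shifting the mels the vocoder sees.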
I haven't encountered this problem when running the scripts for training PWG. Maybe you can try using VCTK as your training dataset and following the original training steps,...
Hi, you can use `sox` or `ffmpeg` to change the sampling rate of wavs, e.g., `sox {original_24kHz.wav} -r 16000 {converted.wav}`
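To resample a whole folder of wavs, the same `sox` call can be wrapped in a small script; a sketch, assuming `sox` is on PATH (the command construction mirrors the one-liner above):

```python
import subprocess
from pathlib import Path

def sox_resample_cmd(src: Path, dst: Path, rate: int = 16000) -> list:
    """Build the command equivalent to: sox {src} -r {rate} {dst}"""
    return ["sox", str(src), "-r", str(rate), str(dst)]

def resample_dir(in_dir: str, out_dir: str, rate: int = 16000) -> None:
    """Resample every .wav in in_dir into out_dir at the given rate."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for wav in sorted(Path(in_dir).glob("*.wav")):
        subprocess.run(sox_resample_cmd(wav, out / wav.name, rate), check=True)
```

`check=True` makes the script fail loudly if `sox` rejects a file instead of silently skipping it.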
Hi, all three variables relate to the content encoder: z_dim denotes the dimension of the acoustic units (z) in the VQ codebook, c_dim denotes the dimension of the continuous vectors after the LSTM...
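To illustrate where z_dim figures in: each codebook entry is a z_dim-dimensional vector, and quantization maps each frame of the encoder's continuous output to its nearest entry. A hedged NumPy sketch (the codebook size and dimensions are illustrative, not the repo's config):

```python
import numpy as np

rng = np.random.default_rng(0)
n_codes, z_dim = 512, 64                     # illustrative values
codebook = rng.normal(size=(n_codes, z_dim))  # (codebook size, z_dim)

def quantize(z_cont: np.ndarray) -> np.ndarray:
    """Map each frame (row of z_cont, shape (T, z_dim)) to its
    nearest codebook entry by Euclidean distance."""
    # (T, n_codes) squared-distance matrix via broadcasting
    d = ((z_cont[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    idx = d.argmin(axis=1)
    return codebook[idx]

z = rng.normal(size=(10, z_dim))  # fake continuous encoder output
zq = quantize(z)                  # each row is now a codebook entry
```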
128 is the number of mel-spectrogram frames used for training; it corresponds to 1.28s of waveform.
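That figure follows from the STFT hop size: assuming 16 kHz audio and a 160-sample hop (10 ms per frame), 128 frames cover 128 × 0.010 = 1.28 s:

```python
sample_rate = 16000  # Hz (assumed)
hop_length = 160     # samples per frame -> 10 ms
n_frames = 128

duration_s = n_frames * hop_length / sample_rate  # -> 1.28
```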
Below is the training process with 1 GPU and 4 GPUs respectively.
1 GPU: (training-log screenshot)
4 GPUs: (training-log screenshot)
It seems that the training with...