Disong Wang
Hi, normalizing f0 aims to remove speaker characteristics. During the preprocessing phase, f0 is not normalized, but during training and inference it is normalized, as shown here: https://github.com/Wendison/VQMIVC/blob/851b4f5ca5bb60c11fea6a618affeb4979b17cf3/dataset.py#L53 https://github.com/Wendison/VQMIVC/blob/851b4f5ca5bb60c11fea6a618affeb4979b17cf3/convert_example.py#L57
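For reference, the idea behind the linked normalization can be sketched as per-utterance mean/variance normalization over voiced frames only (a hedged sketch, not a verbatim copy of the repo's code; variable names are illustrative):

```python
import numpy as np

def normalize_f0(f0: np.ndarray) -> np.ndarray:
    """Normalize f0 per utterance over voiced frames (f0 > 0),
    leaving unvoiced frames at zero, so speaker-dependent pitch
    level and range are removed."""
    out = f0.astype(np.float64).copy()
    voiced = out > 0
    if voiced.any():
        mean = out[voiced].mean()
        std = out[voiced].std()
        if std > 0:
            out[voiced] = (out[voiced] - mean) / std
    return out

f0 = np.array([0.0, 180.0, 200.0, 220.0, 0.0])  # toy contour, 0 = unvoiced
norm = normalize_f0(f0)
```

After normalization the voiced frames have zero mean and unit variance, while unvoiced frames stay at zero.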
The perplexity should increase during training, as higher perplexity indicates that the vectors in the VQ codebook are distinguishable and can be used to represent different acoustic units. I...
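To make the metric concrete: codebook perplexity is usually computed as the exponential of the entropy of the average codeword-usage distribution, so it ranges from 1 (one codeword used) up to the codebook size (all codewords used equally). A minimal sketch (not the repo's exact code):

```python
import numpy as np

def codebook_perplexity(encodings: np.ndarray) -> float:
    """encodings: one-hot matrix of shape (num_frames, codebook_size)
    indicating which codeword each frame was quantized to.
    Perplexity = exp(H(avg_probs))."""
    avg_probs = encodings.mean(axis=0)
    entropy = -np.sum(avg_probs * np.log(avg_probs + 1e-10))
    return float(np.exp(entropy))

# Codebook collapse: every frame maps to one codeword -> perplexity ~ 1
collapsed = np.zeros((8, 4)); collapsed[:, 0] = 1.0
# Frames spread evenly over 4 codewords -> perplexity ~ 4
even = np.eye(4).repeat(2, axis=0)
```

Rising perplexity therefore means the model is actually spreading frames across more codebook entries rather than collapsing onto a few.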
Hi, I think the lld losses are normal; you could train for more epochs and listen to the converted samples to verify whether your training is successful.
1. I don't remember the exact value of each loss, but I think your losses look normal compared to those shown in https://github.com/Wendison/VQMIVC/issues/15#issue-1025051309 2. Based on my experience,...
Hi, based on my experience, using the same mel statistics for the vocoder and the VC model leads to better voice quality, so for your questions: 1) I think that training a...
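The point about shared mel statistics can be sketched as follows: both the VC model and the vocoder should apply the same mean/variance normalization, so the VC model's denormalized output lands in the distribution the vocoder was trained on. A minimal illustration (the random "mels" and shapes here are placeholders; in practice the mean/std would be computed once over the training set and loaded by both models):

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for log-mel features of the training set: (frames, 80 mel bins)
train_mels = rng.normal(loc=-4.0, scale=2.0, size=(1000, 80))

# Shared statistics, computed once and reused by BOTH models
mel_mean = train_mels.mean(axis=0)
mel_std = train_mels.std(axis=0)

def normalize(mel: np.ndarray) -> np.ndarray:
    return (mel - mel_mean) / mel_std

def denormalize(mel: np.ndarray) -> np.ndarray:
    return mel * mel_std + mel_mean

restored = denormalize(normalize(train_mels[:10]))
```

If the two models used different statistics, `denormalize` on the vocoder side would not invert `normalize` on the VC side, shifting the mels the vocoder sees.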
I haven't encountered this problem when running the scripts for training PWG. Maybe you can try using VCTK as your training dataset and following the original training steps,...
Hi, you can use `sox` or `ffmpeg` to change the sampling rate of wavs, e.g., `sox {original_24kHz.wav} -r 16000 {converted.wav}`
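To resample a whole folder of wavs, the same `sox` call can be wrapped in a small script; a sketch, assuming `sox` is on PATH (the command construction mirrors the one-liner above):

```python
import subprocess
from pathlib import Path

def sox_resample_cmd(src: Path, dst: Path, rate: int = 16000) -> list:
    """Build the command equivalent to: sox {src} -r {rate} {dst}"""
    return ["sox", str(src), "-r", str(rate), str(dst)]

def resample_dir(in_dir: str, out_dir: str, rate: int = 16000) -> None:
    """Resample every .wav in in_dir into out_dir at the given rate."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for wav in sorted(Path(in_dir).glob("*.wav")):
        subprocess.run(sox_resample_cmd(wav, out / wav.name, rate), check=True)
```

`check=True` makes the script fail loudly if `sox` rejects a file instead of silently skipping it.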
Hi, all three variables relate to the content encoder: z_dim denotes the dimension of the acoustic units (z) in the VQ codebook, c_dim denotes the dimension of the continuous vectors after the LSTM...
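To illustrate where z_dim figures in: each codebook entry is a z_dim-dimensional vector, and quantization maps each frame of the encoder's continuous output to its nearest entry. A hedged NumPy sketch (the codebook size and dimensions are illustrative, not the repo's config):

```python
import numpy as np

rng = np.random.default_rng(0)
n_codes, z_dim = 512, 64                     # illustrative values
codebook = rng.normal(size=(n_codes, z_dim))  # (codebook size, z_dim)

def quantize(z_cont: np.ndarray) -> np.ndarray:
    """Map each frame (row of z_cont, shape (T, z_dim)) to its
    nearest codebook entry by Euclidean distance."""
    # (T, n_codes) squared-distance matrix via broadcasting
    d = ((z_cont[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    idx = d.argmin(axis=1)
    return codebook[idx]

z = rng.normal(size=(10, z_dim))  # fake continuous encoder output
zq = quantize(z)                  # each row is now a codebook entry
```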
128 is the number of mel-spectrogram frames used for training; it corresponds to 1.28s of waveform.
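That figure follows from the STFT hop size: assuming 16 kHz audio and a 160-sample hop (10 ms per frame), 128 frames cover 128 × 0.010 = 1.28 s:

```python
sample_rate = 16000  # Hz (assumed)
hop_length = 160     # samples per frame -> 10 ms
n_frames = 128

duration_s = n_frames * hop_length / sample_rate  # -> 1.28
```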
Below is the training process with 1 GPU and 4 GPUs respectively.
1 GPU: (training-log screenshot)
4 GPUs: (training-log screenshot)
It seems that the training with...