
Interpreting results

JRMeyer opened this issue 5 years ago · 15 comments

Hi @jxzhanggg,

I think our discussion will be interesting to others, so I'm posting this as a Github issue. If there's another place to better discuss this, let me know.

I would like to hear your thoughts on the results I've gotten from VCTK so far. They're promising, but definitely don't sound as good as what your demo page shows. I've pre-trained on VCTK, and now I'm inspecting the output of pre-trained/inference.py.

Training Info

  • Trained on 94 of the VCTK speakers
  • Batch size of 16
  • Single GPU
  • did not use spectrograms (only mel-spectrograms)
  • did not use mean / std normalization
  • Trained for 413,000 iterations (resulting in checkpoint_413000)

Inference Info

  • Griffin-Lim vocoder (see the mel-inversion sketch after this list)
  • did not use spectrograms (only mel-spectrograms)
  • did not use mean / std normalization
  • tested on 2 VCTK speakers unseen in training (but did appear in the validation set)
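
For reference, the mel-to-waveform inversion I'm doing is roughly the following. This is only a sketch, assuming librosa >= 0.7; the sample rate, FFT parameters, log scale, and file names are my illustrative guesses, not the repo's actual hparams.

```python
import numpy as np
import librosa
import soundfile as sf

# Hypothetical predicted mel-spectrogram of shape (80, T), stored in natural-log scale.
log_mel = np.load("predicted_mel.npy")

# Undo the log compression (this depends on how the features were extracted).
mel = np.exp(log_mel)

# Invert mel -> audio; librosa runs Griffin-Lim internally to estimate the phase.
# power=1.0 tells librosa to treat the input as a magnitude (not power) mel-spectrogram.
audio = librosa.feature.inverse.mel_to_audio(
    mel, sr=16000, n_fft=1024, hop_length=256, power=1.0, n_iter=60
)

sf.write("reconstructed.wav", audio, 16000)
```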

Results

  • I can hear a muffled human voice, but it is not clear enough to understand
  • alignment looks promising, but not complete

What are your thoughts on this? How can I achieve a better result?

Thank you!

Wav__ref_p374_VC.zip Ali__ref_p374_VC.pdf Hid__ref_p374_VC.pdf

JRMeyer avatar Jan 17 '20 00:01 JRMeyer

Hi, I listened to your results, and they don't sound good. The decoder alignment actually hasn't converged successfully, so the model can't generate intelligible speech. I believe the mean/std normalization is an important step in helping the model converge, so it's advisable to add it. For your reference, in my experiments the alignment converged within the first 3k training steps. Here are some of my samples from the inference stage: samples.zip
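
Roughly, the normalization I mean looks like this. It's only a minimal sketch; the statistics file name and shapes are assumptions for illustration, not the repo's actual code.

```python
import numpy as np

# Assumes global per-bin statistics were precomputed over the training set
# (see the running-statistics sketch later in this thread); names are illustrative.
mel_mean, mel_std = np.load("mel_mean_std.npy")   # each of shape (80,)

def normalize(mel):
    """Apply global mean/std normalization to a (T, 80) mel-spectrogram."""
    return (mel - mel_mean) / (mel_std + 1e-8)

def denormalize(mel_norm):
    """Undo the normalization on model outputs before handing them to the vocoder."""
    return mel_norm * (mel_std + 1e-8) + mel_mean
```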

jxzhanggg avatar Jan 20 '20 03:01 jxzhanggg

Hi, I used a model at step 59,000, and the VC total loss had dropped to around 1.2, but all inference samples come out almost null. They look like this: (see attached image). My question is: how many steps does it take to train the model, and what should the loss look like?

youngsuenXMLY avatar Feb 26 '20 11:02 youngsuenXMLY

I get almost the same results as yours, @JRMeyer. Have you solved the problem?

youngsuenXMLY avatar Mar 03 '20 02:03 youngsuenXMLY

My test results: test_samples.zip

youngsuenXMLY avatar Mar 03 '20 02:03 youngsuenXMLY

I recommend filtering out some of the long sentences in your training dataset and trying the suggestion from here
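
Something along these lines would do. This is just a sketch; the file-list format and directory layout are illustrative assumptions, not the repo's actual files.

```python
import numpy as np

MAX_FRAMES = 800  # drop utterances longer than this many mel frames

# Assumes a plain file list with one mel .npy path (shape (T, 80)) per line.
with open("train_list.txt") as f:
    paths = [line.strip() for line in f if line.strip()]

# Keep only utterances whose mel-spectrogram is short enough.
kept = [p for p in paths if np.load(p).shape[0] <= MAX_FRAMES]

with open("train_list_filtered.txt", "w") as f:
    f.write("\n".join(kept) + "\n")
```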

jxzhanggg avatar Mar 06 '20 17:03 jxzhanggg

In the pre-train folder, I use a decay rate of 0.95 at each epoch and discard training samples whose frame length is longer than 800. The inferred results are beginning to make sense but still sound unnatural. samples_55000_loss0.94.zip What might the problem be?
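
For reference, the per-epoch decay I mean is roughly the following. It's a minimal sketch with a stand-in model and optimizer, not the repo's actual training loop.

```python
import torch

model = torch.nn.Linear(80, 80)  # stand-in for the actual model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Multiply the learning rate by 0.95 once per epoch.
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.95)

for epoch in range(100):
    # ... run one epoch of training here ...
    scheduler.step()
```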

youngsuenXMLY avatar Mar 10 '20 06:03 youngsuenXMLY

Hi, the alignment didn't converge, so the model is unable to generate meaningful sounds. That's strange; have you trained the model with a large enough training set? The batch size should also be large enough (preferably >= 32) to help the alignment converge. Could you provide more details about your training?

> In the pre-train folder, I use a decay rate of 0.95 at each epoch and discard training samples whose frame length is longer than 800. The inferred results are beginning to make sense but still sound unnatural. samples_55000_loss0.94.zip

jxzhanggg avatar Mar 10 '20 16:03 jxzhanggg

Hi, in the feature extraction process, I trimmed silence using librosa.trim and used the 80-dimensional mel-spectrogram as specified in hparams.py. The text looks like this: [image] The mean and variance are calculated using a running method, so they are global statistics. [image] The mean and variance look like this: [image]
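
The running computation is roughly like this. This is a single-pass sketch with an illustrative directory layout, not my exact script.

```python
import numpy as np
from glob import glob

# Running (single-pass) global statistics over all training mels, per mel bin.
count = 0
sum_x = np.zeros(80)
sum_x2 = np.zeros(80)

for path in glob("mels/train/*.npy"):   # illustrative layout
    mel = np.load(path)                 # shape (T, 80)
    count += mel.shape[0]
    sum_x += mel.sum(axis=0)
    sum_x2 += (mel ** 2).sum(axis=0)

mean = sum_x / count
std = np.sqrt(sum_x2 / count - mean ** 2)   # note: can be numerically unstable for huge datasets
```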

youngsuenXMLY avatar Mar 11 '20 02:03 youngsuenXMLY

In pre-train/model/layers.py, lines 353-354, I changed the code to self.initialize_decoder_states(memory, mask=(1 - get_mask_from_lengths(memory_lengths))), because I found that ~ is a bitwise NOT, so ~1 gives 254.

youngsuenXMLY avatar Mar 11 '20 03:03 youngsuenXMLY

I can't get any possible difference from the source code. So would you please send me a copy of your training text and phn files. @jxzhanggg

youngsuenXMLY avatar Mar 11 '20 05:03 youngsuenXMLY

> In pre-train/model/layers.py, lines 353-354, I changed the code to self.initialize_decoder_states(memory, mask=(1 - get_mask_from_lengths(memory_lengths))), because I found that ~ is a bitwise NOT, so ~1 gives 254.

I tested the code in Python 3.5 and torch 1.4, and that's true. It's a strange bug, because ~1 gives the correct 0 when using Python 2.7 and torch 1.0.1. I'm not sure whether it's caused by the Python version or the torch version; what is your experiment environment? A wrong mask like this will definitely break the model, so I suspect there are other unrecognized bugs like this behind the failed experiments. As for the training lists, I'm glad to provide them, but I can't access those files these days (I'm not at the university now). And I believe it's not the phn files' fault.
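
For reference, a tiny standalone check of the ~ behavior being discussed. This is only an illustration against recent PyTorch (>= 1.2) semantics, not code from the repo.

```python
import torch

lengths = torch.tensor([3, 5])
ids = torch.arange(5)

# Old-style uint8 mask, as older PyTorch code typically produced.
mask_u8 = (ids[None, :] < lengths[:, None]).to(torch.uint8)

print(~mask_u8)         # bitwise NOT on uint8: 1 -> 254, 0 -> 255 (the bug discussed above)
print(1 - mask_u8)      # the workaround: 1 -> 0, 0 -> 1
print(~mask_u8.bool())  # casting to bool makes ~ a proper logical NOT again
```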

jxzhanggg avatar Mar 11 '20 10:03 jxzhanggg

I conducted the experiment on Ubuntu 16.04, using PyTorch 1.3.1 and Python 3.7. For a boolean-type variable, ~True gives False and ~False gives True. I will debug. Please send me a copy of the text and phn files; [email protected] and [email protected] are both fine. Thank you.

youngsuenXMLY avatar Mar 12 '20 02:03 youngsuenXMLY

After fixing the bitwise NOT (~), the model began to converge to reasonable speech. One problem is that the inferred result doesn't keep the speaking style from the speaker embeddings, which means the style is not well disentangled. I will try some experiments to disentangle style and content based on your work. Thank you very much for your patient replies.

youngsuenXMLY avatar Mar 13 '20 08:03 youngsuenXMLY

> After fixing the bitwise NOT (~), the model began to converge to reasonable speech. One problem is that the inferred result doesn't keep the speaking style from the speaker embeddings, which means the style is not well disentangled. I will try some experiments to disentangle style and content based on your work. Thank you very much for your patient replies.

Hi, I'm glad you got it working.

jxzhanggg avatar Mar 13 '20 10:03 jxzhanggg

Have you tried a VAE loss to further disentangle the content embedding from the speaker embedding?
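
To clarify what I mean, a rough sketch of a KL term on the content encoder's output; the shapes and names here are entirely hypothetical, not from this repo.

```python
import torch

def vae_content_loss(mu, logvar):
    """KL divergence between N(mu, sigma^2) and a standard normal prior,
    averaged over the batch; pushing content codes toward the prior is one
    way to limit how much speaker information they can carry."""
    return -0.5 * torch.mean(torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1))

# Hypothetical usage: the content encoder predicts mu/logvar, and the decoder
# consumes a reparameterized sample z.
mu = torch.randn(16, 128)        # stand-in for encoder outputs
logvar = torch.randn(16, 128)
z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization trick
kl = vae_content_loss(mu, logvar)
```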

youngsuenXMLY avatar Mar 30 '20 12:03 youngsuenXMLY