
Reproducing parallel results

Open mortont opened this issue 6 years ago • 42 comments

I've been trying to reproduce the parallel wavenet results; however, I'm running into some issues with training the teacher model. I have trained it on the LJ Speech dataset with the default wavenet_mol.json configuration to ~280k steps (all other hparams are unchanged as well). The loss looks good, but the evaluated speech is just babbling, as if local conditioning wasn't used.

I didn't see anything immediately apparent as to why this is happening, do you have any ideas?

mortont avatar Jul 12 '18 19:07 mortont

Hi @mortont Could you provide more information? For example, the loss curve, some generated samples and the code commit.

bfs18 avatar Jul 13 '18 12:07 bfs18

Sure, I'm on commit 67eacb995aef465d1e2ed810f25e0d7d3899e9b6, and this is the loss curve: (loss curve screenshot, 2018-07-13)

This is one of the generated samples: gen_LJ001-0028.wav.zip For reference, this is the original that was used for the mel conditioning: LJ001-0028.wav.zip

What loss did you let the teacher get to in your examples? Wondering if this just needs more training time. Let me know if any other information would be helpful.

mortont avatar Jul 13 '18 13:07 mortont

In my experience, 200k training steps is not enough if a small batch size such as 2 is used, because the model hasn't seen enough data. What's your batch size? Also, please ensure USE_RESIZE_CONV=False in masked.py. This is my loss curve: (loss curve image)

bfs18 avatar Jul 13 '18 14:07 bfs18

Ah, thank you. My batch size is 2 so I'll continue training. I'll go ahead and close this issue and report back when I have good samples for posterity.

mortont avatar Jul 13 '18 14:07 mortont

Hi @mortont, I'm sorry that the wavenet results cannot be reproduced with the default config json. Recently I could not reproduce the previous result either, and I finally figured out that weight normalization harms the wavenet performance. So setting use_weight_norm=false in wavenet_mol.json may solve your problem. Initially weight normalization produced some promising results for parallel wavenet, so I kept it as a default configuration. It probably has negative effects on the learned mel-spectrum representation, and a good mel-spectrum representation is vital for a good model, since the wave sample points totally depend on it. I'm also doing similar tests on parallel wavenet. Once again, I'm sorry.

bfs18 avatar Jul 20 '18 18:07 bfs18

Good catch @bfs18, thank you! I'll try training with use_weight_norm as false. Unfortunately my training is a bit slow with a single GPU. Do you have a pretrained teacher model to share? I'd be happy to help on any parts of parallel wavenet once I have a good teacher.

mortont avatar Jul 20 '18 20:07 mortont

Actually, I just saw the updated readme... I'll check that out now.

mortont avatar Jul 20 '18 20:07 mortont

Hi, I tested the code; setting use_weight_norm=False solves the problem. I also implemented http://export.arxiv.org/abs/1807.07281 (ClariNet) this weekend. I will update the code after some testing.

bfs18 avatar Jul 22 '18 15:07 bfs18

I am getting an error when loading the pre-trained model https://drive.google.com/open?id=13rHT6zr2sXeedmjUOpp6IVQdT30cy66_

  File "/scratch/spack/opt/spack/linux-centos7-x86_64/gcc-4.8.5/tensorflow-1.6.0-wm4rkg4qsenbhfdz7fzlopz6qcxckfff/lib/python3.4/site-packages/tensorflow/python/training/saver.py", line 1812, in latest_checkpoint
    if file_io.get_matching_files(v2_path) or file_io.get_matching_files(
  File "/scratch/spack/opt/spack/linux-centos7-x86_64/gcc-4.8.5/tensorflow-1.6.0-wm4rkg4qsenbhfdz7fzlopz6qcxckfff/lib/python3.4/site-packages/tensorflow/python/lib/io/file_io.py", line 337, in get_matching_files
    for single_filename in filename
  File "/scratch/spack/opt/spack/linux-centos7-x86_64/gcc-4.8.5/tensorflow-1.6.0-wm4rkg4qsenbhfdz7fzlopz6qcxckfff/lib/python3.4/site-packages/tensorflow/python/framework/errors_impl.py", line 516, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.NotFoundError: /data/logs-wavenet/eval_ckpt/ns_pwn-n_MU-WN_DDI_mfinit-n_LOGS-n_CLIP-MAG-L2-06_27; No such file or directory

Any idea?

zhang-jian avatar Jul 23 '18 10:07 zhang-jian

Hi @zhang-jian, I'm sorry for that. The first line of ns_pwn-eval/checkpoint contains the absolute path of the model checkpoint. You can modify the path according to your file system, or you can keep only the basename of that path. Refer to ns_wn-eval/checkpoint for an example.
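For illustration, the checkpoint file is plain text, so after keeping only the basename it would look something like this (the step number below is a placeholder, not the actual value in the pretrained package):

model_checkpoint_path: "model.ckpt-<step>"
all_model_checkpoint_paths: "model.ckpt-<step>"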

bfs18 avatar Jul 23 '18 14:07 bfs18

I was able to confirm that the pretrained model produces a recognizable voice, all that was needed was changing the path in the checkpoint file to a relative one. Great work!

@bfs18 ClariNet looks very interesting, mostly because of the more numerically stable implementation. I've noticed that parallel wavenet optimization is very difficult and unstable, so hopefully ClariNet helps with that.

mortont avatar Jul 23 '18 21:07 mortont

I got some new examples from a run with contrastive loss and without weight normalization at step 70948. The result may improve a bit with longer training. (generated sample: gen_LJ001-0001)

(loss curves: loss1, loss2, loss3, loss4)

bfs18 avatar Jul 30 '18 02:07 bfs18

I looked deeper into the problem. Some bad configurations (e.g. weight normalization + tanh transposed-conv activation) may cause the activations of the transposed convolution layers to saturate, so the mel condition becomes meaningless and the model degenerates to an unconditional one. The following figures are the histogram and spectrum of the transposed convolution stack output; this model only generates random speech even though the mel condition is used. (histogram and spectrum of the saturated model) In contrast, the following figures come from an OK model. (histogram and spectrum of the OK model) Most of the activation values are close to 0, so the learned representation may be considered sparse. I think there are three possible solutions:

  1. Use activation functions that do not saturate, e.g. leaky_relu as in ClariNet (see the sketch below).
  2. When teacher forcing is used, the model can predict the original waveform conditioned entirely on the teacher forcing input, since that input contains the complete information. So we can use dropout to make the teacher forcing input incomplete; then the model is forced to access the additional mel condition to predict the original waveform. I'm not sure to which layers dropout should be applied. I am working on this.
  3. Add noise to the teacher forcing inputs. The previous noise-adding implementation was buggy because I added noise to both inputs and outputs, so the predicted wave was noisy. I will fix this.
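For the first two ideas, here is a minimal TensorFlow 1.x sketch (not the actual code in masked.py; the function names, shapes, and dropout rate are assumptions) of a non-saturating transposed-convolution upsampler plus dropout on the teacher-forcing input:

import tensorflow as tf

def upsample_mel(mel, filters=64, stride=256):
    # mel: [batch, frames, n_mels]; add a height dim so conv2d_transpose can upsample along time
    x = tf.expand_dims(mel, axis=1)
    x = tf.layers.conv2d_transpose(
        x, filters, kernel_size=(1, 2 * stride), strides=(1, stride), padding='same')
    # leaky_relu keeps a nonzero gradient for large inputs, unlike tanh, so it does not saturate
    x = tf.nn.leaky_relu(x, alpha=0.1)
    return tf.squeeze(x, axis=1)  # [batch, samples, filters]

def dropout_teacher_input(shifted_wave, rate=0.1, training=True):
    # randomly zero out teacher-forcing samples so the model must rely on the mel condition
    return tf.layers.dropout(shifted_wave, rate=rate, training=training)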

bfs18 avatar Jul 31 '18 03:07 bfs18

Hi, I am running eval_parallel_wavenet.py. After 60k training steps it can generate audio with content, however the sound is very quiet. Is this problem related to the power loss? Besides, the config does not include the contrastive loss; how should I set this parameter?

EdisonChen726 avatar Aug 07 '18 02:08 EdisonChen726

Hi @EdisonChen726 I uploaded the model with contrastive loss. You can find the configuration json in the package. https://drive.google.com/open?id=1AtofQdXbSutb-_ZWFeA_I17NR2i8nUC7

bfs18 avatar Aug 07 '18 07:08 bfs18

@bfs18 thank you for the fast reply, I will try it asap

EdisonChen726 avatar Aug 07 '18 07:08 EdisonChen726

Updated ClariNet vocoder results. The ClariNet results have similar noise compared to the pwn results, so I think the noise comes from the power loss term. Compared to the teacher result, the student result does not have clear formants between 1000Hz and 3000Hz. This may be the source of the noise in the waves generated by the student. (teacher spectrogram: wn1; student spectrogram: pwn1) The priority frequency loss implemented in keithito/tacotron may alleviate the problem.
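For reference, the priority frequency idea in keithito/tacotron weights the low-frequency part of the spectrogram loss more heavily. A rough sketch (the names, the 3000 Hz cutoff, and the 0.5/0.5 weighting follow that repo's convention and would need adapting to the power-loss spectrograms used here):

import tensorflow as tf

def priority_freq_loss(target_spec, pred_spec, sample_rate=16000, priority_hz=3000):
    # target_spec, pred_spec: [batch, frames, n_freq_bins] linear-frequency spectrograms
    n_bins = pred_spec.shape[-1].value
    priority_bins = int(priority_hz / (sample_rate * 0.5) * n_bins)
    l1 = tf.abs(target_spec - pred_spec)
    # half of the loss over all frequencies, half over the band below priority_hz
    return 0.5 * tf.reduce_mean(l1) + 0.5 * tf.reduce_mean(l1[:, :, :priority_bins])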

bfs18 avatar Aug 07 '18 11:08 bfs18

@bfs18 hi, have you met the problem of a very quiet audio result? I need to turn the volume up very high before I can hear the voice. Do you have any idea why this happens? The volume of the teacher model's result is good, but the pwn result is not.

EdisonChen726 avatar Aug 08 '18 02:08 EdisonChen726

Hi @EdisonChen726 Setting use_mu_law=True causes low volume when training parallel wavenet.
It is caused by the clip_quant_scale function at L13 in wavenet/parallelgen.py. I don't know how to solve the problem yet. You can test it with the following code.

import numpy as np
import librosa

def inv_mu_law_numpy(x, mu=255.0):
    # standard inverse mu-law decoding: map integer samples back to [-1, 1] floats
    x = np.array(x).astype(np.float32)
    out = (x + 0.5) * 2. / (mu + 1)
    out = np.sign(out) / mu * ((1 + mu) ** np.abs(out) - 1)
    out = np.where(np.equal(x, 0), x, out)
    return out

def cast_quantize_numpy(x, quant_chann):
    # linearly quantize [-1, 1] floats to integers in [-quant_chann/2, quant_chann/2]
    x_quantized = x * quant_chann / 2
    return x_quantized.astype(np.int32)

audio, _ = librosa.load('test_data/test.wav', sr=16000)
# linearly quantize the audio, then apply the inverse mu-law expansion to samples that
# were never mu-law encoded; this squashes typical amplitudes toward zero,
# mimicking what clip_quant_scale does to the parallel wavenet output
audio_int = cast_quantize_numpy(audio, 2 ** 8)
audio_ = inv_mu_law_numpy(audio_int)
librosa.output.write_wav('test_data/test_inv.wav', audio_, sr=16000)

The volume of the output wave becomes very low.
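For intuition: with mu=255 and quant_chann=256, a sample at amplitude 0.5 quantizes to 64, (64 + 0.5) * 2 / 256 ≈ 0.504, and the expansion then gives (256^0.504 - 1) / 255 ≈ 0.06, so typical speech amplitudes shrink by almost an order of magnitude.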

bfs18 avatar Aug 08 '18 07:08 bfs18

@bfs18 got it, thank you so much! Now I have another problem: when I add the contrastive loss with 0.3, I get the error: ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[4,64,1,7680] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc. Do you know how to solve it?

EdisonChen726 avatar Aug 08 '18 09:08 EdisonChen726

@EdisonChen726 just try smaller batch size.

bfs18 avatar Aug 08 '18 11:08 bfs18

@bfs18 Hi, I used your model (wavenet_mol without pwn) to synthesize speech; the silent parts become a murmur, while the non-silent parts are normal. Do you know why? Is it because of the trimming at training time?

switchzts avatar Aug 09 '18 06:08 switchzts

Hi @switchzts The use_mu_law + mol waves are much cleaner in the silent parts. However, the no_mu_law + mol waves are just as you say, so 200k steps may not be enough to train a good no_mu_law + mol model. I am not sure whether trimming is a problem.

bfs18 avatar Aug 09 '18 07:08 bfs18

@bfs18 Hi, I tried setting the batch size to 1, but the same error happened.

EdisonChen726 avatar Aug 09 '18 08:08 EdisonChen726

Hi @bfs18 , I have some questions about initialization.

  1. Why is the scale_params bias initialized to -0.3? Is this an empirical value? And why is a log scale not used in the student net? https://github.com/bfs18/nsynth_wavenet/blob/4370294c8c088d3bc9e8b8486e75af9fe7f845cb/wavenet/parallel_wavenet.py#L243 https://github.com/bfs18/nsynth_wavenet/blob/4370294c8c088d3bc9e8b8486e75af9fe7f845cb/wavenet/parallel_wavenet.py#L92

  2. In the readme, you mentioned "Proper initial mean_tot and scale_tot values have positive impact on model convergence and numerical stability. According to the LJSpeech data distribution, proper initial values for mean_tot and scale_tot should be 0.0 and 0.05. I modified the initializer to achieve it." In parallel_wavenet.py, mean_tot and scale_tot are initialized to 0 and 1. Which initializer is modified to achieve the proper initial values for mean_tot and scale_tot (0.0 and 0.05)? https://github.com/bfs18/nsynth_wavenet/blob/4370294c8c088d3bc9e8b8486e75af9fe7f845cb/wavenet/parallel_wavenet.py#L276

Thank you!

HallidayReadyOne avatar Aug 09 '18 08:08 HallidayReadyOne

Hi @EdisonChen726 What's your gpu memory size? I only run the code on gpus with 12 GB of memory or more.

bfs18 avatar Aug 12 '18 09:08 bfs18

Hi @HallidayReadyOne -- Why is the scale_params bias initialized to -0.3? Is this an empirical value? Yes. I wrote some memos in the comments in test_scale on why I chose this value and why I did not use a log scale. Let me know if you need further explanation.

bfs18 avatar Aug 12 '18 09:08 bfs18

Hi @bfs18, thanks for the kind reply. I still need some guidance. In test_scale.py, you set the input data to be normally distributed with mean 0.0 and std 1.0 because, after data-dependent initialization for weight normalization, the output of a conv/deconv layer is approximately normally distributed with mean 0.0 and std 1.0? https://github.com/bfs18/nsynth_wavenet/blob/4370294c8c088d3bc9e8b8486e75af9fe7f845cb/tests/test_scale.py#L137 However, you also set use_weight_normalization = False for both wn & pwn. If use_weight_normalization = False, is this assumption still true (that the output of conv/deconv is approximately normally distributed with mean 0.0 and std 1.0)?
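(For context: data-dependent initialization for weight normalization, as described by Salimans & Kingma 2016, computes the gain and bias from an init batch so that the pre-activations start out roughly standard normal. A numpy illustration of the idea, not the code in this repo:)

import numpy as np

def data_dependent_init(pre_act, eps=1e-8):
    # pre_act: pre-activations v.x / ||v|| computed on an init batch, shape [batch, channels]
    m, s = pre_act.mean(axis=0), pre_act.std(axis=0)
    g = 1.0 / (s + eps)             # scale so the initial std is ~1
    b = -m / (s + eps)              # bias so the initial mean is ~0
    return g, b, g * pre_act + b    # the scaled output has mean ~0, std ~1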

HallidayReadyOne avatar Aug 13 '18 02:08 HallidayReadyOne

Hi @HallidayReadyOne You are right, this value was picked when use_weight_norm=True. Since it is chosen by experience, it is not that strict. When setting use_weight_norm=False, the initial scale is still small enough, so I keep this value.

bfs18 avatar Aug 13 '18 02:08 bfs18

Thanks @bfs18, another question about initialization: in the readme, you mentioned "Proper initial mean_tot and scale_tot values have positive impact on model convergence and numerical stability. According to the LJSpeech data distribution, proper initial values for mean_tot and scale_tot should be 0.0 and 0.05. I modified the initializer to achieve it." Could you please explain a little about how this is achieved?

HallidayReadyOne avatar Aug 13 '18 03:08 HallidayReadyOne