nsynth_wavenet
Reproducing parallel results
I've been trying to reproduce the parallel wavenet results, but I'm running into some issues with training the teacher model. I have trained it on the LJ Speech dataset with the default wavenet_mol.json configuration to ~280k steps (all other hparams are unchanged as well). The loss looks good, but the evaluated speech is just babbling, as if local conditioning wasn't used.
I didn't see anything immediately apparent as to why this is happening. Do you have any ideas?
Hi @mortont Could you provide more information? For example, the loss curve, some generated samples and the code commit.
Sure, I'm on commit 67eacb995aef465d1e2ed810f25e0d7d3899e9b6
and this is the loss curve
This is one of the generated samples: gen_LJ001-0028.wav.zip. For reference, this is the original that was used for the mel conditioning: LJ001-0028.wav.zip
What loss did you let the teacher get to in your examples? Wondering if this just needs more training time. Let me know if any other information would be helpful.
According to my experience, if a small batch size such as 2 is used, 200k training steps is not enough, because the model hasn't seen enough data. What's your batch size? Also, please ensure USE_RESIZE_CONV=False in masked.py.
This is my loss curve.
Ah, thank you. My batch size is 2 so I'll continue training. I'll go ahead and close this issue and report back when I have good samples for posterity.
Hi @mortont, I'm sorry that the wavenet results cannot be reproduced with the default config json. Recently I could not reproduce the previous result either, and I finally figured out that weight normalization harms the wavenet performance. So setting use_weight_norm=false in wavenet_mol.json may solve your problem. Initially weight normalization produced some promising results for parallel wavenet, so I kept it as the default configuration. It probably has a negative effect on the learned mel-spectrum representation, and a good mel-spectrum representation is vital for a good model, since the wave sample points depend entirely on it. I'm also doing similar tests on parallel wavenet. Once again, I'm sorry.
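Concretely, a minimal sketch of flipping that flag (assuming the config sits at config/wavenet_mol.json in your checkout and the key is spelled use_weight_norm):

# Sketch: turn off weight normalization in the teacher hparams json.
import json

config_path = 'config/wavenet_mol.json'  # adjust to your checkout
with open(config_path) as f:
    hparams = json.load(f)

hparams['use_weight_norm'] = False

with open(config_path, 'w') as f:
    json.dump(hparams, f, indent=2)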
Good catch @bfs18, thank you! I'll try training with use_weight_norm set to false. Unfortunately my training is a bit slow with a single GPU. Do you have a pretrained teacher model to share? I'd be happy to help on any parts of parallel wavenet once I have a good teacher.
Actually, I just saw the updated readme... I'll check that out now.
Hi,
I tested the code, and setting use_weight_norm=False solves the problem.
And I implemented http://export.arxiv.org/abs/1807.07281 this weekend. I will update the code after some testing.
I am getting an error when loading the pre-trained model https://drive.google.com/open?id=13rHT6zr2sXeedmjUOpp6IVQdT30cy66_
File "/scratch/spack/opt/spack/linux-centos7-x86_64/gcc-4.8.5/tensorflow-1.6.0-wm4rkg4qsenbhfdz7fzlopz6qcxckfff/lib/python3.4/site-packages/tensorflow/python/training/saver.py", line 1812, in latest_checkpoint if file_io.get_matching_files(v2_path) or file_io.get_matching_files( File "/scratch/spack/opt/spack/linux-centos7-x86_64/gcc-4.8.5/tensorflow-1.6.0-wm4rkg4qsenbhfdz7fzlopz6qcxckfff/lib/python3.4/site-packages/tensorflow/python/lib/io/file_io.py", line 337, in get_matching_files for single_filename in filename File "/scratch/spack/opt/spack/linux-centos7-x86_64/gcc-4.8.5/tensorflow-1.6.0-wm4rkg4qsenbhfdz7fzlopz6qcxckfff/lib/python3.4/site-packages/tensorflow/python/framework/errors_impl.py", line 516, in __exit__ c_api.TF_GetCode(self.status.status)) tensorflow.python.framework.errors_impl.NotFoundError: /data/logs-wavenet/eval_ckpt/ns_pwn-n_MU-WN_DDI_mfinit-n_LOGS-n_CLIP-MAG-L2-06_27; No such file or directory
Any idea?
Hi @zhang-jian, I'm sorry about that. The first line of ns_pwn-eval/checkpoint contains the absolute path of the model checkpoint. You can modify the path according to your file system, or you can keep only the basename of that path. Refer to ns_wn-eval/checkpoint for an example.
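For example, a minimal sketch of the basename option (the file path below is illustrative; adjust it to wherever you unpacked the model):

# Sketch: make the checkpoint paths relative by keeping only the basename.
# A TF `checkpoint` file contains lines like:
#   model_checkpoint_path: "/some/absolute/path/model.ckpt-XXXXX"
import os
import re

ckpt_file = 'ns_pwn-eval/checkpoint'

def to_basename(line):
    # Replace the quoted absolute path with just its final component.
    return re.sub(r'"([^"]+)"',
                  lambda m: '"%s"' % os.path.basename(m.group(1)),
                  line)

with open(ckpt_file) as f:
    lines = f.readlines()
with open(ckpt_file, 'w') as f:
    f.writelines(to_basename(l) for l in lines)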
I was able to confirm that the pretrained model produces a recognizable voice; all that was needed was changing the path in the checkpoint file to a relative one. Great work!
@bfs18 ClariNet looks very interesting, mostly because of the more numerically stable implementation. I've noticed that parallel wavenet optimization is very difficult and unstable, so hopefully ClariNet helps with that.
I got some new examples running with contrastive loss and without weight normalization at step 70948. The result may improve a bit after longer training. Attached sample: gen_LJ001-0001
I looked deeper into the problem.
Some bad configurations (e.g. weight normalization + tanh transposed conv activation) may cause the activations of the transposed convolution layer to saturate. The mel condition then becomes meaningless, and the model degenerates to an unconditional one.
The following figures are the histogram and spectrum of the transposed convolution stack output. This model only generates random speech even though the mel condition is used.
In contrast, the following figures come from an OK model.
Most of the activation values are close to 0. So the learned representation may be considered sparse.
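For reference, a rough sketch of how such a saturation check can be computed on a dumped activation tensor (the names are illustrative, not the repo's code):

# Sketch: measure how saturated a tanh-activated transposed-conv output is.
# `acts` stands in for the transposed conv stack output, e.g. fetched with sess.run.
import numpy as np

def saturation_report(acts, thresh=0.99):
    acts = np.asarray(acts)
    frac_saturated = np.mean(np.abs(acts) > thresh)   # values stuck near +/-1
    frac_near_zero = np.mean(np.abs(acts) < 0.01)     # values carrying little information
    print('saturated: %.1f%%, near zero: %.1f%%'
          % (100 * frac_saturated, 100 * frac_near_zero))
    return frac_saturated, frac_near_zero

# Example with random data standing in for real activations:
saturation_report(np.tanh(3.0 * np.random.randn(2, 100, 64)))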
I think there are a few solutions.
- Use activation functions that do not saturate, e.g. leaky_relu as in ClariNet.
- When teacher forcing is used, the model can predict the original waveform conditioned entirely on the teacher forcing input, since that input contains the complete information. So we can use dropout to make the teacher forcing input incomplete; the model is then forced to use the additional mel condition to predict the original waveform (see the sketch after this list). I'm not sure which layers dropout should be applied to. I am working on this.
- Add noise to the teacher forcing inputs. The previous noise-adding implementation is buggy, because I added noise to both the inputs and the outputs, so the predicted wave is noisy. I will fix this.
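A rough sketch of the dropout idea from the second bullet (assuming TF 1.x; the tensor names are illustrative, not the repo's code):

# Sketch: corrupt the teacher forcing input so the network has to rely on
# the mel condition instead of copying the shifted waveform.
import tensorflow as tf

def dropout_teacher_forcing(shifted_wave, keep_prob=0.9, is_training=True):
    # shifted_wave: [batch, time, 1] waveform shifted one step (teacher forcing input).
    if not is_training:
        return shifted_wave
    # tf.nn.dropout zeroes a fraction of samples and rescales the rest by 1 / keep_prob.
    return tf.nn.dropout(shifted_wave, keep_prob=keep_prob)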
Hi, I am running eval_parallel_wavenet.py. After 60k training steps it can generate audio with content, but the sound is very quiet. Is this problem related to the power loss? Also, the config does not include contrastive loss; how should I set this parameter?
Hi @EdisonChen726 I uploaded the model with contrastive loss. You can find the configuration json in the package. https://drive.google.com/open?id=1AtofQdXbSutb-_ZWFeA_I17NR2i8nUC7
@bfs18 thank you for the fast reply, I will try it asap
Updated ClariNet vocoder results. The ClariNet results have similar noise to the pwn results, so I think the noise comes from the power loss term.
Compared to the teacher result, the student result does not have clear formants between 1000 Hz and 3000 Hz. This may be the source of the noise in the waves generated by the student.
teacher spec
student spec
The priority frequency loss implemented in keithito/tacotron may alleviate the problem.
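A rough numpy sketch of that idea, i.e. an L1 spectral loss with extra weight on the bins below ~3000 Hz (this is the general shape of the keithito/tacotron trick, not its exact code; all names here are illustrative):

# Sketch: weight the low-frequency bins, where the missing formants live, more heavily.
import numpy as np

def priority_freq_loss(pred_mag, target_mag, sample_rate=16000,
                       priority_hz=3000.0, priority_weight=0.5):
    # pred_mag, target_mag: [frames, freq_bins] linear magnitude spectrograms.
    n_bins = target_mag.shape[-1]
    priority_bin = int(priority_hz / (sample_rate / 2.0) * n_bins)
    l1 = np.abs(pred_mag - target_mag)
    full_loss = l1.mean()
    low_freq_loss = l1[:, :priority_bin].mean()
    return (1.0 - priority_weight) * full_loss + priority_weight * low_freq_loss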
@bfs18 hi, have you encountered the problem of very quiet audio? I need to turn the volume up very high to hear the voice. Do you have any idea why this happens? The volume of the teacher model's result is good, but the pwn result is not.
Hi @EdisonChen726
Setting use_mu_law=True would cause low volume when training parallel wavenet.
It is caused by clip_quant_scale function in L13 in wavenet/parallelgen.py. I don't know how to solve the problem.
You can test it with the following code.
import numpy as np
import librosa

def inv_mu_law_numpy(x, mu=255.0):
    # Inverse mu-law companding.
    x = np.array(x).astype(np.float32)
    out = (x + 0.5) * 2. / (mu + 1)
    out = np.sign(out) / mu * ((1 + mu) ** np.abs(out) - 1)
    out = np.where(np.equal(x, 0), x, out)
    return out

def cast_quantize_numpy(x, quant_chann):
    # Quantize [-1, 1] float audio to integers in [-quant_chann / 2, quant_chann / 2).
    x_quantized = x * quant_chann / 2
    return x_quantized.astype(np.int32)

audio, _ = librosa.load('test_data/test.wav', sr=16000)
audio_int = cast_quantize_numpy(audio, 2 ** 8)
# Applying inverse mu-law to samples that were only linearly quantized
# compresses the mid-range amplitudes, so the written wave sounds very quiet.
audio_ = inv_mu_law_numpy(audio_int)
librosa.output.write_wav('test_data/test_inv.wav', audio_, sr=16000)
The volume of the output wave becomes very low.
@bfs18 got it! Thank you so much. Right now I have another problem: when I add the contrastive loss with weight 0.3, I get the error: ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[4,64,1,7680] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc. Do you know how to solve it?
@EdisonChen726 just try a smaller batch size.
@bfs18 Hi, I used your model (wavenet_mol without pwn) to test synthesized speech. The silent parts become a murmur, while the non-silent parts are normal. Do you know why? Is it because of the trim at training time?
Hi @switchzts The use_mu_law + mol waves are much cleaner in the silent parts. However, the no_mu_law + mol waves are just as you say, so 200k steps may not be enough to train a good no_mu_law + mol model. I am not sure whether the trim is a problem.
@bfs18 Hi, I tried setting the batch size to 1, but the same error happened.
Hi @bfs18 , I have some questions about initialization.
- Why is the scale_params bias initialized to -0.3? Is this an empirical value? And why not use a log scale in the student net? https://github.com/bfs18/nsynth_wavenet/blob/4370294c8c088d3bc9e8b8486e75af9fe7f845cb/wavenet/parallel_wavenet.py#L243 https://github.com/bfs18/nsynth_wavenet/blob/4370294c8c088d3bc9e8b8486e75af9fe7f845cb/wavenet/parallel_wavenet.py#L92
- In the readme, you mentioned "Proper initial mean_tot and scale_tot values have positive impact on model convergence and numerical stability. According to the LJSpeech data distribution, proper initial values for mean_tot and scale_tot should be 0.0 and 0.05. I modified the initializer to achieve it." In parallel_wavenet.py, mean_tot and scale_tot are initialized to 0 and 1. Which initializer was modified to achieve the proper initial values for mean_tot and scale_tot (0.0 and 0.05)? https://github.com/bfs18/nsynth_wavenet/blob/4370294c8c088d3bc9e8b8486e75af9fe7f845cb/wavenet/parallel_wavenet.py#L276
Thank you!
Hi @EdisonChen726 What's your GPU memory size? I only run the code on GPUs with 12 GB of memory or more.
Hi @HallidayReadyOne
- Why is the scale_params bias initialized to -0.3? Is this an empirical value?
Yes. I wrote some memos on why this value was chosen and why a log scale is not used in the comments in test_scale. Let me know if you need further explanation.
Hi @bfs18, thanks for the kind reply. I still need some guidance. In test_scale.py, you set the input data to be normally distributed with mean 0.0 and std 1.0; is that because, after data-dependent initialization for weight normalization, the output of conv/deconv is approximately normally distributed with mean 0.0 and std 1.0? https://github.com/bfs18/nsynth_wavenet/blob/4370294c8c088d3bc9e8b8486e75af9fe7f845cb/tests/test_scale.py#L137 However, you also set use_weight_normalization = False for both wn & pwn. If use_weight_normalization = False, is this assumption still true (that the output of conv/deconv is approximately normally distributed with mean 0.0 and std 1.0)?
Hi @HallidayReadyOne You are right, this value was picked when use_weight_norm=True. Since it is chosen by experience, it is not that strict. When setting use_weight_norm=False, the initial scale is still small enough, so I keep this value.
Thanks @bfs18, another question about init: in the readme, you mentioned "Proper initial mean_tot and scale_tot values have positive impact on model convergence and numerical stability. According to the LJSpeech data distribution, proper initial values for mean_tot and scale_tot should be 0.0 and 0.05. I modified the initializer to achieve it." Could you please explain a little about how this is achieved?