
Help extending to MAILabs data - Warbly speech - MoL, 1000k steps


Dear @r9y9, I've trained a MoL WaveNet for 1000k steps on ~30,000 audio samples from the M-AI Labs dataset. I am using a pre-trained Transformer from @kan-bayashi.

The resulting speech is fairly intelligible, but it has a warble that I would like to clear up. I'm happy to share generated samples or configurations to help diagnose the issue. Do you have any experience training on that dataset, or recommendations on what might move me in the right direction?

Best, Andy

adhamel avatar Mar 18 '20 20:03 adhamel

Hi, sorry for the late reply. If I remember correctly, the samples in M-AI Labs have a low signal-to-noise ratio, so WaveNet may struggle to learn the distribution of clean speech. To diagnose the cause, could you share some generated audio samples and your training configuration?

r9y9 avatar Mar 24 '20 04:03 r9y9

Hey, no worries. I trained with the mixture-of-logistics configuration, using data from a single male Spanish speaker. Following your recommendations elsewhere, I decreased the allowed log_scale_min as training progressed.
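To illustrate what "decreasing the allowed log_scale_min" means in practice: the floor on the predicted log-scales is lowered in stages as training stabilizes. A minimal sketch with illustrative breakpoints (not the exact values used here):

```python
# Illustrative schedule for lowering the log_scale_min floor over
# training; the breakpoints and values are examples, not the ones
# actually used in this thread.
def log_scale_min_schedule(step):
    if step < 200_000:
        return -7.0   # loose floor early on
    elif step < 500_000:
        return -9.0
    return -11.0      # tighter floor once training has stabilized
```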

Here is a sample after ~1.6M steps: https://github.com/adhamel/samples/blob/master/response.wav

For evaluation, I'm using .npy features generated by this Transformer (https://github.com/espnet/espnet/blob/master/egs/m_ailabs/tts1/RESULTS.md):

v.0.5.3 / Transformer
- Silence trimming
- FFT in points: 1024
- Shift in points: 256
- Frequency limit: 80-7600
- Fast-GL 64 iters

Environments:
- date: Sun Sep 29 21:20:05 JST 2019
- python version: 3.7.3 (default, Mar 27 2019, 22:11:17) [GCC 7.3.0]
- espnet version: espnet 0.5.1
- chainer version: chainer 6.0.0
- pytorch version: pytorch 1.0.1.post2
- Git hash: 6b2ff45d1e2c624691f197014b8fe71a5e70bae9
- Commit date: Sat Sep 28 14:33:32 2019 +0900
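For reference, a mel-extraction sketch matching the settings above (FFT 1024, shift 256, 80-7600 Hz, 80 mel bins). librosa is used here for illustration; the actual espnet and wavenet_vocoder pipelines use their own extraction code, so treat this as a rough equivalent, not a drop-in:

```python
# Sketch of log-mel extraction with the parameters listed above.
import librosa
import numpy as np

def logmelspectrogram(wav, sr=16000, n_fft=1024, hop_length=256,
                      win_length=1024, n_mels=80, fmin=80, fmax=7600):
    # Magnitude spectrogram with a Hann window.
    S = np.abs(librosa.stft(wav, n_fft=n_fft, hop_length=hop_length,
                            win_length=win_length, window="hann"))
    # Project onto the 80-band mel filterbank limited to 80-7600 Hz.
    mel_basis = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels,
                                    fmin=fmin, fmax=fmax)
    mel = np.dot(mel_basis, S)
    return np.log10(np.maximum(mel, 1e-10))
```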

adhamel avatar Mar 24 '20 19:03 adhamel

Could you also share the config file(s) for WaveNet?

For the generated sample, it sounds like the signal gain is too high. I suspect there is a mismatch between the acoustic features at training time and those at evaluation time. Did you carefully normalize the acoustic features? Did you make sure you used the same acoustic feature pipeline for training both the Transformer and the WaveNet?
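For anyone hitting the same issue, a minimal sketch of the kind of mean-variance normalization being asked about; the file lists and helper names here are hypothetical:

```python
# The statistics must be computed once over the *training* set, and the
# identical (mean, scale) pair applied at both WaveNet training time and
# synthesis time; a mismatch here is a common cause of gain problems.
import numpy as np

def compute_stats(feature_files):
    # feature_files: list of paths to (T, 80) .npy mel features.
    feats = np.concatenate([np.load(f) for f in feature_files], axis=0)
    return feats.mean(axis=0), feats.std(axis=0)

def normalize(mel, mean, scale, eps=1e-8):
    return (mel - mean) / (scale + eps)

# mean, scale = compute_stats(train_npy_paths)   # hypothetical file list
# mel_norm = normalize(np.load("response_feats.npy"), mean, scale)
```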

r9y9 avatar Mar 25 '20 07:03 r9y9

Absolutely. Here are the overridden hparams. I also tried an fmin value of 125. I did not take care to normalize the acoustic features; however, the WaveNet is trained on the same data subset as the Transformer.

{ "name": "wavenet_vocoder", "input_type": "raw", "quantize_channels": 65536, "preprocess": "preemphasis", "postprocess": "inv_preemphasis", "global_gain_scale": 0.55, "sample_rate": 16000, "silence_threshold": 2, "num_mels": 80, "fmin": 80, "fmax": 7600, "fft_size": 1024, "hop_size": 256, "frame_shift_ms": null, "win_length": 1024, "win_length_ms": -1.0, "window": "hann", "highpass_cutoff": 70.0, "output_distribution": "Logistic", "log_scale_min": -32.23619130191664, "out_channels": 30, "layers": 24, "stacks": 4, "residual_channels": 128, "gate_channels": 256, "skip_out_channels": 128, "dropout": 0.0, "kernel_size": 3, "cin_channels": 80, "cin_pad": 2, "upsample_conditional_features": true, "upsample_net": "ConvInUpsampleNetwork", "upsample_params": { "upsample_scales": [ 4, 4, 4, 4 ] }, "gin_channels": -1, "n_speakers": 7, "pin_memory": true, "num_workers": 2, "batch_size": 8, "optimizer": "Adam", "optimizer_params": { "lr": 0.001, "eps": 1e-08, "weight_decay": 0.0 }, "lr_schedule": "step_learning_rate_decay", "lr_schedule_kwargs": { "anneal_rate": 0.5, "anneal_interval": 200000 }, "max_train_steps": 1000000, "nepochs": 2000, "clip_thresh": -1, "max_time_sec": null, "max_time_steps": 10240, "exponential_moving_average": true, "ema_decay": 0.9999, "checkpoint_interval": 100000, "train_eval_interval": 100000, "test_eval_epoch_interval": 50, "save_optimizer_state": true }

adhamel avatar Mar 25 '20 15:03 adhamel

The hparams look okay. I'd recommend double-checking for acoustic feature normalization differences (if any), and also checking the analysis/synthesis quality (i.e., copy-synthesis, not TTS).

Pre-emphasis at the data preprocessing stage changes the signal gain, so you might want to tune global_gain_scale. 0.55 was chosen for LJSpeech, if I remember correctly.
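To make the interaction concrete, here is a sketch of pre-emphasis and a corpus-derived gain scale. The 0.97 coefficient and the scan-the-corpus heuristic are assumptions for illustration, not values taken from this repo:

```python
# Pre-emphasis boosts high frequencies, so the peak amplitude after the
# filter can exceed 1.0 unless it is scaled back down; global_gain_scale
# compensates for this, and a different corpus likely needs its own value.
import numpy as np
from scipy.signal import lfilter

def preemphasis(wav, coef=0.97):  # 0.97 is a common default, assumed here
    return lfilter([1.0, -coef], [1.0], wav)

def suggest_gain_scale(wavs, coef=0.97, headroom=0.99):
    # Scan the corpus for the worst-case peak after pre-emphasis and
    # choose a scale that keeps every sample within [-headroom, headroom].
    peak = max(np.abs(preemphasis(w, coef)).max() for w in wavs)
    return headroom / peak
```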

Another suggestion is to use a higher log_scale_min (e.g., -9 or -11). As suggested in the ClariNet paper, a smaller variance bound requires more training iterations and can be unstable.
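For context, log_scale_min is the floor applied to the predicted log-scale of each logistic component. A simplified sketch of where it enters, condensed from a typical discretized mixture-of-logistics implementation (not the repo's exact code):

```python
# out_channels=30 splits into 10 mixtures x (weight, mean, log_scale).
# Clamping the log-scale from below means a floor of -9 or -11 forbids
# near-zero variances that make training unstable (per ClariNet).
import torch

def split_mol_params(y_hat, num_mixtures=10, log_scale_min=-9.0):
    # y_hat: (B, T, 3 * num_mixtures) raw network outputs
    logit_probs = y_hat[..., :num_mixtures]             # mixture weights
    means = y_hat[..., num_mixtures:2 * num_mixtures]   # component means
    log_scales = torch.clamp(
        y_hat[..., 2 * num_mixtures:], min=log_scale_min)
    return logit_probs, means, log_scales
```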

r9y9 avatar Mar 30 '20 08:03 r9y9

Thank you, you are correct. I will test a higher log_scale_min. (As a strange aside, I found significant drops in loss at intervals of ~53 epochs.) I hope y'all are staying safe over there.

adhamel avatar Apr 02 '20 20:04 adhamel