
Difficulty training a Universal model when fmax is null

YutoNishimura-v2 opened this issue 3 years ago · 8 comments

Hello! Thank you so much for publishing such a great code! Thanks to you, I'm enjoying my voice conversion!

I'm currently using hifigan as a vocoder for voice conversion, and I'm trying to train it to create a Universal model with different parameters, but I'm having trouble.

I need some advice. Here is the config.

{
    "resblock": "1",
    "num_gpus": 0,
    "batch_size": 12,
    "learning_rate": 0.0002,
    "adam_b1": 0.8,
    "adam_b2": 0.99,
    "lr_decay": 0.999,
    "seed": 1234,

    "upsample_rates": [5,5,3,2,2],
    "upsample_kernel_sizes": [11,11,7,4,4],
    "upsample_initial_channel": 512,
    "resblock_kernel_sizes": [3,7,11],
    "resblock_dilation_sizes": [[1,3,5], [1,3,5], [1,3,5]],
    "discriminator_periods": [2,3,5,7,11,17,23],

    "segment_size": 9600,
    "num_mels": 80,
    "n_fft": 2048,
    "hop_size": 300,
    "win_size": 1200,

    "sampling_rate": 24000,

    "fmin": 0,
    "fmax": null,
    "fmax_for_loss": null,

    "num_workers": 4,

    "dist_config": {
        "dist_backend": "nccl",
        "dist_url": "tcp://localhost:54321",
        "world_size": 1
    }
}
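A quick consistency check worth running on any modified config: the HiFi-GAN generator expands each mel frame into hop_size audio samples, so the product of upsample_rates must equal hop_size. The helper below is just an illustrative sketch, not code from the repo.

```python
from functools import reduce
from operator import mul

upsample_rates = [5, 5, 3, 2, 2]   # values from the config above
hop_size = 300
segment_size = 9600

# The generator upsamples one mel frame by prod(upsample_rates) samples,
# so this product has to match the hop size used to compute the mels.
product = reduce(mul, upsample_rates)
assert product == hop_size  # 5*5*3*2*2 = 300

# segment_size should also be a whole number of frames (9600 / 300 = 32).
assert segment_size % hop_size == 0
```

The config above passes both checks, so the frame geometry itself is consistent.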

An image of the loss curves is attached below.

The result was no different from when I trained on LJSpeech alone.

Here are some other things we tried.

  • Prepared mels without audio normalization: no change in results
  • Trained using the distributed Universal config: almost the same result
  • Fine-tuned from the distributed universal weights: almost the same result
  • Trained on mels without dividing the audio by max_wav: worse

Datasets used

  • JSUT
  • JVS
  • LibriTTS
  • LJSpeech
  • VCTK

I train with a mixture of all of these.

I suspect the reason is that I set fmax to null (i.e., 12000 Hz, half of the 24000 Hz sampling rate). I don't want to change the other STFT parameters any more than necessary, because they are fixed by the VC model.

I'm sorry for the length of this post, and thank you for your patience.

[Screenshot 2021-07-08 105800]

YutoNishimura-v2 · Jul 08 '21


So what is your trouble?

Alexey322 · Jul 08 '21

Thank you for your reply. And I apologize for not clarifying my question.

So, what I'm trying to ask here is:

I can't get the mel loss down to 0.2 (so the quality is not sufficient for inference). I would like to know how I can achieve that.

Is it a bad parameter? Or is it a problem with the data?
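For context on the 0.2 target: in the original train.py, the validation "mel error" is, as far as I can tell, a plain L1 distance between ground-truth and generated mel spectrograms (the training objective scales the same term by 45). A minimal numpy sketch of that metric, with made-up shapes:

```python
import numpy as np

def mel_l1(mel_true, mel_pred):
    # Mean absolute difference over all mel bins and frames, mirroring
    # F.l1_loss(y_mel, y_g_hat_mel) in the original training script.
    return float(np.abs(mel_true - mel_pred).mean())

rng = np.random.default_rng(0)
mel_true = rng.normal(size=(80, 32))   # 80 mel bins x 32 frames

# Identical mels give zero loss; a constant offset of 0.2 gives a loss
# of about 0.2, the level being aimed for in this thread.
assert mel_l1(mel_true, mel_true) == 0.0
assert abs(mel_l1(mel_true, mel_true + 0.2) - 0.2) < 1e-9
```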

YutoNishimura-v2 · Jul 08 '21

@YutoNishimura-v2 I think the authors of this repository trained a universal vocoder for several million iterations, since the data was about 1000 hours (judging by this comment #1).

Your fmax is automatically converted to half the sample rate of your audio when you set it to null.
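To make the fmax point concrete: with fmax=null the 80 mel bins are spread up to the Nyquist frequency (12000 Hz at 24 kHz) instead of 8000 Hz, so the two settings produce different mel representations. A rough sketch using the HTK mel formula (the repo's librosa-based filterbank uses the Slaney scale by default, so the exact numbers differ, but the effect is the same):

```python
import math

def hz_to_mel(f):
    # HTK-style mel scale; illustrative only -- the Slaney scale used by
    # librosa's default filterbank differs slightly in shape.
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_bin_edges(fmin, fmax, num_mels):
    # num_mels triangular filters need num_mels + 2 equally spaced
    # edge points on the mel axis.
    lo, hi = hz_to_mel(fmin), hz_to_mel(fmax)
    return [lo + i * (hi - lo) / (num_mels + 1) for i in range(num_mels + 2)]

# The same 80 bins stretched over 0-12000 Hz vs 0-8000 Hz: every filter
# covers a wider band, so mels from the two configs are not
# interchangeable, and weights pretrained at fmax=8000 see mismatched
# inputs when fine-tuned on fmax=null mels.
edges_null = mel_bin_edges(0, 12000, 80)
edges_8k = mel_bin_edges(0, 8000, 80)
wider = edges_null[1] - edges_null[0] > edges_8k[1] - edges_8k[0]
print(wider)  # True
```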

I would train up to 1 million iterations and only then draw conclusions.

Alexey322 · Jul 08 '21

As for the data, we are using LibriTTS, VCTK, and LJSpeech, which the author says are sufficient!

I thought I had failed because the slope of the validation loss looked almost zero, but after hearing your advice, I think I still need to train longer. I'll let it train for 1M iterations, and if that doesn't work, I'll ask again!

Thank you very much.

YutoNishimura-v2 · Jul 08 '21

@Alexey322 Hello.

I have just taken your advice and am training toward 1M iterations, but even after 50K (1/20 of 1M), the validation loss has not dropped by even 0.0001.

In general, the loss tends to decrease most easily early in training, so I don't expect it to start decreasing by 1M iterations.

So, if you still have the logs, could you show me the initial change in loss from your earlier 1M-iteration run?

Honestly, I feel that something is fundamentally wrong, and that this is not just an iteration-count problem.

Thank you very much.

YutoNishimura-v2 · Jul 10 '21


Do you know how to set the config if my sample rate is 16 kHz?
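One way to derive a 16 kHz setup is to keep the same frame-shift ratio as the 24 kHz config above. The exact numbers below are an assumption on my part, not an official config; the only hard constraint is that the upsample rates multiply out to the hop size:

```python
from functools import reduce
from operator import mul

# Hypothetical 16 kHz settings (an illustration, not an official config).
config_16k = {
    "sampling_rate": 16000,
    "n_fft": 1024,
    "hop_size": 200,        # 12.5 ms frame shift, like 300 @ 24 kHz
    "win_size": 800,        # 50 ms window, like 1200 @ 24 kHz
    "upsample_rates": [5, 5, 4, 2],
    "upsample_kernel_sizes": [11, 11, 8, 4],
    "segment_size": 6400,   # 32 frames x 200 samples
    "fmin": 0,
    "fmax": None,           # null -> 8000 Hz, the Nyquist at 16 kHz
}

# The hard constraint: prod(upsample_rates) == hop_size.
assert reduce(mul, config_16k["upsample_rates"]) == config_16k["hop_size"]
assert config_16k["segment_size"] % config_16k["hop_size"] == 0
```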

Tian14267 · Nov 26 '21


Any tips or updates here for universal HiFi-GAN? Which is better by default: fmax=null or fmax=8000?

v-nhandt21 · May 17 '22