
Difficulty training a Universal model when fmax is null

YutoNishimura-v2 opened this issue 3 years ago · 8 comments

Hello! Thank you so much for publishing such a great code! Thanks to you, I'm enjoying my voice conversion!

I'm currently using hifigan as a vocoder for voice conversion, and I'm trying to train it to create a Universal model with different parameters, but I'm having trouble.

I need some advice. Here is the config.

{
    "resblock": "1",
    "num_gpus": 0,
    "batch_size": 12,
    "learning_rate": 0.0002,
    "adam_b1": 0.8,
    "adam_b2": 0.99,
    "lr_decay": 0.999,
    "seed": 1234,

    "upsample_rates": [5,5,3,2,2],
    "upsample_kernel_sizes": [11,11,7,4,4],
    "upsample_initial_channel": 512,
    "resblock_kernel_sizes": [3,7,11],
    "resblock_dilation_sizes": [[1,3,5], [1,3,5], [1,3,5]],
    "discriminator_periods": [2,3,5,7,11,17,23],

    "segment_size": 9600,
    "num_mels": 80,
    "n_fft": 2048,
    "hop_size": 300,
    "win_size": 1200,

    "sampling_rate": 24000,

    "fmin": 0,
    "fmax": null,
    "fmax_for_loss": null,

    "num_workers": 4,

    "dist_config": {
        "dist_backend": "nccl",
        "dist_url": "tcp://localhost:54321",
        "world_size": 1
    }
}
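A quick consistency check worth running on any modified config: the HiFi-GAN generator expands each mel frame into hop_size audio samples, so the product of upsample_rates must equal hop_size. The helper below is just an illustrative sketch, not code from the repo.

```python
from functools import reduce
from operator import mul

upsample_rates = [5, 5, 3, 2, 2]   # values from the config above
hop_size = 300
segment_size = 9600

# The generator upsamples one mel frame by prod(upsample_rates) samples,
# so this product has to match the hop size used to compute the mels.
product = reduce(mul, upsample_rates)
assert product == hop_size  # 5*5*3*2*2 = 300

# segment_size should also be a whole number of frames (9600 / 300 = 32).
assert segment_size % hop_size == 0
```

The config above passes both checks, so the frame geometry itself is consistent.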

An image of the loss curves is attached below.

The result was no different from when I trained on LJSpeech alone.

Here are some other things we tried.

  • Prepared mels without audio normalization: no change in results
  • Trained using the distributed Universal config: almost the same result
  • Fine-tuned from the distributed universal weights: almost the same result
  • Trained on mels without dividing the audio by max_wav: worse

Datasets used

  • JSUT
  • JVS
  • LibriTTS
  • LJSpeech
  • VCTK

I train with a mixture of all of these.

I suspect the reason is that I set fmax to null (i.e., 12000 Hz, half of the 24000 Hz sampling rate). I don't want to change the other STFT parameters any more than necessary, because they are fixed by the VC model.

I'm sorry for the length of this post, and thank you for your patience.

[Screenshot 2021-07-08 105800]

YutoNishimura-v2 · Jul 08 '21


So what is your trouble?

Alexey322 · Jul 08 '21

Thank you for your reply. And I apologize for not clarifying my question.

So, what I'm trying to ask here is:

I can't get the mel loss down to 0.2 (so the quality is not sufficient for inference). I would like to know how I can achieve that.

Is it a bad parameter? Or is it a problem with the data?
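For context on the 0.2 target: in the original train.py, the validation "mel error" is, as far as I can tell, a plain L1 distance between ground-truth and generated mel spectrograms (the training objective scales the same term by 45). A minimal numpy sketch of that metric, with made-up shapes:

```python
import numpy as np

def mel_l1(mel_true, mel_pred):
    # Mean absolute difference over all mel bins and frames, mirroring
    # F.l1_loss(y_mel, y_g_hat_mel) in the original training script.
    return float(np.abs(mel_true - mel_pred).mean())

rng = np.random.default_rng(0)
mel_true = rng.normal(size=(80, 32))   # 80 mel bins x 32 frames

# Identical mels give zero loss; a constant offset of 0.2 gives a loss
# of about 0.2, the level being aimed for in this thread.
assert mel_l1(mel_true, mel_true) == 0.0
assert abs(mel_l1(mel_true, mel_true + 0.2) - 0.2) < 1e-9
```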

YutoNishimura-v2 · Jul 08 '21

@YutoNishimura-v2 I think the authors of this repository trained a universal vocoder for several million iterations, since the data was about 1000 hours (judging by this comment #1).

Your fmax is automatically converted to half the sample rate of your audio when you set it to null.
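To make the fmax point concrete: with fmax=null the 80 mel bins are spread up to the Nyquist frequency (12000 Hz at 24 kHz) instead of 8000 Hz, so the two settings produce different mel representations. A rough sketch using the HTK mel formula (the repo's librosa-based filterbank uses the Slaney scale by default, so the exact numbers differ, but the effect is the same):

```python
import math

def hz_to_mel(f):
    # HTK-style mel scale; illustrative only -- the Slaney scale used by
    # librosa's default filterbank differs slightly in shape.
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_bin_edges(fmin, fmax, num_mels):
    # num_mels triangular filters need num_mels + 2 equally spaced
    # edge points on the mel axis.
    lo, hi = hz_to_mel(fmin), hz_to_mel(fmax)
    return [lo + i * (hi - lo) / (num_mels + 1) for i in range(num_mels + 2)]

# The same 80 bins stretched over 0-12000 Hz vs 0-8000 Hz: every filter
# covers a wider band, so mels from the two configs are not
# interchangeable, and weights pretrained at fmax=8000 see mismatched
# inputs when fine-tuned on fmax=null mels.
edges_null = mel_bin_edges(0, 12000, 80)
edges_8k = mel_bin_edges(0, 8000, 80)
wider = edges_null[1] - edges_null[0] > edges_8k[1] - edges_8k[0]
print(wider)  # True
```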

I would train up to 1 million iterations and only then draw conclusions.

Alexey322 · Jul 08 '21

As for the data, we are using LibriTTS, VCTK, and LJSpeech, which the author says are sufficient!

I thought I had failed because the slope of the validation loss looked almost zero, but after hearing your advice, I think I still need to train longer. I'll let it train for 1M iterations, and if that doesn't work, I'll ask again!

Thank you very much.

YutoNishimura-v2 · Jul 08 '21

@Alexey322 Hello.

I have just taken your advice and am training toward 1M iterations, but even after 50K (1/20 of 1M), the validation loss has not dropped by even 0.0001.

In general, the loss tends to decrease most easily early in training, so I don't expect it to start decreasing by 1M iterations.

So, if you still have the logs, could you show me the initial change in loss from your earlier 1M-iteration run?

Honestly, I feel that something is fundamentally wrong, and that this is not just an iteration-count problem.

Thank you very much.

YutoNishimura-v2 · Jul 10 '21


Do you know how to set the config if my sample rate is 16 kHz?
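One way to derive a 16 kHz setup is to keep the same frame-shift ratio as the 24 kHz config above. The exact numbers below are an assumption on my part, not an official config; the only hard constraint is that the upsample rates multiply out to the hop size:

```python
from functools import reduce
from operator import mul

# Hypothetical 16 kHz settings (an illustration, not an official config).
config_16k = {
    "sampling_rate": 16000,
    "n_fft": 1024,
    "hop_size": 200,        # 12.5 ms frame shift, like 300 @ 24 kHz
    "win_size": 800,        # 50 ms window, like 1200 @ 24 kHz
    "upsample_rates": [5, 5, 4, 2],
    "upsample_kernel_sizes": [11, 11, 8, 4],
    "segment_size": 6400,   # 32 frames x 200 samples
    "fmin": 0,
    "fmax": None,           # null -> 8000 Hz, the Nyquist at 16 kHz
}

# The hard constraint: prod(upsample_rates) == hop_size.
assert reduce(mul, config_16k["upsample_rates"]) == config_16k["hop_size"]
assert config_16k["segment_size"] % config_16k["hop_size"] == 0
```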

Tian14267 · Nov 26 '21


Any tips or updates here for universal HiFi-GAN? Which is better by default: fmax=null or fmax=8000?

v-nhandt21 · May 17 '22