
In StyleMelGAN, voice quality decreases after joint G net and D net training

Open huhuqwaszxedc opened this issue 2 years ago • 4 comments

   Hi sir, I have a problem; could you help me? I chose the StyleMelGAN generator and the MelGAN discriminator. When I pre-train the generator, voice quality improves, but after joint G/D training the voice quality decreases. Is the D network holding the generator back? Also, my generated utterances are very short: at 16 kHz sampling there are only 2880 sample points per utterance. Will the 4x downsampling pooling layers in the D network mislead the training of the G network?
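As a rough sanity check of how little context such a clip contains (numbers taken from the config posted later in this thread; frame counts ignore padding):

```python
# Back-of-envelope check for a 2880-sample clip at 16 kHz with hop_size=160
# (values from the config below in this thread; padding is not modeled).
sampling_rate = 16000
hop_size = 160
clip_samples = 2880

n_frames = clip_samples // hop_size        # mel frames per clip
duration = clip_samples / sampling_rate    # clip length in seconds
print(n_frames)   # 18 mel frames
print(duration)   # 0.18 seconds of audio
```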
  Here is my training loss

[screenshot: training loss curves]

huhuqwaszxedc avatar Apr 05 '22 07:04 huhuqwaszxedc

I cannot comment from this figure alone. Please attach your config and share the details of your dataset.

kan-bayashi avatar Apr 05 '22 13:04 kan-bayashi

Thank you very much for your reply. Based on your source code, I am trying to use the StyleMelGAN generator with the MelGAN multi-scale discriminator for speech packet loss concealment. The dataset consists of 11,000 utterances selected from LibriSpeech.

sampling_rate: 16000     # Sampling rate.
fft_size: 1024           # FFT size.
hop_size: 160            # Hop size.
win_length: null         # Window length.
                         # If set to null, it will be the same as fft_size.
window: "hann"           # Window function.
num_mels: 80             # Number of mel basis.
fmin: 80                 # Minimum freq in mel basis calculation.
fmax: 7600               # Maximum frequency in mel basis calculation.
global_gain_scale: 1.0   # Will be multiplied to all of waveform.
trim_silence: true       # Whether to trim the start and end of silence.
trim_threshold_in_db: 20 # Need to tune carefully if the recording is not good.
trim_frame_size: 1024    # Frame size in trimming.
trim_hop_size: 160       # Hop size in trimming.

discriminator_type: "MelGANMultiScaleDiscriminator" # Discriminator type.
discriminator_params:
    in_channels: 1                    # Number of input channels.
    out_channels: 1                   # Number of output channels.
    scales: 3                         # Number of multi-scales.
    downsample_pooling: "AvgPool1d"   # Pooling type for the input downsampling.
    downsample_pooling_params:        # Parameters of the above pooling function.
        kernel_size: 4
        stride: 2
        padding: 1
        count_include_pad: False
    kernel_sizes: [5, 3]              # List of kernel size.
    channels: 16                      # Number of channels of the initial conv layer.
    max_downsample_channels: 512      # Maximum number of channels of downsampling layers.
    downsample_scales: [4, 4, 4]      # List of downsampling scales.
    nonlinear_activation: "LeakyReLU" # Nonlinear activation function.
    nonlinear_activation_params:      # Parameters of nonlinear activation function.
        negative_slope: 0.2
    use_weight_norm: True             # Whether to use weight norm.
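A quick sketch of why such short clips leave the discriminator with very few timesteps, using the pooling and downsampling settings above (lengths are approximate; padding and kernel edges are ignored):

```python
from math import prod

# Approximate timesteps each MelGAN discriminator scale sees for a
# 2880-sample clip, given AvgPool1d stride 2 between scales and
# downsample_scales [4, 4, 4] inside each scale's discriminator.
clip = 2880
pool_stride = 2
downsample_scales = [4, 4, 4]
total_down = prod(downsample_scales)  # 64x reduction within each scale

for scale in range(3):
    inp = clip // (pool_stride ** scale)  # input after repeated pooling
    out = inp // total_down               # timesteps at the final conv
    print(f"scale {scale}: {inp} samples -> ~{out} timesteps")
```

With only ~11 timesteps at the coarsest scale, the discriminator has very little temporal context to judge realism.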

generator_type: "StyleMelGANGenerator" # Generator type.
generator_params:
    in_channels: 128
    aux_channels: 80
    channels: 64
    out_channels: 1
    kernel_size: 9
    dilation: 2
    bias: True
    noise_upsample_scales: [10, 2, 2, 2]
    noise_upsample_activation: "LeakyReLU"
    noise_upsample_activation_params:
        negative_slope: 0.2
    upsample_scales: [5, 1, 2, 1, 2, 2, 2, 2]
    upsample_mode: "nearest"
    gated_function: "softmax"
    use_weight_norm: True
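One consistency check worth running on this config: the product of the generator's `upsample_scales` should equal `hop_size`, so that one mel frame expands to exactly 160 waveform samples. A minimal sketch:

```python
from math import prod

# Sanity check: the aux-feature upsampling factor must match hop_size,
# otherwise generated waveform length will not align with the mel input.
hop_size = 160
upsample_scales = [5, 1, 2, 1, 2, 2, 2, 2]  # from the config above

assert prod(upsample_scales) == hop_size
print(prod(upsample_scales))  # 160
```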

batch_size: 32              # Batch size.
batch_max_steps: 2880       # Length of each audio in batch. Make sure divisible by hop_size.
pin_memory: true            # Whether to pin memory in Pytorch DataLoader.

stft_loss_params:
    fft_sizes: [1024, 2048, 512]  # List of FFT size for STFT-based loss.
    hop_sizes: [120, 240, 50]     # List of hop size for STFT-based loss
    win_lengths: [600, 1200, 240] # List of window length for STFT-based loss.
    window: "hann_window"         # Window function for STFT-based loss
use_subband_stft_loss: true
subband_stft_loss_params:
    fft_sizes: [384, 683, 171]  # List of FFT size for STFT-based loss.
    hop_sizes: [30, 60, 10]     # List of hop size for STFT-based loss
    win_lengths: [150, 300, 60] # List of window length for STFT-based loss.
    window: "hann_window"       # Window function for STFT-based loss

use_feat_match_loss: false # Whether to use feature matching loss.
lambda_adv: 3            # Loss balancing coefficient for adversarial loss.
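For intuition on the `lambda_adv` setting: roughly speaking, the generator loss in this kind of GAN vocoder combines the STFT loss with a weighted adversarial term, so with `lambda_adv: 3` the adversarial term can dominate once the STFT loss plateaus. A minimal sketch with hypothetical per-batch loss values:

```python
# Hypothetical per-batch values, only to illustrate the weighting;
# these are not taken from the training run in this thread.
lambda_adv = 3.0
stft_loss = 0.9
adv_loss = 0.5

total = stft_loss + lambda_adv * adv_loss
print(total)  # 2.4 -- adversarial term outweighs the STFT term here
```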

huhuqwaszxedc avatar Apr 06 '22 01:04 huhuqwaszxedc

  • batch_max_steps seems too short.
  • What is your intention in using a different discriminator? Did you try the default combination? If not, you should try it first.

kan-bayashi avatar Apr 11 '22 00:04 kan-bayashi

Thank you sir, I have already fixed this problem.


huhuqwaszxedc avatar Apr 12 '22 03:04 huhuqwaszxedc