
In StyleMelGAN, voice quality decreases after joint G net and D net training

Open huhuqwaszxedc opened this issue 2 years ago • 4 comments

   Hi sir, I have a problem; could you help me? I chose the StyleMelGAN generator and the MelGAN discriminator. When I pre-train the generator, voice quality improves, but after joint G/D training the voice quality decreases. Is the D network holding the generator back? Also, my generated utterances are very short: at 16 kHz sampling there are only 2880 sample points per utterance. Will the 4x downsampling pooling layers in the D network mislead the training of the G network?
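As a rough sanity check of how little context such a clip contains (numbers taken from the config posted later in this thread; frame counts ignore padding):

```python
# Back-of-envelope check for a 2880-sample clip at 16 kHz with hop_size=160
# (values from the config below in this thread; padding is not modeled).
sampling_rate = 16000
hop_size = 160
clip_samples = 2880

n_frames = clip_samples // hop_size        # mel frames per clip
duration = clip_samples / sampling_rate    # clip length in seconds
print(n_frames)   # 18 mel frames
print(duration)   # 0.18 seconds of audio
```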
  Here is my training loss

[screenshot: training loss curves]

huhuqwaszxedc avatar Apr 05 '22 07:04 huhuqwaszxedc

I cannot comment from this figure alone. Please attach your config and share the details of your dataset.

kan-bayashi avatar Apr 05 '22 13:04 kan-bayashi

Thank you very much for your reply. Based on your source code, I am trying to use the StyleMelGAN generator with the MelGAN multi-scale discriminator for speech packet loss concealment. The dataset consists of 11,000 utterances selected from LibriSpeech.

sampling_rate: 16000     # Sampling rate.
fft_size: 1024           # FFT size.
hop_size: 160            # Hop size.
win_length: null         # Window length.
                         # If set to null, it will be the same as fft_size.
window: "hann"           # Window function.
num_mels: 80             # Number of mel basis.
fmin: 80                 # Minimum freq in mel basis calculation.
fmax: 7600               # Maximum frequency in mel basis calculation.
global_gain_scale: 1.0   # Will be multiplied to all of waveform.
trim_silence: true       # Whether to trim the start and end of silence.
trim_threshold_in_db: 20 # Need to tune carefully if the recording is not good.
trim_frame_size: 1024    # Frame size in trimming.
trim_hop_size: 160       # Hop size in trimming.

discriminator_type: "MelGANMultiScaleDiscriminator" # Discriminator type.
discriminator_params:
    in_channels: 1                    # Number of input channels.
    out_channels: 1                   # Number of output channels.
    scales: 3                         # Number of multi-scales.
    downsample_pooling: "AvgPool1d"   # Pooling type for the input downsampling.
    downsample_pooling_params:        # Parameters of the above pooling function.
        kernel_size: 4
        stride: 2
        padding: 1
        count_include_pad: False
    kernel_sizes: [5, 3]              # List of kernel size.
    channels: 16                      # Number of channels of the initial conv layer.
    max_downsample_channels: 512      # Maximum number of channels of downsampling layers.
    downsample_scales: [4, 4, 4]      # List of downsampling scales.
    nonlinear_activation: "LeakyReLU" # Nonlinear activation function.
    nonlinear_activation_params:      # Parameters of nonlinear activation function.
        negative_slope: 0.2
    use_weight_norm: True             # Whether to use weight norm.
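A quick sketch of why such short clips leave the discriminator with very few timesteps, using the pooling and downsampling settings above (lengths are approximate; padding and kernel edges are ignored):

```python
from math import prod

# Approximate timesteps each MelGAN discriminator scale sees for a
# 2880-sample clip, given AvgPool1d stride 2 between scales and
# downsample_scales [4, 4, 4] inside each scale's discriminator.
clip = 2880
pool_stride = 2
downsample_scales = [4, 4, 4]
total_down = prod(downsample_scales)  # 64x reduction within each scale

for scale in range(3):
    inp = clip // (pool_stride ** scale)  # input after repeated pooling
    out = inp // total_down               # timesteps at the final conv
    print(f"scale {scale}: {inp} samples -> ~{out} timesteps")
```

With only ~11 timesteps at the coarsest scale, the discriminator has very little temporal context to judge realism.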

generator_type: "StyleMelGANGenerator" # Generator type.
generator_params:
    in_channels: 128
    aux_channels: 80
    channels: 64
    out_channels: 1
    kernel_size: 9
    dilation: 2
    bias: True
    noise_upsample_scales: [10, 2, 2, 2]
    noise_upsample_activation: "LeakyReLU"
    noise_upsample_activation_params:
        negative_slope: 0.2
    upsample_scales: [5, 1, 2, 1, 2, 2, 2, 2]
    upsample_mode: "nearest"
    gated_function: "softmax"
    use_weight_norm: True
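One consistency check worth running on this config: the product of the generator's `upsample_scales` should equal `hop_size`, so that one mel frame expands to exactly 160 waveform samples. A minimal sketch:

```python
from math import prod

# Sanity check: the aux-feature upsampling factor must match hop_size,
# otherwise generated waveform length will not align with the mel input.
hop_size = 160
upsample_scales = [5, 1, 2, 1, 2, 2, 2, 2]  # from the config above

assert prod(upsample_scales) == hop_size
print(prod(upsample_scales))  # 160
```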

batch_size: 32              # Batch size.
batch_max_steps: 2880       # Length of each audio in batch. Make sure divisible by hop_size.
pin_memory: true            # Whether to pin memory in Pytorch DataLoader.

stft_loss_params:
    fft_sizes: [1024, 2048, 512]  # List of FFT size for STFT-based loss.
    hop_sizes: [120, 240, 50]     # List of hop size for STFT-based loss
    win_lengths: [600, 1200, 240] # List of window length for STFT-based loss.
    window: "hann_window"         # Window function for STFT-based loss
use_subband_stft_loss: true
subband_stft_loss_params:
    fft_sizes: [384, 683, 171]  # List of FFT size for STFT-based loss.
    hop_sizes: [30, 60, 10]     # List of hop size for STFT-based loss
    win_lengths: [150, 300, 60] # List of window length for STFT-based loss.
    window: "hann_window"       # Window function for STFT-based loss

use_feat_match_loss: false # Whether to use feature matching loss.
lambda_adv: 3            # Loss balancing coefficient for adversarial loss.
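For intuition on the `lambda_adv` setting: roughly speaking, the generator loss in this kind of GAN vocoder combines the STFT loss with a weighted adversarial term, so with `lambda_adv: 3` the adversarial term can dominate once the STFT loss plateaus. A minimal sketch with hypothetical per-batch loss values:

```python
# Hypothetical per-batch values, only to illustrate the weighting;
# these are not taken from the training run in this thread.
lambda_adv = 3.0
stft_loss = 0.9
adv_loss = 0.5

total = stft_loss + lambda_adv * adv_loss
print(total)  # 2.4 -- adversarial term outweighs the STFT term here
```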

huhuqwaszxedc avatar Apr 06 '22 01:04 huhuqwaszxedc

  • batch_max_steps seems too short.
  • What is your intention in using a different discriminator? Did you try the default combination? If not, you should try it first.

kan-bayashi avatar Apr 11 '22 00:04 kan-bayashi

Thank you sir, I have already fixed this problem.


huhuqwaszxedc avatar Apr 12 '22 03:04 huhuqwaszxedc