AudioDec vq_loss increase, not converge

I am training on my own dataset, which has 2 channels and a sampling rate of 16000 Hz. My config file is pasted below. The major changes I made are: 1) sampling rate, 2) data path, 3) input/output channels.

sampling_rate: &sampling_rate 16000
data:
    path: "../ABCS/Audio"
    subset:
        train: "train"
        valid: "dev"
        test:  "test"

###########################################################
#                   MODEL SETTING                         #
###########################################################
model_type: symAudioDec
train_mode: autoencoder
paradigm: efficient

generator_params:
    input_channels: 2
    output_channels: 2 
    encode_channels: 32
    decode_channels: 32
    code_dim: 64
    codebook_num: 8
    codebook_size: 1024
    bias: true
    enc_ratios: [2, 4, 8, 16]
    dec_ratios: [16, 8, 4, 2]
    enc_strides: [3, 4, 5, 5]
    dec_strides: [5, 5, 4, 3]
    mode: 'causal'
    codec: 'audiodec'
    projector: 'conv1d'
    quantier: 'residual_vq'

discriminator_params:
    scales: 3                              # Number of multi-scale discriminator.
    scale_downsample_pooling: "AvgPool1d"  # Pooling operation for scale discriminator.
    scale_downsample_pooling_params:
        kernel_size: 4                     # Pooling kernel size.
        stride: 2                          # Pooling stride.
        padding: 2                         # Padding size.
    scale_discriminator_params:
        in_channels: 1                     # Number of input channels.
        out_channels: 1                    # Number of output channels.
        kernel_sizes: [15, 41, 5, 3]       # List of kernel sizes.
        channels: 128                      # Initial number of channels.
        max_downsample_channels: 1024      # Maximum number of channels in downsampling conv layers.
        max_groups: 16                     # Maximum number of groups in downsampling conv layers.
        bias: true
        downsample_scales: [4, 4, 4, 4, 1] # Downsampling scales.
        nonlinear_activation: "LeakyReLU"  # Nonlinear activation.
        nonlinear_activation_params:
            negative_slope: 0.1
    follow_official_norm: true             # Whether to follow the official norm setting.
    periods: [2, 3, 5, 7, 11]              # List of period for multi-period discriminator.
    period_discriminator_params:
        in_channels: 1                     # Number of input channels.
        out_channels: 1                    # Number of output channels.
        kernel_sizes: [5, 3]               # List of kernel sizes.
        channels: 32                       # Initial number of channels.
        downsample_scales: [3, 3, 3, 3, 1] # Downsampling scales.
        max_downsample_channels: 1024      # Maximum number of channels in downsampling conv layers.
        bias: true                         # Whether to use bias parameter in conv layer.
        nonlinear_activation: "LeakyReLU"  # Nonlinear activation.
        nonlinear_activation_params:       # Nonlinear activation parameters.
            negative_slope: 0.1
        use_weight_norm: true              # Whether to apply weight normalization.
        use_spectral_norm: false           # Whether to apply spectral normalization.

###########################################################
#                 METRIC LOSS SETTING                     #
###########################################################
use_mel_loss: true                   # Whether to use Mel-spectrogram loss.
mel_loss_params:
    fs: *sampling_rate
    fft_sizes: [2048]
    hop_sizes: [300]
    win_lengths: [2048]
    window: "hann_window"
    num_mels: 80
    fmin: 0
    fmax: 12000
    log_base: null

use_stft_loss: false                 # Whether to use multi-resolution STFT loss.
stft_loss_params:
    fft_sizes: [1024, 2048, 512]     # List of FFT size for STFT-based loss.
    hop_sizes: [120, 240, 50]        # List of hop size for STFT-based loss
    win_lengths: [600, 1200, 240]    # List of window length for STFT-based loss.
    window: "hann_window"            # Window function for STFT-based loss

use_shape_loss: false                # Whether to use waveform shape loss.
shape_loss_params:
    winlen: [300]

###########################################################
#                  ADV LOSS SETTING                       #
###########################################################
generator_adv_loss_params:
    average_by_discriminators: false # Whether to average loss by #discriminators.

discriminator_adv_loss_params:
    average_by_discriminators: false # Whether to average loss by #discriminators.

use_feat_match_loss: true
feat_match_loss_params:
    average_by_discriminators: false # Whether to average loss by #discriminators.
    average_by_layers: false         # Whether to average loss by #layers in each discriminator.
    include_final_outputs: false     # Whether to include final outputs in feat match loss calculation.

###########################################################
#                  LOSS WEIGHT SETTING                    #
###########################################################
lambda_adv: 0.1          # Loss weight of adversarial loss.
lambda_feat_match: 2.0   # Loss weight of feat match loss.
lambda_vq_loss: 1.0      # Loss weight of vector quantize loss.
lambda_mel_loss: 45.0    # Loss weight of mel-spectrogram loss.
lambda_stft_loss: 45.0   # Loss weight of multi-resolution stft loss.
lambda_shape_loss: 45.0  # Loss weight of multi-window shape loss.
      
###########################################################
#                  DATA LOADER SETTING                    #
###########################################################
batch_size: 64              # Batch size.
batch_length: 9600          # Length of each audio in batch (training w/o adv). Make sure divisible by hop_size.
adv_batch_length: 9600      # Length of each audio in batch (training w/ adv). Make sure divisible by hop_size.
pin_memory: true            # Whether to pin memory in Pytorch DataLoader.
num_workers: 8              # Number of workers in Pytorch DataLoader.

###########################################################
#             OPTIMIZER & SCHEDULER SETTING               #
###########################################################
generator_optimizer_type: Adam
generator_optimizer_params:
    lr: 1.0e-4
    betas: [0.5, 0.9]
    weight_decay: 0.0
generator_scheduler_type: StepLR
generator_scheduler_params:
    step_size: 200000      # Generator's scheduler step size.
    gamma: 1.0
generator_grad_norm: -1
discriminator_optimizer_type: Adam
discriminator_optimizer_params:
    lr: 2.0e-4
    betas: [0.5, 0.9]
    weight_decay: 0.0
discriminator_scheduler_type: MultiStepLR
discriminator_scheduler_params:
    gamma: 0.5
    milestones:
        - 200000
        - 400000
        - 600000
        - 800000
discriminator_grad_norm: -1

###########################################################
#                    INTERVAL SETTING                     #
###########################################################
start_steps:                       # Number of steps to start training
    generator: 0
    discriminator: 500000 
train_max_steps: 500000            # Number of training steps. (w/o adv)
adv_train_max_steps: 1000000       # Number of training steps. (w/ adv)
save_interval_steps: 100000        # Interval steps to save checkpoint.
eval_interval_steps: 1000          # Interval steps to evaluate the network.
log_interval_steps: 100            # Interval steps to record the training log.

In stage 1 training (<500k steps), the mel_loss seems reasonable, but the vq_loss keeps getting larger, which seems odd.
In stage 2 training, the mel_loss goes much higher. Is that because 1) I set the wrong lambda_adv, or 2) the bad vq_loss causes it? What is the recommended way to address this?

Thank you in advance!

Mar 11 '24 06:03 lixinghe1999

Hi, it is normal for the vq_loss to increase during training, since the encoder usually outputs a white-noise-like latent at the beginning. Once the encoder starts to learn something meaningful, the latent becomes harder to quantize, which results in a higher vq_loss.
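To make the quantity concrete, here is a minimal sketch of what a vector-quantization loss measures: the mean squared distance between each latent vector and its nearest codebook entry. It is a generic single-codebook illustration, not AudioDec's residual VQ implementation; the function names and toy values are hypothetical.

```python
def nearest_code(vector, codebook):
    """Return the codebook entry closest to `vector` in squared Euclidean distance."""
    return min(codebook, key=lambda code: sum((v - c) ** 2 for v, c in zip(vector, code)))


def vq_loss(latents, codebook):
    """Mean squared error between each latent vector and its nearest code."""
    total = 0.0
    count = 0
    for vector in latents:
        code = nearest_code(vector, codebook)
        total += sum((v - c) ** 2 for v, c in zip(vector, code))
        count += len(vector)
    return total / count


# Toy 2-dimensional codebook with two entries (hypothetical values).
codebook = [[0.0, 0.0], [1.0, 1.0]]

# Latents that sit close to the codebook entries quantize with a small error.
close_latents = [[0.1, -0.1], [0.9, 1.1]]

# Latents spread far from every entry quantize with a larger error.
spread_latents = [[3.0, -2.0], [-1.5, 4.0]]

print(vq_loss(close_latents, codebook))
print(vq_loss(spread_latents, codebook))
```

In this toy setting the loss grows as the latents move away from the available codes, which is the kind of effect described above. Whether it fully accounts for the curve in a real run is not something this sketch can show.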

The mel_loss will also increase during GAN training, since the objective of GAN training is to fool the discriminator, not to reduce the mel loss.

However, if the vq_loss or mel_loss does not converge, that is a problem. Given your settings, I think the temporal downsampling ratio might be too high (enc_strides: [3, 4, 5, 5] and dec_strides: [5, 5, 4, 3] give a downsampling ratio of 3 * 4 * 5 * 5 = 300).

Using a smaller temporal downsampling ratio may ease the problem. (For example, enc_strides: [2, 3, 4, 5], dec_strides: [5, 4, 3, 2].)
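The ratios mentioned above can be checked with a short sketch. It assumes the total temporal downsampling is the product of the per-layer strides, which matches the 300 quoted for [3, 4, 5, 5]; the helper names are hypothetical and not part of AudioDec.

```python
from math import prod


def downsampling_ratio(strides):
    """Total temporal downsampling, taken as the product of the layer strides."""
    return prod(strides)


def code_frame_rate(sampling_rate, strides):
    """Number of code frames produced per second of audio."""
    return sampling_rate / downsampling_ratio(strides)


sampling_rate = 16000           # sampling_rate from the posted config
posted_strides = [3, 4, 5, 5]   # enc_strides from the posted config
smaller_strides = [2, 3, 4, 5]  # enc_strides suggested as an example

for strides in (posted_strides, smaller_strides):
    ratio = downsampling_ratio(strides)
    rate = code_frame_rate(sampling_rate, strides)
    print(f"strides={strides} ratio={ratio} frame_rate={rate:.2f} Hz")
```

At 16 kHz, a ratio of 300 leaves roughly 53 code frames per second, while a ratio of 120 leaves roughly 133.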

Mar 11 '24 14:03 bigpon

Taking a smaller temporal-resolution downsampling ratio may ease the problem. (For example, enc_strides: [2, 3, 4, 5], dec_strides: [5, 4, 3, 2])

I don't think so, because [2, 3, 4, 5] means the downsampling ratio is 120, and 9600/120 = 80 > 64 (codebook_dim).

Apr 08 '24 12:04 a897456

Same question as https://github.com/facebookresearch/AudioDec/issues/19. I would like to know how to adjust the config parameters to get the best output for 16 kHz input data. How did you finally adjust them? @lixinghe1999

Apr 08 '24 12:04 a897456

because [2,3,4,5] means the downsampling ratio=120, 9600/120=80 > 64(codebook_dim)

Hi, the downsampling applies to the temporal axis, so the code rate is 48000 (48 kHz) / 120 = 400 Hz, which is unrelated to the code dimension of 64. That is, for each second you will get 400 * 64 * (number of RVQ codebooks, here 8).
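A small sketch may help separate the two axes. It assumes a channel-first latent layout of (code_dim, frames), which is an assumption for illustration rather than something confirmed in this thread; `latent_shape` is a hypothetical helper.

```python
from math import prod


def latent_shape(num_samples, strides, code_dim):
    """Shape of the latent for one segment: (code_dim, number of time frames)."""
    ratio = prod(strides)
    if num_samples % ratio != 0:
        raise ValueError(f"{num_samples} samples is not divisible by ratio {ratio}")
    return (code_dim, num_samples // ratio)


strides = [2, 3, 4, 5]  # downsampling ratio 120
code_dim = 64           # code_dim from the posted config

# A 9600-sample training segment: 80 time frames, each a 64-dimensional vector.
print(latent_shape(9600, strides, code_dim))

# One second of 48 kHz audio: 400 time frames, i.e. a 400 Hz code rate.
print(latent_shape(48000, strides, code_dim))
```

The 80 counts frames along time and the 64 is the size of each frame's vector, so the two numbers do not need to satisfy any inequality with each other.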

Apr 16 '24 08:04 bigpon

batch_length: 9600

Yes, but with batch_length: 9600, 9600/120 = 80, so I think the strides should be changed together with batch_length.

Apr 16 '24 11:04 a897456

From my understanding, the batch_length only influences GPU memory consumption, so normally we don't need to worry about it (as long as it is divisible by the downsampling ratio). The codebook dim you mentioned seems to apply only to a single time frame and is not related to the batch_length. Please correct me if I am wrong.

Apr 16 '24 15:04 lixinghe1999

Yes, the batch_length is mostly related to GPU usage, and the only requirement is that it is divisible by the downsampling ratio.

I actually found that a longer batch_length gives better performance, which is intuitive, but a longer batch_length also results in a much longer training time in the second stage (w/ GAN training).

However, a longer batch_length does not significantly increase the training time in the first stage, so I use 96000 in the 1st stage and 9600 in the 2nd stage in my latest settings.

Apr 16 '24 17:04 bigpon
