NeMo
Help with Conformer single-GPU configuration
Hello, I have tried training Conformer a few times with different audio augmentations, but the results are always the same: the model performs well on the training and test audio, but on my own recordings, or any other audio that was not in the training/test set, it performs horribly. I am using ffprobe to validate the encoding, sample rate, and number of channels; all audios have these properties: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 16000 Hz, 1 channels, s16, 256 kb/s. My question is: am I doing something wrong, or why does Conformer lack generalization?
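For reference, this is roughly the kind of sanity check I run on the recordings, in addition to ffprobe (a minimal sketch using only Python's standard-library wave module; the file path is a placeholder):

import wave

# Placeholder path; replace with one of the recordings being transcribed.
path = "my_recording.wav"

with wave.open(path, "rb") as f:
    print("channels:    ", f.getnchannels())              # expect 1
    print("sample rate: ", f.getframerate())              # expect 16000
    print("sample width:", f.getsampwidth(), "bytes")     # expect 2 (pcm_s16le)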
# It contains the default values for training a Conformer-CTC ASR model, large size (~120M) with CTC loss and sub-word encoding.
# Architecture and training config:
# Default learning parameters in this config are set for effective batch size of 2K. To train it with smaller effective
# batch sizes, you may need to re-tune the learning parameters or use higher accumulate_grad_batches.
# Here are the recommended configs for different variants of Conformer-CTC, other parameters are the same as in this config file.
# One extra layer (compared to the original paper) is added to the medium and large variants to compensate for replacing the LSTM decoder with a linear one.
#
# +--------------+---------+---------+----------+------------+-----+
# | Model        | d_model | n_heads | n_layers | time_masks | lr  |
# +==============+=========+=========+==========+============+=====+
# | Small  (13M) |   176   |    4    |    16    |     5      | 5.0 |
# +--------------+---------+---------+----------+------------+-----+
# | Medium (30M) |   256   |    4    |    18    |     5      | 5.0 |
# +--------------+---------+---------+----------+------------+-----+
# | Large (121M) |   512   |    8    |    18    |    10      | 2.0 |
# +--------------+---------+---------+----------+------------+-----+
#
# If you do not want to train with AMP, you may use weight decay of 0.0 or reduce the number of time maskings to 2
# with time_width=100. It may help when you want to train for fewer epochs and need faster convergence.
# With weight_decay=0.0, learning rate may need to get reduced to 2.0.
# You may find more info about Conformer-CTC here: https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/asr/models.html#conformer-ctc
# Pre-trained models of Conformer-CTC can be found here: https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/asr/results.html
# The checkpoint of the large model trained on LibriSpeech with this recipe can be found here: https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_en_conformer_ctc_large_ls
name: "Conformer-CTC-BPE"
model:
sample_rate: 16000
log_prediction: true # enables logging sample predictions in the output during training
ctc_reduction: 'mean_batch'
skip_nan_grad: false
train_ds:
augmentor:
white_noise:
prob: 0.3
min_level: -50
max_level: -10
impulse:
prob: 0.3
manifest_path: /home/ubuntu/ASR/NeMo/scripts/dataset_processing/RIR_DATA/processed/rir.json
shift:
prob: 0.3
min_shift_ms: -5.0
max_shift_ms: 5.0
#time_stretch:
#prob: 0.4
#min_speed_rate: 0.9
#max_speed_rate: 1.1
#num_rates: 5
#n_fft: 512
#gain:
#prob: 0.4
#min_gain_dbfs: -10
#max_gain_dbfs: 10
#========================
#transcode_aug:
#prob: 0.3
#speed:
#prob: 0.3
#resample_type: 'kaiser_best'
#num_rates: 5
#sr: 16000
manifest_filepath: ???
sample_rate: ${model.sample_rate}
batch_size: 32 # you may increase batch_size if your memory allows
shuffle: true
num_workers: 8
pin_memory: true
use_start_end_token: false
trim_silence: false
max_duration: 16.7 # it is set for LibriSpeech, you may need to update it for your dataset
min_duration: 0.1
# tarred datasets
is_tarred: false
tarred_audio_filepaths: null
shuffle_n: 2048
# bucketing params
bucketing_strategy: "synced_randomized"
bucketing_batch_size: null
validation_ds:
manifest_filepath: ???
sample_rate: ${model.sample_rate}
batch_size: 32 # you may increase batch_size if your memory allows
shuffle: false
num_workers: 8
pin_memory: true
use_start_end_token: false
test_ds:
manifest_filepath: null
sample_rate: ${model.sample_rate}
batch_size: 16 # you may increase batch_size if your memory allows
shuffle: false
num_workers: 8
pin_memory: true
use_start_end_token: false
# recommend small vocab size of 128 or 256 when using 4x sub-sampling
# you may find more detail on how to train a tokenizer at: /scripts/tokenizers/process_asr_text_tokenizer.py
tokenizer:
dir: ??? # path to directory which contains either tokenizer.model (bpe) or vocab.txt (wpe)
type: bpe # Can be either bpe (SentencePiece tokenizer) or wpe (WordPiece tokenizer)
preprocessor:
_target_: nemo.collections.asr.modules.AudioToMelSpectrogramPreprocessor
sample_rate: ${model.sample_rate}
normalize: "per_feature"
window_size: 0.025
window_stride: 0.01
window: "hann"
features: 80
n_fft: 512
log: true
frame_splicing: 1
dither: 0.00001
pad_to: 0
pad_value: 0.0
spec_augment:
_target_: nemo.collections.asr.modules.SpectrogramAugmentation
#rect_masks: 5 # Number of rectangles to cut from any given spectrogram
#rect_freq: 50 # Max cut of size 50 along the frequency dimension
#rect_time: 120 # Max cut of size 120 along the time dimension
# SpecAugment parameters
freq_masks: 2 # Cut two frequency bands
freq_width: 27 # ... of width 15 at maximum
time_masks: 5 # Cut out 10 time bands
time_width: 0.05 # ... of width 25 at maximum
encoder:
_target_: nemo.collections.asr.modules.ConformerEncoder
feat_in: ${model.preprocessor.features}
feat_out: -1 # you may set it if you need different output size other than the default d_model
n_layers: 16
d_model: 176
# Sub-sampling params
subsampling: striding # vggnet or striding, vggnet may give better results but needs more memory
subsampling_factor: 4 # must be power of 2
subsampling_conv_channels: -1 # -1 sets it to d_model
# Feed forward module's params
ff_expansion_factor: 4
# Multi-headed Attention Module's params
self_attention_model: rel_pos # rel_pos or abs_pos
n_heads: 4 # may need to be lower for smaller d_models
# [left, right] specifies the number of steps to be seen from left and right of each step in self-attention
att_context_size: [-1, -1] # -1 means unlimited context
xscaling: true # scales up the input embeddings by sqrt(d_model)
untie_biases: true # unties the biases of the TransformerXL layers
pos_emb_max_len: 5000
# Convolution module's params
conv_kernel_size: 31
conv_norm_type: 'batch_norm' # batch_norm or layer_norm
### regularization
dropout: 0.1 # The dropout used in most of the Conformer Modules
dropout_emb: 0.0 # The dropout used for embeddings
dropout_att: 0.1 # The dropout for multi-headed attention modules
decoder:
_target_: nemo.collections.asr.modules.ConvASRDecoder
feat_in: null
num_classes: -1
vocabulary: []
optim:
name: adamw
lr: 5.0
# optimizer arguments
betas: [0.9, 0.98]
# less necessity for weight_decay as we already have large augmentations with SpecAug
# you may need weight_decay for large models, stable AMP training, small datasets, or when lower augmentations are used
# weight decay of 0.0 with lr of 2.0 also works fine
weight_decay: 0
# scheduler setup
sched:
name: NoamAnnealing
d_model: ${model.encoder.d_model}
# scheduler config override
warmup_steps: 10000
warmup_ratio: null
min_lr: 1e-6
trainer:
devices: -1 # number of GPUs, -1 would use all available GPUs
num_nodes: 1
max_epochs: 1000
max_steps: null # computed at runtime if not set
val_check_interval: 1.0 # Set to 0.25 to check 4 times per epoch, or an int for number of iterations
accelerator: auto
strategy: ddp
accumulate_grad_batches: 16
gradient_clip_val: 0.0
precision: 32 # Should be set to 16 for O1 and O2 to enable the AMP.
log_every_n_steps: 10 # Interval of logging.
progress_bar_refresh_rate: 10
resume_from_checkpoint: null # The path to a checkpoint file to continue the training, restores the whole state including the epoch, step, LR schedulers, apex, etc.
num_sanity_val_steps: 0 # number of steps to perform validation steps for sanity check the validation process before starting the training, setting to 0 disables it
check_val_every_n_epoch: 1 # number of evaluations on validation every n epochs
sync_batchnorm: true
enable_checkpointing: False # Provided by exp_manager
logger: false # Provided by exp_manager
benchmark: false # needs to be false for models with variable-length speech input as it slows down training
exp_manager:
exp_dir: null
name: ${name}
create_tensorboard_logger: true
create_checkpoint_callback: true
checkpoint_callback_params:
# in case of multiple validation sets, first one is used
monitor: "val_wer"
mode: "min"
save_top_k: 5
always_save_nemo: True # saves the checkpoints as nemo files instead of PTL checkpoints
# you need to set these two to True to continue the training
resume_if_exists: false
resume_ignore_no_checkpoint: false
# You may use this section to create a W&B logger
create_wandb_logger: false
wandb_logger_kwargs:
name: null
project: null
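One thing worth noting from the header comment: the default learning parameters assume an effective batch size of around 2K, while my single-GPU setup gives a much smaller one (quick back-of-the-envelope arithmetic below, just for illustration):

# Effective batch size = per-GPU batch size * grad accumulation * number of GPUs.
batch_size = 32                # model.train_ds.batch_size
accumulate_grad_batches = 16   # trainer.accumulate_grad_batches
num_gpus = 1                   # single-GPU training

effective_batch_size = batch_size * accumulate_grad_batches * num_gpus
print(effective_batch_size)    # 512, well below the ~2K the defaults were tuned for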
The tokenizer was trained like this:
python $NEMO_ROOT/scripts/tokenizers/process_asr_text_tokenizer.py \
--manifest=/home/ubuntu/data/train.ds \
--data_root="tokenizer/" \
--vocab_size=128 \
--tokenizer="spe" \
--no_lower_case \
--spe_type="unigram" \
--spe_character_coverage=1.0 \
--log
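And this is roughly how training is launched (a minimal sketch assuming NeMo 1.x, with the config above saved as conformer_ctc_bpe.yaml; the manifest paths and tokenizer directory name are placeholders and may differ from my actual setup):

import pytorch_lightning as pl
from omegaconf import OmegaConf

import nemo.collections.asr as nemo_asr
from nemo.utils.exp_manager import exp_manager

# Load the config shown above and fill in the dataset/tokenizer paths (placeholders).
cfg = OmegaConf.load("conformer_ctc_bpe.yaml")
cfg.model.train_ds.manifest_filepath = "/path/to/train_manifest.json"
cfg.model.validation_ds.manifest_filepath = "/path/to/val_manifest.json"
cfg.model.tokenizer.dir = "tokenizer/tokenizer_spe_unigram_v128"  # output dir of the script above (name may differ)

# Note: some trainer keys (e.g. progress_bar_refresh_rate) may need to be removed
# depending on the installed Lightning version.
trainer = pl.Trainer(**cfg.trainer)
exp_manager(trainer, cfg.get("exp_manager", None))

model = nemo_asr.models.EncDecCTCModelBPE(cfg=cfg.model, trainer=trainer)
trainer.fit(model)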
So I guess what I am really asking is: with what configuration would the model generalize better on unseen data? Right now, for some reason unknown to me, the model drastically overfits on training / "training-like" data, and on any other audio it is impossible to get a single word right...
Side notes: if I use 16-bit precision, the loss becomes NaN most of the time (because of the RIR augmentation, I guess); otherwise it trains well, but still has the generalization issues. accumulate_grad_batches was increased because I am training on only 1 GPU with 16 GB of VRAM.
Never mind, I just trained it; generalization is a big problem even for the large models, compared to wav2vec2.
I'm on vacation so I just saw this.
Does wav2vec2 do better for your case without training? What domain of speech are you trying to train/eval on?
Oh, sorry for the late reply. I was trying to train the model on the same split as wav2vec2. The domain is general Georgian language (the data is from YouTube videos and other custom-labeled recordings). The model overfits beyond recognition. It's really weird: the validation set got nice scores, but anything transcribed besides audio from the splits was unreadable... With the same training and the same testing, wav2vec2 was almost not missing a character: https://github.com/facebookresearch/fairseq/tree/main/examples/wav2vec I don't think the model available here https://catalog.ngc.nvidia.com/models was trained with the same config, because on English it works perfectly fine on my custom recorded voice. (Sample rates and things like that were not the issue, I have triple-checked them.)
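(For comparison, this is roughly how I checked the pretrained English checkpoint from NGC on my own recording; a minimal sketch, with the wav path as a placeholder, and on newer NeMo versions the transcribe() arguments may differ:)

import nemo.collections.asr as nemo_asr

# Download the pretrained English Conformer-CTC large checkpoint from NGC.
asr_model = nemo_asr.models.EncDecCTCModelBPE.from_pretrained(model_name="stt_en_conformer_ctc_large")

# Transcribe a 16 kHz mono recording (placeholder path).
print(asr_model.transcribe(["my_recording.wav"]))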
To clarify, wav2vec2 is pretrained before being fine-tuned; unfortunately, SSL pretraining of the Conformer did not bring any significant improvement.