NeMo
Help with Conformer single-GPU configuration
Hello, I have tried training Conformer a few times with different audio augmentations, but the results are always the same: the model performs well on the training and test audio, but on my own recordings, or any other audio that was not in the training/test set, it performs horribly. I am using ffprobe to validate the encoding, sample rate, and number of channels; all audios have these properties: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 16000 Hz, 1 channels, s16, 256 kb/s. My question is: am I doing something wrong, or why does Conformer lack generalization?
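For reference, this is roughly the kind of sanity check I run on the recordings, in addition to ffprobe (a minimal sketch using only Python's standard-library wave module; the file path is a placeholder):

import wave

# Placeholder path; replace with one of the recordings being transcribed.
path = "my_recording.wav"

with wave.open(path, "rb") as f:
    print("channels:    ", f.getnchannels())              # expect 1
    print("sample rate: ", f.getframerate())              # expect 16000
    print("sample width:", f.getsampwidth(), "bytes")     # expect 2 (pcm_s16le)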
# It contains the default values for training a Conformer-CTC ASR model, large size (~120M) with CTC loss and sub-word encoding.
# Architecture and training config:
# Default learning parameters in this config are set for effective batch size of 2K. To train it with smaller effective
# batch sizes, you may need to re-tune the learning parameters or use higher accumulate_grad_batches.
# Here are the recommended configs for different variants of Conformer-CTC, other parameters are the same as in this config file.
# One extra layer (compared to the original paper) is added to the medium and large variants to compensate for replacing the LSTM decoder with a linear one.
#
# +--------------+---------+---------+----------+------------+-----+
# | Model        | d_model | n_heads | n_layers | time_masks | lr  |
# +==============+=========+=========+==========+============+=====+
# | Small  (13M) |   176   |    4    |    16    |     5      | 5.0 |
# +--------------+---------+---------+----------+------------+-----+
# | Medium (30M) |   256   |    4    |    18    |     5      | 5.0 |
# +--------------+---------+---------+----------+------------+-----+
# | Large (121M) |   512   |    8    |    18    |    10      | 2.0 |
# +--------------+---------+---------+----------+------------+-----+
#
# If you do not want to train with AMP, you may use weight decay of 0.0 or reduce the number of time maskings to 2
# with time_width=100. It may help when you want to train for fewer epochs and need faster convergence.
# With weight_decay=0.0, learning rate may need to get reduced to 2.0.
# You may find more info about Conformer-CTC here: https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/asr/models.html#conformer-ctc
# Pre-trained models of Conformer-CTC can be found here: https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/asr/results.html
# The checkpoint of the large model trained on LibriSpeech with this recipe can be found here: https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_en_conformer_ctc_large_ls
name: "Conformer-CTC-BPE"
model:
sample_rate: 16000
log_prediction: true # enables logging sample predictions in the output during training
ctc_reduction: 'mean_batch'
skip_nan_grad: false
train_ds:
augmentor:
white_noise:
prob: 0.3
min_level: -50
max_level: -10
impulse:
prob: 0.3
manifest_path: /home/ubuntu/ASR/NeMo/scripts/dataset_processing/RIR_DATA/processed/rir.json
shift:
prob: 0.3
min_shift_ms: -5.0
max_shift_ms: 5.0
#time_stretch:
#prob: 0.4
#min_speed_rate: 0.9
#max_speed_rate: 1.1
#num_rates: 5
#n_fft: 512
#gain:
#prob: 0.4
#min_gain_dbfs: -10
#max_gain_dbfs: 10
#========================
#transcode_aug:
#prob: 0.3
#speed:
#prob: 0.3
#resample_type: 'kaiser_best'
#num_rates: 5
#sr: 16000
manifest_filepath: ???
sample_rate: ${model.sample_rate}
batch_size: 32 # you may increase batch_size if your memory allows
shuffle: true
num_workers: 8
pin_memory: true
use_start_end_token: false
trim_silence: false
max_duration: 16.7 # it is set for LibriSpeech, you may need to update it for your dataset
min_duration: 0.1
# tarred datasets
is_tarred: false
tarred_audio_filepaths: null
shuffle_n: 2048
# bucketing params
bucketing_strategy: "synced_randomized"
bucketing_batch_size: null
validation_ds:
manifest_filepath: ???
sample_rate: ${model.sample_rate}
batch_size: 32 # you may increase batch_size if your memory allows
shuffle: false
num_workers: 8
pin_memory: true
use_start_end_token: false
test_ds:
manifest_filepath: null
sample_rate: ${model.sample_rate}
batch_size: 16 # you may increase batch_size if your memory allows
shuffle: false
num_workers: 8
pin_memory: true
use_start_end_token: false
# recommend small vocab size of 128 or 256 when using 4x sub-sampling
# you may find more detail on how to train a tokenizer at: /scripts/tokenizers/process_asr_text_tokenizer.py
tokenizer:
dir: ??? # path to directory which contains either tokenizer.model (bpe) or vocab.txt (wpe)
type: bpe # Can be either bpe (SentencePiece tokenizer) or wpe (WordPiece tokenizer)
preprocessor:
_target_: nemo.collections.asr.modules.AudioToMelSpectrogramPreprocessor
sample_rate: ${model.sample_rate}
normalize: "per_feature"
window_size: 0.025
window_stride: 0.01
window: "hann"
features: 80
n_fft: 512
log: true
frame_splicing: 1
dither: 0.00001
pad_to: 0
pad_value: 0.0
spec_augment:
_target_: nemo.collections.asr.modules.SpectrogramAugmentation
#rect_masks: 5 # Number of rectangles to cut from any given spectrogram
#rect_freq: 50 # Max cut of size 50 along the frequency dimension
#rect_time: 120 # Max cut of size 120 along the time dimension
# SpecAugment parameters
freq_masks: 2 # Cut two frequency bands
freq_width: 27 # ... of width 15 at maximum
time_masks: 5 # Cut out 10 time bands
time_width: 0.05 # ... of width 25 at maximum
encoder:
_target_: nemo.collections.asr.modules.ConformerEncoder
feat_in: ${model.preprocessor.features}
feat_out: -1 # you may set it if you need different output size other than the default d_model
n_layers: 16
d_model: 176
# Sub-sampling params
subsampling: striding # vggnet or striding, vggnet may give better results but needs more memory
subsampling_factor: 4 # must be power of 2
subsampling_conv_channels: -1 # -1 sets it to d_model
# Feed forward module's params
ff_expansion_factor: 4
# Multi-headed Attention Module's params
self_attention_model: rel_pos # rel_pos or abs_pos
n_heads: 4 # may need to be lower for smaller d_models
# [left, right] specifies the number of steps to be seen from left and right of each step in self-attention
att_context_size: [-1, -1] # -1 means unlimited context
xscaling: true # scales up the input embeddings by sqrt(d_model)
untie_biases: true # unties the biases of the TransformerXL layers
pos_emb_max_len: 5000
# Convolution module's params
conv_kernel_size: 31
conv_norm_type: 'batch_norm' # batch_norm or layer_norm
### regularization
dropout: 0.1 # The dropout used in most of the Conformer Modules
dropout_emb: 0.0 # The dropout used for embeddings
dropout_att: 0.1 # The dropout for multi-headed attention modules
decoder:
_target_: nemo.collections.asr.modules.ConvASRDecoder
feat_in: null
num_classes: -1
vocabulary: []
optim:
name: adamw
lr: 5.0
# optimizer arguments
betas: [0.9, 0.98]
# less necessity for weight_decay as we already have large augmentations with SpecAug
# you may need weight_decay for large models, stable AMP training, small datasets, or when lower augmentations are used
# weight decay of 0.0 with lr of 2.0 also works fine
weight_decay: 0
# scheduler setup
sched:
name: NoamAnnealing
d_model: ${model.encoder.d_model}
# scheduler config override
warmup_steps: 10000
warmup_ratio: null
min_lr: 1e-6
trainer:
devices: -1 # number of GPUs, -1 would use all available GPUs
num_nodes: 1
max_epochs: 1000
max_steps: null # computed at runtime if not set
val_check_interval: 1.0 # Set to 0.25 to check 4 times per epoch, or an int for number of iterations
accelerator: auto
strategy: ddp
accumulate_grad_batches: 16
gradient_clip_val: 0.0
precision: 32 # Should be set to 16 for O1 and O2 to enable the AMP.
log_every_n_steps: 10 # Interval of logging.
progress_bar_refresh_rate: 10
resume_from_checkpoint: null # The path to a checkpoint file to continue the training, restores the whole state including the epoch, step, LR schedulers, apex, etc.
num_sanity_val_steps: 0 # number of steps to perform validation steps for sanity check the validation process before starting the training, setting to 0 disables it
check_val_every_n_epoch: 1 # number of evaluations on validation every n epochs
sync_batchnorm: true
enable_checkpointing: False # Provided by exp_manager
logger: false # Provided by exp_manager
benchmark: false # needs to be false for models with variable-length speech input as it slows down training
exp_manager:
exp_dir: null
name: ${name}
create_tensorboard_logger: true
create_checkpoint_callback: true
checkpoint_callback_params:
# in case of multiple validation sets, first one is used
monitor: "val_wer"
mode: "min"
save_top_k: 5
always_save_nemo: True # saves the checkpoints as nemo files instead of PTL checkpoints
# you need to set these two to True to continue the training
resume_if_exists: false
resume_ignore_no_checkpoint: false
# You may use this section to create a W&B logger
create_wandb_logger: false
wandb_logger_kwargs:
name: null
project: null
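One thing worth noting from the header comment: the default learning parameters assume an effective batch size of around 2K, while my single-GPU setup gives a much smaller one (quick back-of-the-envelope arithmetic below, just for illustration):

# Effective batch size = per-GPU batch size * grad accumulation * number of GPUs.
batch_size = 32                # model.train_ds.batch_size
accumulate_grad_batches = 16   # trainer.accumulate_grad_batches
num_gpus = 1                   # single-GPU training

effective_batch_size = batch_size * accumulate_grad_batches * num_gpus
print(effective_batch_size)    # 512, well below the ~2K the defaults were tuned for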
The tokenizer was trained like this:
python $NEMO_ROOT/scripts/tokenizers/process_asr_text_tokenizer.py \
--manifest=/home/ubuntu/data/train.ds \
--data_root="tokenizer/" \
--vocab_size=128 \
--tokenizer="spe" \
--no_lower_case \
--spe_type="unigram" \
--spe_character_coverage=1.0 \
--log
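And this is roughly how training is launched (a minimal sketch assuming NeMo 1.x, with the config above saved as conformer_ctc_bpe.yaml; the manifest paths and tokenizer directory name are placeholders and may differ from my actual setup):

import pytorch_lightning as pl
from omegaconf import OmegaConf

import nemo.collections.asr as nemo_asr
from nemo.utils.exp_manager import exp_manager

# Load the config shown above and fill in the dataset/tokenizer paths (placeholders).
cfg = OmegaConf.load("conformer_ctc_bpe.yaml")
cfg.model.train_ds.manifest_filepath = "/path/to/train_manifest.json"
cfg.model.validation_ds.manifest_filepath = "/path/to/val_manifest.json"
cfg.model.tokenizer.dir = "tokenizer/tokenizer_spe_unigram_v128"  # output dir of the script above (name may differ)

# Note: some trainer keys (e.g. progress_bar_refresh_rate) may need to be removed
# depending on the installed Lightning version.
trainer = pl.Trainer(**cfg.trainer)
exp_manager(trainer, cfg.get("exp_manager", None))

model = nemo_asr.models.EncDecCTCModelBPE(cfg=cfg.model, trainer=trainer)
trainer.fit(model)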
So I guess what I am really asking is: with what configuration would the model generalize better on unseen data? Right now, for some reason unknown to me, the model drastically overfits on training / "training-like" data, and on any other audio it is impossible to get a single word right...
Side notes: if I use 16-bit precision, the loss becomes NaN most of the time (because of the RIR augmentation, I guess); otherwise it trains well, but still has the generalization issues. accumulate_grad_batches was increased because I am training on only 1 GPU with 16 GB of VRAM.
Never mind, I just trained it; generalization is a big problem even for the large models, compared to wav2vec2.
I'm on vacation so I just saw this.
Does wav2vec2 do better for your case without training? What domain of speech are you trying to train/eval on?
Oh, sorry for the late reply. I was trying to train the model on the same split as wav2vec2. The domain is general Georgian language (the data is from YouTube videos and other custom-labeled recordings). The model overfits beyond recognition. It's really weird: the validation set got nice scores, but anything transcribed besides audio from the splits was unreadable... With the same training and the same testing, wav2vec2 was almost not missing a character: https://github.com/facebookresearch/fairseq/tree/main/examples/wav2vec I don't think the model available here https://catalog.ngc.nvidia.com/models was trained with the same config, because on English it works perfectly fine on my custom recorded voice. (Sample rates and things like that were not the issue, I have triple-checked them.)
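(For comparison, this is roughly how I checked the pretrained English checkpoint from NGC on my own recording; a minimal sketch, with the wav path as a placeholder, and on newer NeMo versions the transcribe() arguments may differ:)

import nemo.collections.asr as nemo_asr

# Download the pretrained English Conformer-CTC large checkpoint from NGC.
asr_model = nemo_asr.models.EncDecCTCModelBPE.from_pretrained(model_name="stt_en_conformer_ctc_large")

# Transcribe a 16 kHz mono recording (placeholder path).
print(asr_model.transcribe(["my_recording.wav"]))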
To clarify, wav2vec2 is pretrained before being fine-tuned; unfortunately, SSL pretraining of the Conformer did not bring any significant improvement.