Could there be a memory leak in the conformer_lstm model?
❓ Questions & Help
I am training with the random sampler and the ddp_sharded strategy, but after some training steps I get a CUDA out-of-memory error.
Details
I am training on a SLURM-managed cluster using 2 nodes with 2 Tesla M60 (8 GB) GPUs each. As I understand it, if the model doesn't fit on a single GPU, pytorch-lightning automatically shards it across the available ones.
As expected, the larger I make the batch size, the sooner the error occurs. What I find strange is that it takes a few iterations to crash: if the model and the batch didn't fit, wouldn't it crash right away?
Here I attach a picture of the memory usage:
And these are the parameters I'm using:
audio:
  name: fbank
  sample_rate: 16000
  frame_length: 20.0
  frame_shift: 10.0
  del_silence: false
  num_mels: 80
  apply_spec_augment: true
  apply_noise_augment: false
  apply_time_stretch_augment: false
  apply_joining_augment: false
augment:
  apply_spec_augment: false
  apply_noise_augment: false
  apply_joining_augment: false
  apply_time_stretch_augment: false
  freq_mask_para: 27
  freq_mask_num: 2
  time_mask_num: 4
  noise_dataset_dir: None
  noise_level: 0.7
  time_stretch_min_rate: 0.7
  time_stretch_max_rate: 1.4
dataset:
  dataset: librispeech
  dataset_path: /home/ubuntu/data/librispeech
  dataset_download: false
  manifest_file_path: /home/ubuntu/data/librispeech/libri_subword_manifest.txt
criterion:
  criterion_name: cross_entropy
  reduction: mean
lr_scheduler:
  lr: 0.0001
  scheduler_name: warmup_reduce_lr_on_plateau
  lr_patience: 1
  lr_factor: 0.3
  peak_lr: 0.0001
  init_lr: 1.0e-10
  warmup_steps: 4000
model:
  model_name: conformer_lstm
  encoder_dim: 256
  num_encoder_layers: 6
  num_attention_heads: 4
  feed_forward_expansion_factor: 4
  conv_expansion_factor: 2
  input_dropout_p: 0.1
  feed_forward_dropout_p: 0.1
  attention_dropout_p: 0.1
  conv_dropout_p: 0.1
  conv_kernel_size: 31
  half_step_residual: true
  num_decoder_layers: 2
  decoder_dropout_p: 0.1
  max_length: 128
  teacher_forcing_ratio: 1.0
  rnn_type: lstm
  decoder_attn_mechanism: loc
  optimizer: adam
trainer:
  seed: 1
  accelerator: ddp_sharded  # I hardcoded the necessary parts; basically I tell it to use 2 nodes with 2 GPUs each (see the sketch after this config)
  accumulate_grad_batches: 1
  num_workers: 4
  batch_size: 16
  check_val_every_n_epoch: 1
  gradient_clip_val: 5.0
  logger: wandb
  max_epochs: 20
  save_checkpoint_n_steps: 10000
  auto_scale_batch_size: binsearch
  sampler: random
  name: gpu
  device: gpu
  use_cuda: true
  auto_select_gpus: true
tokenizer:
  sos_token: <s>
  eos_token: </s>
  pad_token: <pad>
  blank_token: <blank>
  encoding: utf-8
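For context on the hardcoded part mentioned in the trainer section: the setup is roughly equivalent to something like this (just a sketch, assuming a Lightning 1.x-style Trainer; newer releases use `strategy=` instead of `accelerator=`):

```python
import pytorch_lightning as pl

# Rough sketch of the hardcoded distributed setup (not the actual openspeech code):
trainer = pl.Trainer(
    num_nodes=2,                 # 2 SLURM nodes
    gpus=2,                      # 2 Tesla M60 (8 GB) GPUs per node
    accelerator="ddp_sharded",   # sharded data parallel, as in the config above
    max_epochs=20,
    gradient_clip_val=5.0,
    accumulate_grad_batches=1,
)
```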
Thank you a lot guys!
When the audio input length is long, the memory seems to explode.
@sooftware So is it designed in such a way that the input length increases within an epoch?
That's not the case. Perhaps each time memory is allocated on the GPU, the amount held in the cache grows and keeps growing, until the memory explodes. (I think.)
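One way to check whether it is cached memory rather than a real leak is to log allocated vs. reserved memory during training. A minimal sketch, assuming a recent PyTorch (`log_cuda_memory` is just a hypothetical helper):

```python
import torch

def log_cuda_memory(step):
    # Compare what live tensors actually use (allocated) with what the caching
    # allocator is holding on to (reserved). A growing gap usually means cache /
    # fragmentation from variable-length batches rather than a true leak.
    allocated = torch.cuda.memory_allocated() / 2**20
    reserved = torch.cuda.memory_reserved() / 2**20
    print(f"step {step}: allocated={allocated:.0f} MiB, reserved={reserved:.0f} MiB")

# e.g. call log_cuda_memory(batch_idx) inside training_step. If only `reserved`
# keeps climbing, torch.cuda.empty_cache() releases the cached blocks (at some
# speed cost); it will not help if `allocated` itself keeps growing.
```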
Any ideas on how to solve it?
I've encountered the same problem. Any ideas on how to solve it?
Hi, I'm not sure it's really a memory leak, as the audio batches can have different lengths during training. In my case the GPU memory usage increased and then decreased during training.
Have you tried decreasing the batch size to leave some room for batches with longer sequences?
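Another option along the same lines would be to group utterances of similar length into the same batch, so a very long sequence never lands in an otherwise full batch. A rough sketch (`LengthBucketSampler` is a hypothetical helper, not something I know openspeech to ship):

```python
import random
from torch.utils.data import Sampler

class LengthBucketSampler(Sampler):
    """Hypothetical sampler: sort utterances by length so each batch contains
    sequences of similar length, then shuffle the order of the batches."""

    def __init__(self, lengths, batch_size):
        self.lengths = lengths          # e.g. frame count per utterance
        self.batch_size = batch_size

    def __iter__(self):
        order = sorted(range(len(self.lengths)), key=lambda i: self.lengths[i])
        batches = [order[i:i + self.batch_size]
                   for i in range(0, len(order), self.batch_size)]
        random.shuffle(batches)         # batches stay homogeneous, order varies
        for batch in batches:
            yield from batch            # DataLoader(batch_size=...) regroups them

    def __len__(self):
        return len(self.lengths)
```

Used as `DataLoader(dataset, batch_size=16, sampler=LengthBucketSampler(lengths, 16))`, the worst case is the batch holding the longest utterances, so the batch size can be tuned against that case explicitly.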
I think @virgile-blg is right, although it is a bit odd that it keeps increasing over the first epochs and then becomes more stable.
One option is to disable PyTorch Lightning's auto_scale_batch_size. When it is set to False, there is no OOM error during the first epoch. I guess the auto-scaler picks the batch size without using the longest sequence in the training set.
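For reference, the difference is roughly this (a sketch, assuming a Lightning 1.x-style Trainer):

```python
import pytorch_lightning as pl

# With auto-scaling enabled, tune() searches for the largest batch size that fits,
# but it probes with whatever batches it happens to draw, not the longest ones:
trainer = pl.Trainer(auto_scale_batch_size="binsearch", gpus=2, num_nodes=2)
# trainer.tune(model)  # may settle on a batch size that later OOMs on long audio

# Disabling it (the Lightning default) keeps the configured batch_size, which can
# then be lowered by hand until even the longest batches fit:
trainer = pl.Trainer(auto_scale_batch_size=False, gpus=2, num_nodes=2)
```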