NeMo
CUDA out of memory
Hi,
I am trying to train a Conformer-Transducer model on LibriSpeech, but I am getting an out-of-memory error. I am using the conformer_transducer_bpe.yaml config to initialize an EncDecRNNTBPEModel class (I also tried the char version) with batch_size=8 on an NVIDIA K80 GPU with 12 GB of memory, and this is the error I am getting:
```
...
File "/home/ec2-user/anaconda3/envs/nemo/lib/python3.8/site-packages/nemo/collections/asr/modules/rnnt.py", line 992, in joint
  res = self.joint_net(inp)  # [B, T, U, V + 1]
File "/home/ec2-user/anaconda3/envs/nemo/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
  return forward_call(*input, **kwargs)
File "/home/ec2-user/anaconda3/envs/nemo/lib/python3.8/site-packages/torch/nn/modules/container.py", line 139, in forward
  input = module(input)
File "/home/ec2-user/anaconda3/envs/nemo/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
  return forward_call(*input, **kwargs)
File "/home/ec2-user/anaconda3/envs/nemo/lib/python3.8/site-packages/torch/nn/modules/linear.py", line 114, in forward
  return F.linear(input, self.weight, self.bias)
Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: CUDA out of memory. Tried to allocate 2.16 GiB (GPU 0; 11.17 GiB total capacity; 8.88 GiB already allocated; 1.76 GiB free; 9.01 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
```
I managed to train an EncDecCTCModelBPE model with a much larger number of parameters on the same dataset with a larger batch size, so I am not sure why I am getting an out-of-memory error for a smaller model.
You might need to use a smaller batch size of 4 and use gradient accumulation instead. RNNT models take much more memory than CTC models.
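For intuition on why the transducer runs out of memory where the larger CTC model did not: the traceback shows the allocation happening in the joint network, which materializes a 4-D logit tensor of shape [B, T, U, V + 1], whereas CTC logits are only [B, T, V + 1]. A rough back-of-the-envelope sketch, with assumed LibriSpeech-like values (the real T, U, and vocabulary size depend on your data and tokenizer):

```python
# Rough size estimate of the RNN-T joint output tensor [B, T, U, V + 1].
# All values below are illustrative assumptions, not measured ones.
B = 8                # batch size from the question
T = 400              # encoder frames after 4x subsampling (~16 s of audio)
U = 100              # target token length of the utterance
V = 1024             # BPE vocabulary size
bytes_per_float = 4  # fp32

joint_bytes = B * T * U * (V + 1) * bytes_per_float  # RNN-T joint logits
ctc_bytes = B * T * (V + 1) * bytes_per_float        # CTC logits [B, T, V + 1]

print(f"joint logits: {joint_bytes / 2**30:.2f} GiB")
print(f"ctc logits:   {ctc_bytes / 2**20:.2f} MiB")
```

With these assumptions a single activation tensor is already over a gibibyte, and it scales linearly with batch size, which is why halving the batch (plus gradient accumulation to keep the effective batch size) helps so much.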
You can also reduce `fused_batch_size` to lower the memory consumption.
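When the fused loss/WER path is enabled, the joint and the loss are computed over sub-batches of `fused_batch_size` utterances instead of the whole batch, trading speed for memory. A sketch of the relevant section of conformer_transducer_bpe.yaml (the default of 16 shown here is an assumption; check the values in your copy of the config):

```yaml
model:
  joint:
    # Evaluate the joint network and loss over sub-batches of this many
    # utterances rather than the full batch; smaller values use less memory.
    fuse_loss_wer: true
    fused_batch_size: 4   # reduced from the assumed default of 16
```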
A batch size of 2 does not throw an error. I think I will have to stick with that. Thanks.
This issue is stale because it has been open for 60 days with no activity.
This issue was closed because it has been inactive for 7 days since being marked as stale.