TensorFlowASR
Out of memory issue on some GPUs
Hello Team, I ran the LibriSpeech ContextNet experiment successfully on an RTX 3080 and a V100 with batch sizes of 4 and 8 respectively.
However, the same setup gives me an out-of-memory error on an A100 and an RTX 3090 with a batch size of 10. With a batch size of 2 the ETA becomes 2 days per epoch, so something is definitely wrong. On the A100 at least, I should be able to train with a larger batch size.
Any ideas on this behaviour? @usimarit I know the default strategy uses set_memory_growth=True (see the sketch below), but that doesn't seem to help here.
I use TF 2.5 and CUDA 11.2 (same behaviour with TF 2.4), ASRSliceDataset for the dataset, buffer_size=100 and cache=False in the config file.
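For reference, a minimal sketch of what `set_memory_growth=True` typically corresponds to in plain TF 2.x; TensorFlowASR's strategy helper presumably wraps something similar, so the standalone API shown here is an assumption about how the setting is applied, not the library's internal code:

```python
import tensorflow as tf

# Enable incremental GPU memory allocation instead of reserving all device
# memory up front. This helps with fragmentation and sharing, but it does
# not reduce the model's actual peak usage, so it cannot fix a genuine OOM
# on its own.
for gpu in tf.config.list_physical_devices("GPU"):
    tf.config.experimental.set_memory_growth(gpu, True)
```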
@Jamesswiz Did you use mixed precision? On a TPU with 8 GB per core, I was able to fit batch size 6 for the 12M-parameter pretrained ContextNet. For GPU it should be around batch size 4 (with mixed precision). 2 days/epoch for ContextNet is slow; what was your GPU utilization?
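For reference, a minimal way to enable mixed precision globally with the plain Keras API in TF 2.4+; TensorFlowASR may expose this through its config instead, so this is only an illustrative sketch:

```python
import tensorflow as tf

# Run compute in float16 while keeping variables in float32. On Ampere GPUs
# (A100, RTX 3090) this roughly halves activation memory and engages tensor
# cores, which is usually what lets the larger batch sizes fit.
tf.keras.mixed_precision.set_global_policy("mixed_float16")
```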
I’ll close the issue here. Feel free to reopen if you have further questions.