TensorFlowASR
Out of memory issue on some GPUs
Hello Team, I ran the LibriSpeech ContextNet experiment successfully on an RTX 3080 and a V100 with batch sizes of 4 and 8 respectively.
However, the same setup gives me an out-of-memory error on an A100 and an RTX 3090 with a batch size of 10. With a batch size of 2 the ETA becomes 2 days per epoch, so something is definitely wrong. On the A100 at least, I should be able to train with a larger batch size.
Any ideas on this behaviour? @usimarit I know the default strategy uses set_memory_growth=True (see the sketch below), but that doesn't seem to help here.
I use TF 2.5 and CUDA 11.2 (same behaviour with TF 2.4), ASRSliceDataset for the dataset, buffer_size=100 and cache=False in the config file.
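For reference, a minimal sketch of what `set_memory_growth=True` typically corresponds to in plain TF 2.x; TensorFlowASR's strategy helper presumably wraps something similar, so the standalone API shown here is an assumption about how the setting is applied, not the library's internal code:

```python
import tensorflow as tf

# Enable incremental GPU memory allocation instead of reserving all device
# memory up front. This helps with fragmentation and sharing, but it does
# not reduce the model's actual peak usage, so it cannot fix a genuine OOM
# on its own.
for gpu in tf.config.list_physical_devices("GPU"):
    tf.config.experimental.set_memory_growth(gpu, True)
```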
@Jamesswiz Did you use mixed precision? On a TPU with 8 GB per core, I was able to fit batch size 6 for the 12M-parameter pretrained ContextNet. For GPU it should be around batch size 4 (with mixed precision). 2 days/epoch for ContextNet is slow; what was your GPU utilization?
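For reference, a minimal way to enable mixed precision globally with the plain Keras API in TF 2.4+; TensorFlowASR may expose this through its config instead, so this is only an illustrative sketch:

```python
import tensorflow as tf

# Run compute in float16 while keeping variables in float32. On Ampere GPUs
# (A100, RTX 3090) this roughly halves activation memory and engages tensor
# cores, which is usually what lets the larger batch sizes fit.
tf.keras.mixed_precision.set_global_policy("mixed_float16")
```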
I’ll close the issue here. Feel free to reopen if you have further questions.