Model parameters require gradients during distributed testing
Describe the bug
Transformer-based models can cause a CUDA out-of-memory error when evaluated on multiple GPUs.
Reproduction
Set `max_seq_len` to a large number, e.g. 1000, then evaluate a transformer-based model on multiple GPUs.
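For concreteness, a hypothetical MMOCR-style config fragment; the exact place where `max_seq_len` lives depends on the model, but attention-based recognizers typically set it on the label convertor:

```python
# Hypothetical config fragment (field names follow common attention
# recognizer configs and may differ for your model).
label_convertor = dict(
    type='AttnConvertor',
    dict_type='DICT90',
    with_unknown=True,
    max_seq_len=1000)  # unusually large value that makes the decoding loop long
```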
Environment
PyTorch 1.6.0
Bug fix
The problem is caused by:
- DistributedDataParallel does not accept a model in which every parameter's `requires_grad` property is set to `False`.
- Consequently, model parameters must require gradients, since (MM)DistributedDataParallel is used for distributed testing.
- The CUDA memory used to store gradients keeps growing throughout the decoding loop; `torch.no_grad()` does not help here.
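A minimal sketch of the first point (illustrative only; actually constructing the wrapper also requires an initialized process group):

```python
import torch.nn as nn

model = nn.Linear(4, 4)
for p in model.parameters():
    p.requires_grad = False

# With every parameter frozen, wrapping fails: DistributedDataParallel
# requires at least one parameter with requires_grad=True.
# (Also assumes torch.distributed.init_process_group() has been called.)
# nn.parallel.DistributedDataParallel(model)  # -> RuntimeError
```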
So, an inelegant fix for this problem is:
```python
# Freeze all parameters so no gradient buffers accumulate during decoding...
for param in model.parameters():
    param.requires_grad = False
# ...but re-enable gradients on a single parameter, because
# DistributedDataParallel refuses a model with no trainable parameters.
for param in model.parameters():
    param.requires_grad = True
    break
```
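For context, a sketch of where this patch could sit in a distributed test entry point; the wrapper call follows MMCV's `MMDistributedDataParallel`, while the surrounding function is illustrative:

```python
import torch
from mmcv.parallel import MMDistributedDataParallel

def wrap_for_distributed_test(model):
    # Apply the workaround above before wrapping the model.
    for param in model.parameters():
        param.requires_grad = False
    for param in model.parameters():
        param.requires_grad = True
        break
    # Wrap for multi-GPU testing; broadcast_buffers=False mirrors the
    # usual MMDetection/MMOCR test setup.
    return MMDistributedDataParallel(
        model.cuda(),
        device_ids=[torch.cuda.current_device()],
        broadcast_buffers=False)
```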
However, an automatic scheduling script might be a better solution than relying on DistributedDataParallel.
Or has this problem been fixed in the latest version of PyTorch?
Thanks for reporting that. It sounds like a feature that MMCV should support? @zhouzaida