Model parameters require gradients during distributed testing
Describe the bug
Transformer-based models can cause a CUDA out-of-memory error when evaluated on multiple GPUs.
Reproduction
Set `max_seq_len` to a large number, e.g. 1000, then evaluate a transformer-based model on multiple GPUs.
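For concreteness, a hypothetical MMOCR-style config fragment; the exact place where `max_seq_len` lives depends on the model, but attention-based recognizers typically set it on the label convertor:

```python
# Hypothetical config fragment (field names follow common attention
# recognizer configs and may differ for your model).
label_convertor = dict(
    type='AttnConvertor',
    dict_type='DICT90',
    with_unknown=True,
    max_seq_len=1000)  # unusually large value that makes the decoding loop long
```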
Environment
PyTorch 1.6.0
Bug fix
The problem is caused by:
- DistributedDataParallel does not accept a model in which every parameter's `requires_grad` property is set to `False`.
- Consequently, model parameters must require gradients, since (MM)DistributedDataParallel is used for distributed testing.
- The CUDA memory used to store gradients keeps growing throughout the decoding loop; `torch.no_grad()` does not help here.
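A minimal sketch of the first point (illustrative only; actually constructing the wrapper also requires an initialized process group):

```python
import torch.nn as nn

model = nn.Linear(4, 4)
for p in model.parameters():
    p.requires_grad = False

# With every parameter frozen, wrapping fails: DistributedDataParallel
# requires at least one parameter with requires_grad=True.
# (Also assumes torch.distributed.init_process_group() has been called.)
# nn.parallel.DistributedDataParallel(model)  # -> RuntimeError
```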
So, an inelegant fix for this problem is:
```python
# Freeze all parameters so no gradient buffers accumulate during decoding...
for param in model.parameters():
    param.requires_grad = False
# ...but re-enable gradients on a single parameter, because
# DistributedDataParallel refuses a model with no trainable parameters.
for param in model.parameters():
    param.requires_grad = True
    break
```
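For context, a sketch of where this patch could sit in a distributed test entry point; the wrapper call follows MMCV's `MMDistributedDataParallel`, while the surrounding function is illustrative:

```python
import torch
from mmcv.parallel import MMDistributedDataParallel

def wrap_for_distributed_test(model):
    # Apply the workaround above before wrapping the model.
    for param in model.parameters():
        param.requires_grad = False
    for param in model.parameters():
        param.requires_grad = True
        break
    # Wrap for multi-GPU testing; broadcast_buffers=False mirrors the
    # usual MMDetection/MMOCR test setup.
    return MMDistributedDataParallel(
        model.cuda(),
        device_ids=[torch.cuda.current_device()],
        broadcast_buffers=False)
```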
However, an automatic scheduling script might be a better solution than relying on DistributedDataParallel.
Or has this problem been fixed in the latest version of PyTorch?
Thanks for reporting that. It sounds like a feature that MMCV should support? @zhouzaida