
CUDA out of memory: resnet101

Open EmmaVanPuyenbr opened this issue 2 years ago • 10 comments

When I use the Faster RCNN model with a ResNet101 backbone, I always get a CUDA out of memory error in "evaluate" after one epoch. Inside evaluate, the error occurs at output = model(images), but only after a few loops, so I think there is a leak of CUDA memory somewhere. Furthermore, my batch size is already only 1 and my images are 600x600 pixels. Any recommendations? (My GPU has 8 GB of memory.)

Thanks in advance.

EmmaVanPuyenbr avatar Feb 15 '23 09:02 EmmaVanPuyenbr

@EmmaVanPuyenbr Hi. Does it complete one full epoch (training + validation)? Or does the error happen in the first epoch, right after the training loop completes?

sovit-123 avatar Feb 15 '23 10:02 sovit-123

So it finishes the training (train_one_epoch from torch_utils.engine) but it stops in the evaluation (evaluate from torch_utils.engine).

EmmaVanPuyenbr avatar Feb 15 '23 10:02 EmmaVanPuyenbr

Can you check your GPU memory usage once while training? After the training loop, there will be a slight surge in GPU memory usage when the validation loop starts. After that, it will drop to less than what is needed for training. Maybe the training loop is already at the brink of maximum memory, and the GPU cannot allocate that small extra amount for a few seconds when the validation loop starts.
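For example, a quick way to log GPU memory around the training and validation loops (a hedged sketch; `gpu_mem_gb` is a hypothetical helper, not part of this repository):

```python
import torch

def gpu_mem_gb():
    """Return (currently allocated, peak allocated) CUDA memory in GB.

    Falls back to zeros on CPU-only machines so the helper is safe
    to call anywhere.
    """
    if not torch.cuda.is_available():
        return 0.0, 0.0
    return (torch.cuda.memory_allocated() / 1e9,
            torch.cuda.max_memory_allocated() / 1e9)

# Example: call this right after train_one_epoch() and again inside the
# validation loop to see whether allocation really surges at eval start.
alloc, peak = gpu_mem_gb()
print(f"allocated: {alloc:.2f} GB, peak: {peak:.2f} GB")
```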

sovit-123 avatar Feb 15 '23 10:02 sovit-123

I did that, but evaluation can run a few loops, so I don't think that is the problem. It runs 5 loops of the evaluation (output = model(images) runs 5 times), but after that I get the error. I also tried smaller tiles (400x400), but it still gives the error.

EmmaVanPuyenbr avatar Feb 15 '23 10:02 EmmaVanPuyenbr

Ok. Can you let me know the GPU model and the memory usage during the training loop?

sovit-123 avatar Feb 15 '23 11:02 sovit-123

GPU 0: NVIDIA GeForce RTX 3060 Ti and around 5 GB

EmmaVanPuyenbr avatar Feb 15 '23 11:02 EmmaVanPuyenbr

Oh. I too run all my experiments on an RTX GPU, but I have never faced this. It may take some time to figure out if I am unable to reproduce it soon.

sovit-123 avatar Feb 15 '23 11:02 sovit-123

Well, I don't get the CUDA out of memory error if I use a ResNet50 model or MobileNetV3 Large. So far it only happens with ResNet101.

EmmaVanPuyenbr avatar Feb 15 '23 13:02 EmmaVanPuyenbr

Ok, that's interesting, because while creating the codebase I took care to check all the places where memory overflow might happen, and I never faced this issue. Anyway, thanks for reporting it. I will try to look into it.
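One pattern worth auditing for in any detection evaluation loop (a general sketch of leak-safe evaluation, not the actual code from torch_utils.engine): predictions should be computed without autograd and moved off the GPU before they are accumulated.

```python
import torch

@torch.no_grad()  # no autograd graph is recorded, the usual eval-time OOM culprit
def evaluate_safe(model, data_loader, device):
    model.eval()
    results = []
    for images, _targets in data_loader:
        images = [img.to(device) for img in images]
        outputs = model(images)
        # Detach and move predictions to the CPU immediately so tensors
        # from earlier iterations do not pile up in CUDA memory.
        results.append([{k: v.detach().cpu() for k, v in out.items()}
                        for out in outputs])
    return results
```

If a loop accumulates raw GPU outputs (or anything still attached to the graph) across iterations, memory grows step by step and fails a few loops in, which matches the symptom above.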

sovit-123 avatar Feb 15 '23 13:02 sovit-123

I tested it again and did not face any issues. The GPU memory usage remained exactly the same during the validation loop and did not increase at all.
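If the failure turns out to be allocator fragmentation rather than a true leak, a workaround worth trying (a hedged suggestion, not a confirmed fix for this issue) is to hand cached blocks back to the driver between the training and validation loops:

```python
import torch

def release_cached_memory():
    # empty_cache() does not free live tensors; it only returns cached,
    # unused blocks to the CUDA driver, which can unblock a large
    # allocation that would otherwise fail due to fragmentation.
    if torch.cuda.is_available():
        torch.cuda.empty_cache()

# Call between train_one_epoch(...) and evaluate(...).
release_cached_memory()
```

Setting the environment variable PYTORCH_CUDA_ALLOC_CONF (e.g. max_split_size_mb) before launching training is another fragmentation-related knob PyTorch exposes.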

sovit-123 avatar Feb 15 '23 16:02 sovit-123