fasterrcnn-pytorch-training-pipeline
CUDA out of memory: resnet101
When I use the fasterrcnn model with the resnet101 backbone, I always get a CUDA out of memory error in "evaluate" after one epoch. Inside evaluate itself, the error is raised at output = model(images), but only after a few loops, so I think there is a leak of CUDA memory somewhere. Also, my batch size is already only 1 and my images are 600 x 600 pixels. Any recommendations? (My GPU memory is 8 GB.)
Thanks in advance.
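A common first check for this kind of failure is to make sure the validation forward pass runs without autograd and that no GPU tensors from the model outputs stay alive across iterations. The sketch below is generic PyTorch, not the repository's actual torch_utils.engine.evaluate, and the run_validation helper name is hypothetical:

```python
import torch

@torch.inference_mode()  # disables autograd so activations are not kept for a backward pass
def run_validation(model, data_loader, device):
    model.eval()
    torch.cuda.empty_cache()  # release cached blocks left over from the training loop
    results = []
    for images, _targets in data_loader:
        images = [img.to(device) for img in images]
        outputs = model(images)
        # Move predictions to the CPU right away so GPU tensors from earlier
        # iterations cannot accumulate across the loop.
        results.append([{k: v.cpu() for k, v in out.items()} for out in outputs])
    return results
```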
@EmmaVanPuyenbr Hi. Does it complete one entire epoch (training + validation)? Or does the error happen on the first epoch only after the training loop completes?
So it finishes the training (train_one_epoch from torch_utils.engine), but it stops in the evaluation (evaluate from torch_utils.engine).
Can you check your GPU memory usage once while training? After the training loop, there will be a slight surge in GPU memory usage when the validation loop starts. After that, it will drop to less than what is needed for training. Maybe the usage is at the brink of the maximum and it cannot allocate that small extra amount for a few seconds when the validation loop starts.
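One way to do this check directly from the training script is with PyTorch's CUDA memory statistics. This is a sketch; the log_gpu_memory helper and the commented call sites are assumptions, not code from this repository:

```python
import torch

def log_gpu_memory(tag: str, device: int = 0) -> None:
    # Memory currently and at peak occupied by tensors (MiB).
    allocated = torch.cuda.memory_allocated(device) / 1024**2
    peak = torch.cuda.max_memory_allocated(device) / 1024**2
    # Memory held by PyTorch's caching allocator (closer to what nvidia-smi reports).
    reserved = torch.cuda.memory_reserved(device) / 1024**2
    print(f"[{tag}] allocated={allocated:.0f} MiB, peak={peak:.0f} MiB, reserved={reserved:.0f} MiB")

# Hypothetical call sites around the two loops:
# log_gpu_memory("after train_one_epoch")
# evaluate(model, valid_loader, device=device)
# log_gpu_memory("after evaluate")
```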
I did that, but the evaluation runs a few loops, so I don't think the problem is there. It runs 5 loops of the evaluation (so output = model(images) runs 5 times), but after that I get the error. I also tried smaller tiles (400 x 400), but it still gives an error.
Ok. Can you let me know the GPU model and the memory usage during the training loop?
GPU 0: NVIDIA GeForce RTX 3060 Ti and around 5 GB
Oh. I too do all my experiments on an RTX GPU, but I have never faced this. It may take some time to figure this out if I am unable to reproduce it soon.
Well, I don't get the CUDA out of memory error if I use a ResNet50 or MobileNetV3 Large model; so far it only happens with ResNet101.
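For context, this is roughly how a Faster R-CNN with a ResNet-101 FPN backbone can be assembled with torchvision. It is only a sketch and may not match this repository's model-creation code (older torchvision releases take pretrained= instead of weights=), but it illustrates why the deeper backbone needs noticeably more activation memory than ResNet50 or MobileNetV3 at the same input size:

```python
from torchvision.models.detection import FasterRCNN
from torchvision.models.detection.backbone_utils import resnet_fpn_backbone

# ResNet-101 + FPN backbone; the deeper network stores considerably larger
# intermediate activations than ResNet-50 for the same image resolution.
backbone = resnet_fpn_backbone(backbone_name="resnet101", weights=None)

# num_classes here is arbitrary for illustration (1 object class + background).
model = FasterRCNN(backbone, num_classes=2)
```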
Ok. That's interesting, because while creating the codebase I took care to check all the places where memory overflow might happen, and I never faced these issues. Anyway, thanks for reporting this. I will try to look into it.
I tested it again and did not face any issues. The GPU memory usage remained exactly the same during the validation loop and did not increase at all.