[BUG] GPU RAM increases during training, causing OOM
Hi, I’m using the notebook from the Cellpose GitHub repo (cellpose/notebooks/train_Cellpose-SAM.ipynb at commit c68d52e08162430c82677bb0bab66a68b95f1898, MouseLand/cellpose) to train on my dataset in Google Colab. Problem: GPU RAM keeps increasing during training (see attached screenshot), which crashes my session after ~200 iterations.
I'm using all the default training parameters, exactly as in the notebook. The only difference is my dataset, which consists of 400 training images (JPG) with corresponding instance segmentation masks (PNG). I am using an L4 GPU.
---------------------------------------------------------------------------
OutOfMemoryError Traceback (most recent call last)
<ipython-input-8-e81981ad1efd> in <cell line: 0>()
19
20 start = time.time()
---> 21 new_model_path, train_losses, test_losses = train.train_seg(model.net,
22 train_data=train_data,
23 train_labels=train_labels,
3 frames
/usr/local/lib/python3.11/dist-packages/cellpose/dynamics.py in masks_to_flows_gpu(masks, device, niter)
118 y = y.int()
119 x = x.int()
--> 120 neighbors = torch.zeros((2, 9, y.shape[0]), dtype=torch.int, device=device)
121 yxi = [[0, -1, 1, 0, 0, -1, -1, 1, 1], [0, 0, 0, -1, 1, -1, 1, -1, 1]]
122 for i in range(9):
OutOfMemoryError: CUDA out of memory. Tried to allocate 72.00 MiB. GPU 0 has a total capacity of 22.16 GiB of which 47.38 MiB is free. Process 5506 has 22.11 GiB memory in use. Of the allocated memory 21.83 GiB is allocated by PyTorch, and 62.39 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
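The traceback suggests setting `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True`. A minimal sketch of how I would set that in a Colab cell (before torch/cellpose are imported, so the allocator picks it up); I assume this only helps with fragmentation, not with the steady memory growth itself:

```python
import os

# Set the allocator option before torch initializes CUDA
# (safest: before torch is imported at all).
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch
from cellpose import models, io, train  # imported after the env var is set
```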
Any ideas what could cause this?
@emoebel A GPU crash during training usually comes from either 1) the model not fitting into memory, or 2) the data not fitting. In your case, the issue looks like data being cached in GPU memory over too many iterations. Can you lower the number of epochs to 50 and see what happens? Also, can you check the shape and size of your images?
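For example, something like the following would print the shape range and total size (assuming `train_data` and `train_labels` are the lists of numpy arrays loaded by the notebook's data-loading cell):

```python
# Assumes train_data / train_labels are lists of numpy arrays,
# e.g. (H, W, C) images and (H, W) instance masks.
heights = [img.shape[0] for img in train_data]
widths = [img.shape[1] for img in train_data]
print(f"{len(train_data)} images, "
      f"heights {min(heights)}-{max(heights)}, widths {min(widths)}-{max(widths)}")
print(f"dtypes: {set(img.dtype for img in train_data)}")
print(f"total image size: {sum(img.nbytes for img in train_data) / 1e6:.1f} MB")
print(f"max instances per mask: {max(int(lbl.max()) for lbl in train_labels)}")
```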
I already lowered the number of epochs to 1. Actually, in my case there are 400 iterations per epoch and it crashes at iteration 200, so it doesn't even finish one epoch. My training set is composed of 400 images of variable sizes, roughly 512 to 800 pixels per side, all with 3 channels.
Isn't GPU RAM supposed to remain constant throughout training?
@emoebel In general GPU RAM won't remain constant during training with Cellpose, because the network output is only an intermediate representation and has to be processed to produce masks. There are built-in pre- and post-processing steps that run on the GPU, and we need to move data around to accomplish that. That said, there are likely still memory inefficiencies, and finding them would be helpful.
Reducing the number of epochs isn't a great idea, since you won't be able to train a good model with too few epochs. Instead, I'd recommend using less data per epoch by setting a lower nimg_per_epoch, maybe 100. You can offset using less data per epoch by training for more epochs.
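Concretely, something like this in the notebook's training cell (keyword names as in `train.train_seg`; the exact values are just examples to tune):

```python
from cellpose import train

# Same call as in the notebook, but sampling fewer images per epoch
# and training for more epochs to compensate.
new_model_path, train_losses, test_losses = train.train_seg(
    model.net,
    train_data=train_data,
    train_labels=train_labels,
    nimg_per_epoch=100,   # fewer images sampled per epoch
    n_epochs=200,         # train longer to offset the smaller epochs
)
```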
Closing due to inactivity.