
CUDA out of memory error during model.fit()

Open TimurNurlygayanov opened this issue 3 years ago • 2 comments

I'm trying the basic example from https://towardsdatascience.com/build-a-custom-trained-object-detection-model-with-5-lines-of-code-713ba7f6c0fb

My video card is an "NVIDIA GeForce MX150" (laptop) with 2 GB of video RAM. OS: Ubuntu 20.04 + NVIDIA driver 470

I have 61 custom images with the object marked on them.

When I execute this simple code:

from detecto import core, utils, visualize
dataset = core.Dataset('images_to_learn/')
model = core.Model(['my_object'])
model.fit(dataset)

it fails at model.fit(dataset) with the error:

Epoch 1 of 10
Begin iterating over training dataset
  0%|                                                           | 0/61 [00:00<?, ?it/s]/home/xwizard/.local/lib/python3.9/site-packages/torch/functional.py:445: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at  ../aten/src/ATen/native/TensorShape.cpp:2157.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
  2%|▊                                                  | 1/61 [00:02<02:07,  2.12s/it]
Traceback (most recent call last):
  File "/home/xwizard/test/main.py", line 24, in <module>
    model.fit(dataset)
  File "/home/xwizard/.local/lib/python3.9/site-packages/detecto/core.py", line 505, in fit
    loss_dict = self._model(images, targets)
  File "/home/xwizard/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/xwizard/.local/lib/python3.9/site-packages/torchvision/models/detection/generalized_rcnn.py", line 96, in forward
    proposals, proposal_losses = self.rpn(images, features, targets)
  File "/home/xwizard/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/xwizard/.local/lib/python3.9/site-packages/torchvision/models/detection/rpn.py", line 354, in forward
    proposals = self.box_coder.decode(pred_bbox_deltas.detach(), anchors)
  File "/home/xwizard/.local/lib/python3.9/site-packages/torchvision/models/detection/_utils.py", line 180, in decode
    pred_boxes = self.decode_single(
  File "/home/xwizard/.local/lib/python3.9/site-packages/torchvision/models/detection/_utils.py", line 223, in decode_single
    pred_boxes1 = pred_ctr_x - c_to_c_w
RuntimeError: CUDA out of memory. Tried to allocate 2.00 MiB (GPU 0; 1.96 GiB total capacity; 1.12 GiB already allocated; 2.88 MiB free; 1.26 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

PyTorch just takes all available memory and crashes.
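For anyone reading the numbers in that error message, here is a rough sketch of the arithmetic behind it. The message text is copied from the traceback above; the helper name `parse_oom_message` and the regular expressions are illustrative assumptions, not part of PyTorch or detecto.

```python
import re

# Error text copied from the traceback above.
MESSAGE = (
    "CUDA out of memory. Tried to allocate 2.00 MiB (GPU 0; 1.96 GiB total capacity; "
    "1.12 GiB already allocated; 2.88 MiB free; 1.26 GiB reserved in total by PyTorch)"
)

UNIT_TO_MIB = {"MiB": 1.0, "GiB": 1024.0}


def parse_oom_message(message):
    """Hypothetical helper: return the sizes quoted in the message, in MiB."""
    patterns = {
        "requested": r"Tried to allocate ([\d.]+) (MiB|GiB)",
        "total": r"([\d.]+) (MiB|GiB) total capacity",
        "allocated": r"([\d.]+) (MiB|GiB) already allocated",
        "free": r"([\d.]+) (MiB|GiB) free",
        "reserved": r"([\d.]+) (MiB|GiB) reserved",
    }
    sizes = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, message)
        if match is None:
            raise ValueError(f"could not find '{name}' in the message")
        value, unit = match.groups()
        sizes[name] = float(value) * UNIT_TO_MIB[unit]
    return sizes


sizes = parse_oom_message(MESSAGE)

# Memory PyTorch holds in its cache but is not currently using for tensors.
cached_unused = sizes["reserved"] - sizes["allocated"]

# Memory on the card that is neither free nor reserved by PyTorch.
outside_pytorch = sizes["total"] - sizes["reserved"] - sizes["free"]

print(f"cached but unused by PyTorch: {cached_unused:.0f} MiB")
print(f"in use outside PyTorch's allocator: {outside_pytorch:.0f} MiB")
```

By this reading, PyTorch has reserved 1.26 GiB, of which about 143 MiB is cached but unallocated, and roughly 714 MiB of the card is occupied outside PyTorch's allocator. That remainder is presumably the CUDA context plus whatever else uses the GPU, such as the desktop session, though the message alone cannot confirm that. Since reserved memory is not much larger than allocated memory here, fragmentation is probably not the main cause; the card may simply be too small for this model at the default settings.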

TimurNurlygayanov, Jan 22 '22 14:01

Could you try some of the solutions listed in this post to see if any of those help?

alankbi, Feb 01 '22 22:02

Adding:

import gc
del dataset
gc.collect()

right before I created and ran my dataset fixed the issue. Hope this helps @TimurNurlygayanov
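A slightly fuller sketch of the same idea, in case it is useful. Note that `del dataset` only works if a `dataset` variable already exists from an earlier run (for example in a notebook); in a fresh script it raises a NameError. The helper name `free_gpu_memory` is made up for illustration; `gc.collect()`, `torch.cuda.is_available()` and `torch.cuda.empty_cache()` are the real calls it relies on.

```python
import gc


def free_gpu_memory():
    """Hypothetical helper: collect garbage, then release PyTorch's cached GPU blocks.

    Returns the number of unreachable objects found by the garbage collector.
    """
    # Drop Python objects that are no longer referenced (e.g. an old dataset or model).
    collected = gc.collect()

    try:
        import torch
    except ImportError:
        # PyTorch is not installed, so there is no GPU cache to release.
        return collected

    if torch.cuda.is_available():
        # Return cached, currently unused GPU memory to the driver.
        torch.cuda.empty_cache()
    return collected


# Assumed usage: call this after deleting references left over from an earlier run,
# before building the new dataset and model.
collected = free_gpu_memory()
print(f"garbage collector found {collected} unreachable objects")
```

This only frees memory that nothing references any more, so it may help after a previous run in the same session but cannot shrink what a single training step needs.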

makya-stell, Mar 18 '22 04:03