
RuntimeError: CUDA out of memory

Open · KendoClaw1 opened this issue 1 year ago

I am trying to train a yolov7-tiny model on a custom dataset. I am training on Kaggle, which offers a free GPU, but PyTorch allocates more than 90% of the available memory, which makes training fail. I tried training on my local machine and got the same error, and I tried reducing the image size, number of workers, and batch size, but the result is the same. I have no problems training yolov5 with the exact same setup.

my training command: !python train.py --workers 4 --device 0 --batch-size 16 --data /kaggle/working/dataset/config/custom.yaml --img 640 --cfg /kaggle/working/dataset/config/yolov7-custom.yaml --weights 'yolov7-tiny.pt' --name yolov7

Why does PyTorch allocate most of the GPU memory?
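For context, PyTorch's caching allocator holds on to memory it has already requested from CUDA and reuses it, so the "reserved" figure (what nvidia-smi shows) is usually much larger than the memory live tensors actually occupy. A minimal sketch for comparing the two counters; the helper name and printout are illustrative and not part of YOLOv7:

import torch

def report_gpu_memory(tag=""):
    # Memory occupied by live tensors
    allocated = torch.cuda.memory_allocated() / 1024**3
    # Memory held by PyTorch's caching allocator (what nvidia-smi reports)
    reserved = torch.cuda.memory_reserved() / 1024**3
    print(f"{tag}: allocated={allocated:.2f} GiB, reserved={reserved:.2f} GiB")

report_gpu_memory("before forward pass")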

Error logs:

Traceback (most recent call last):
  File "train.py", line 610, in <module>
    train(hyp, opt, device, tb_writer)
  File "train.py", line 361, in train
    pred = model(imgs)  # forward
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/kaggle/working/dataset/yolov7/models/yolo.py", line 587, in forward
    return self.forward_once(x, profile)  # single-scale inference, train
  File "/kaggle/working/dataset/yolov7/models/yolo.py", line 613, in forward_once
    x = m(x)  # run
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/kaggle/working/dataset/yolov7/models/common.py", line 108, in forward
    return self.act(self.bn(self.conv(x)))
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/activation.py", line 394, in forward
    return F.silu(input, inplace=self.inplace)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/functional.py", line 2032, in silu
    return torch._C._nn.silu(input)
RuntimeError: CUDA out of memory. Tried to allocate 1.56 GiB (GPU 0; 15.90 GiB total capacity; 13.66 GiB already allocated; 236.75 MiB free; 14.58 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
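The last line of the error suggests tuning the allocator via PYTORCH_CUDA_ALLOC_CONF. A hedged sketch of setting it in the Kaggle notebook before launching training; the 128 MiB split size is only an example value, not a recommendation from this repo:

import os
# Cap the size of cached blocks to reduce fragmentation (value is illustrative).
# Setting os.environ in the notebook cell is inherited by the !python subprocess.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

!python train.py --workers 4 --device 0 --batch-size 16 --data /kaggle/working/dataset/config/custom.yaml --img 640 --cfg /kaggle/working/dataset/config/yolov7-custom.yaml --weights 'yolov7-tiny.pt' --name yolov7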

KendoClaw1 avatar Aug 05 '22 17:08 KendoClaw1

Decrease the batch size.
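For example, halving the batch size in the original command (8 is just an illustrative value; go lower if it still runs out of memory):

!python train.py --workers 4 --device 0 --batch-size 8 --data /kaggle/working/dataset/config/custom.yaml --img 640 --cfg /kaggle/working/dataset/config/yolov7-custom.yaml --weights 'yolov7-tiny.pt' --name yolov7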

wangzhao-11a avatar Aug 06 '22 05:08 wangzhao-11a

I also met the same problem: I hit the same error after some epochs. Not a solution, but a workaround: the first time, train with --save_period 10, and after the error occurs, rerun with --resume --save_period 10. Then it is possible to continue. See the example commands below.
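In other words (the paths and values are taken from the asker's command; the workaround flags are the ones named above):

# first run: save a checkpoint every 10 epochs
!python train.py --workers 4 --device 0 --batch-size 16 --data /kaggle/working/dataset/config/custom.yaml --img 640 --cfg /kaggle/working/dataset/config/yolov7-custom.yaml --weights 'yolov7-tiny.pt' --name yolov7 --save_period 10

# after the OOM crash: resume the most recent run from its last checkpoint
!python train.py --resume --save_period 10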

knakanishi24 avatar Aug 09 '22 07:08 knakanishi24

Same problem here! I changed the batch size to 1 and reduced the image size and number of workers, and the issue is still there. The GPU memory usage changes from iteration to iteration! I played with the PYTORCH_CUDA_ALLOC_CONF variable too, but the issue did not go away. I also noticed that this happens when the number of classes is high (for example, over 20 classes); I tested it with classNum=3 and it worked like a charm.
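If the reserved figure is much larger than the allocated one, fragmentation may be part of the problem. One way to observe memory behaviour between iterations (a diagnostic sketch, not part of the YOLOv7 code) is to print the allocator summary:

import torch
# Per-pool statistics from the caching allocator, including inactive
# (fragmented) blocks that are reserved but not currently usable.
print(torch.cuda.memory_summary(device=0, abbreviated=True))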

senstar-hsoleimani avatar Nov 16 '22 15:11 senstar-hsoleimani