CUDA OUT OF MEMORY

Open MuhammadBilal848 opened this issue 11 months ago • 6 comments

I have set everything up for custom training and am using this command to train the model (I am running it on my laptop):

python train_dual.py --workers 8 --device 0 --batch 8 --data 'LP/data.yaml' --img 640 --cfg models/detect/yolov9-e.yaml --weights 'yolov9-e.pt' --name yolov9-e-finetuning --hyp hyp.scratch-high.yaml --min-items 0 --epochs 10 --close-mosaic 15

Getting this error:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 100.00 MiB. GPU 0 has a total capacity of 6.00 GiB of which 2.62 GiB is free. Of the allocated memory 2.24 GiB is allocated by PyTorch, and 78.16 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

Here are my GPU specs:

image

MuhammadBilal848 avatar Mar 22 '24 22:03 MuhammadBilal848

Reduce your batch size.

Also try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

Youho99 avatar Mar 23 '24 10:03 Youho99

I'm running the training from the command line. How do I set PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True for this command?

python train_dual.py --workers 8 --device 0 --batch 8 --data 'LP/data.yaml' --img 640 --cfg models/detect/yolov9-e.yaml --weights 'yolov9-e.pt' --name yolov9-e-finetuning --hyp hyp.scratch-high.yaml --min-items 0 --epochs 10 --close-mosaic 15

MuhammadBilal848 avatar Mar 23 '24 19:03 MuhammadBilal848

You need to set it as an environment variable before running your command.

But also try changing only your batch size.
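For readers unsure what "set it as an environment variable" looks like in practice, here is a minimal sketch in Python that launches the training command from the first post with the variable set for that one process. The helper names (build_env, build_command, launch) and the default batch size of 4 are assumptions made for illustration; they are not part of the YOLOv9 repository. In a Linux or macOS shell the same effect comes from prefixing the command with PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True, and in Windows cmd from running set PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True first. The expandable_segments option may not be supported on every platform or PyTorch build, so treat it as something to try rather than a guaranteed fix.

```python
import os
import subprocess
import sys


def build_env(base=None):
    """Copy an environment mapping and add the allocator setting from the thread."""
    env = dict(os.environ if base is None else base)
    env["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
    return env


def build_command(batch=4):
    """Rebuild the training command from the first post with a configurable batch size.

    batch=4 is an illustrative value (half of the original 8), not a recommendation.
    """
    return [
        sys.executable, "train_dual.py",
        "--workers", "8",
        "--device", "0",
        "--batch", str(batch),
        "--data", "LP/data.yaml",
        "--img", "640",
        "--cfg", "models/detect/yolov9-e.yaml",
        "--weights", "yolov9-e.pt",
        "--name", "yolov9-e-finetuning",
        "--hyp", "hyp.scratch-high.yaml",
        "--min-items", "0",
        "--epochs", "10",
        "--close-mosaic", "15",
    ]


def launch(batch=4):
    """Run the training script as a child process with the variable set."""
    return subprocess.run(build_command(batch), env=build_env(), check=True)
```

Calling launch(batch=4) from the repository root would start training with both suggestions applied. The variable is set only for the child process, so the rest of the session is unaffected.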

Youho99 avatar Mar 23 '24 20:03 Youho99

Thank you for the good answer. I am experiencing the same problem. YOLOv9 has fewer FLOPs and parameters than YOLOv8-x, so why does YOLOv8 run while YOLOv9 fails with an out-of-memory error?

kuacboss avatar Mar 27 '24 10:03 kuacboss

Three things you can try to get you started:

  1. Reduce batch size

  2. Reduce dataset size

  3. In train.py, after line 479 ("del ckpt"), add the following two lines:

    torch.cuda.empty_cache()
    gc.collect()

    Remember to import gc at the top of the file.
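The third suggestion can be sketched as a small helper, assuming it is called right after the checkpoint object is deleted. The function name release_cached_memory is hypothetical, and running gc.collect() before torch.cuda.empty_cache() is a choice made here (so that freed tensors are collected before the cache is emptied), not something the thread specifies. Note that torch.cuda.empty_cache() only returns unused cached blocks to the driver; it does not free memory held by live tensors, so on its own it may not resolve the error.

```python
import gc


def release_cached_memory():
    """Collect garbage, then ask PyTorch to release unused cached GPU memory.

    Returns True if the CUDA cache was emptied, False if PyTorch or CUDA
    is not available in this environment.
    """
    # Drop unreachable Python objects first so their tensors can be freed.
    gc.collect()
    try:
        import torch
    except ImportError:
        return False
    if not torch.cuda.is_available():
        return False
    # Hand unused cached blocks back to the GPU driver.
    torch.cuda.empty_cache()
    return True
```

In the training script this would be called once after "del ckpt", for example release_cached_memory().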

shubzk avatar Mar 29 '24 16:03 shubzk