
Issues that may lead to CUDA out of memory after several epochs

Open · MClarkTurner opened this issue 2 years ago • 9 comments

Hello,

I am looking for input on factors that may lead to an out of memory error after several epochs of training. This seems unusual to me, since it suggests the problem isn't the batch size (the most likely culprit) but is instead triggered by some other condition. I seem to be experiencing the issue between 15 and 50 epochs into training, and my error log indicates the problem is in the loss computation, specifically line 732 of loss.py ("pair_wise_cls_loss = F.binary_cross_entropy_with_logits(...") called from compute_loss_ota (error below).

RuntimeError: CUDA out of memory. Tried to allocate 334.00 MiB (GPU 0; 11.78 GiB total capacity; 9.41 GiB already allocated; 152.06 MiB free; 10.45 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
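As a side note, the allocator hint mentioned in that message can be supplied through the PYTORCH_CUDA_ALLOC_CONF environment variable before the first CUDA allocation. A minimal sketch, with an arbitrary example value rather than a tuned recommendation:

```python
# Sketch: pass the allocator hint from the error message to PyTorch.
# The value 128 is only an example; the variable must be set before the
# first CUDA allocation so the caching allocator picks it up.
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch

# The caching allocator reads the setting when it makes its first CUDA allocation.
if torch.cuda.is_available():
    _ = torch.zeros(1, device="cuda")
```

The same setting can also be exported in the shell before launching train.py.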

The error suggests to me that one of two things is happening:

  1. A stochastic bug brought about by a random conflux of factors, i.e. several images that shouldn't be batched together happen, by random chance, to land in the same batch and push memory usage above the capacity of my GPU. I am running a test with a fixed random seed to see if I can identify the factors that lead the system to break. If this is the case, I am curious whether the following factors influence memory consumption and whether there are any steps to mitigate them:
     i. Image size - I believe images are being resized to 1280x1280 by the system, but would the full-size image be kept in memory anyway?
     ii. Number of detections - several of my images have many (read: >50) detections. Are all of these held in memory during training? Is there a way to limit the number of detections for a mosaic-ed image?

  2. Some data is being kept in memory from iteration to iteration and accumulates over epochs until it crosses the threshold that causes the overflow. I have tried to mitigate this by calling gc.collect() and torch.cuda.empty_cache() after each call to del ckpt in train.py (sketched just below this list), but this has not helped. Is there something else I should consider?
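Roughly what I mean in point 2 (a minimal sketch rather than the repository's exact code; ckpt here is just a stand-in for the checkpoint dict that train.py loads):

```python
# Sketch of the mitigation from point 2: drop the checkpoint and ask the CUDA
# caching allocator to release unused blocks. `ckpt` is a placeholder here.
import gc
import torch

ckpt = {"model": None}  # stand-in for the checkpoint dict loaded in train.py

# ... after the weights have been transferred into the model ...
del ckpt                      # as train.py already does
gc.collect()                  # drop lingering Python-side references
if torch.cuda.is_available():
    torch.cuda.empty_cache()  # return cached, unused blocks to the driver
```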

For completeness, here is the log from YOLOv7 with the garbage collection mentioned in point 2 in place. As can be seen, the reported GPU memory usage fluctuates over the course of the run, from 2.62G (epoch 0) all the way up to 10.2G (epoch 13).

 Epoch   gpu_mem       box       obj       cls     total    labels  img_size
 0/999     2.62G    0.0723   0.05062   0.01789    0.1408        23       640: 100%|██████████| 3056/3056 [13:26<00:00,  3.79it/s]
           Class      Images      Labels           P           R      mAP@.5  mAP@.5:.95: 100%|██████████| 156/156 [00:43<00:00,  3.58it/s]
             all        1866      108712       0.839      0.0697      0.0438      0.0122

 Epoch   gpu_mem       box       obj       cls     total    labels  img_size
 1/999     6.14G   0.05347   0.05427   0.01555    0.1233        24       640: 100%|██████████| 3056/3056 [12:39<00:00,  4.02it/s]
           Class      Images      Labels           P           R      mAP@.5  mAP@.5:.95: 100%|██████████| 156/156 [00:42<00:00,  3.68it/s]
             all        1866      108712       0.861       0.086      0.0664      0.0215

 Epoch   gpu_mem       box       obj       cls     total    labels  img_size
 2/999     10.1G   0.04922   0.05519   0.01405    0.1185        12       640: 100%|██████████| 3056/3056 [12:58<00:00,  3.93it/s]
           Class      Images      Labels           P           R      mAP@.5  mAP@.5:.95: 100%|██████████| 156/156 [00:42<00:00,  3.66it/s]
             all        1866      108712       0.683       0.134      0.0998       0.033

 Epoch   gpu_mem       box       obj       cls     total    labels  img_size
 3/999     8.52G   0.04699   0.05211   0.01198    0.1111         2       640: 100%|██████████| 3056/3056 [12:48<00:00,  3.98it/s]
           Class      Images      Labels           P           R      mAP@.5  mAP@.5:.95: 100%|██████████| 156/156 [00:38<00:00,  4.09it/s]
             all        1866      108712       0.189       0.222        0.13      0.0435

 Epoch   gpu_mem       box       obj       cls     total    labels  img_size
 4/999      8.1G   0.04564   0.05211   0.01069    0.1084       138       640: 100%|██████████| 3056/3056 [12:51<00:00,  3.96it/s]
           Class      Images      Labels           P           R      mAP@.5  mAP@.5:.95: 100%|██████████| 156/156 [00:34<00:00,  4.47it/s]
             all        1866      108712       0.255       0.238       0.145      0.0475

 Epoch   gpu_mem       box       obj       cls     total    labels  img_size
 5/999     10.5G   0.04494   0.05207  0.009186    0.1062        39       640: 100%|██████████| 3056/3056 [12:33<00:00,  4.05it/s]
           Class      Images      Labels           P           R      mAP@.5  mAP@.5:.95: 100%|██████████| 156/156 [00:34<00:00,  4.55it/s]
             all        1866      108712       0.273       0.247       0.155      0.0513

 Epoch   gpu_mem       box       obj       cls     total    labels  img_size
 6/999     8.05G   0.04418   0.05132   0.00865    0.1042       138       640: 100%|██████████| 3056/3056 [12:31<00:00,  4.07it/s]
           Class      Images      Labels           P           R      mAP@.5  mAP@.5:.95: 100%|██████████| 156/156 [00:34<00:00,  4.50it/s]
             all        1866      108712       0.305       0.265       0.176      0.0583

 Epoch   gpu_mem       box       obj       cls     total    labels  img_size
 7/999     8.85G   0.04366   0.05048  0.007873     0.102         8       640: 100%|██████████| 3056/3056 [12:38<00:00,  4.03it/s]
           Class      Images      Labels           P           R      mAP@.5  mAP@.5:.95: 100%|██████████| 156/156 [00:32<00:00,  4.77it/s]
             all        1866      108712       0.321       0.284       0.197      0.0662

 Epoch   gpu_mem       box       obj       cls     total    labels  img_size
 8/999     7.94G   0.04329   0.05041  0.007737    0.1014        57       640: 100%|██████████| 3056/3056 [12:43<00:00,  4.00it/s]
           Class      Images      Labels           P           R      mAP@.5  mAP@.5:.95: 100%|██████████| 156/156 [00:32<00:00,  4.82it/s]
             all        1866      108712       0.321       0.278       0.192      0.0627

 Epoch   gpu_mem       box       obj       cls     total    labels  img_size
 9/999     7.44G   0.04283   0.04934  0.007326   0.09949        38       640: 100%|██████████| 3056/3056 [12:34<00:00,  4.05it/s]
           Class      Images      Labels           P           R      mAP@.5  mAP@.5:.95: 100%|██████████| 156/156 [00:33<00:00,  4.70it/s]
             all        1866      108712       0.342       0.278       0.204      0.0664

 Epoch   gpu_mem       box       obj       cls     total    labels  img_size
10/999     8.55G   0.04274   0.04987  0.006887    0.0995        83       640: 100%|██████████| 3056/3056 [12:33<00:00,  4.06it/s]
           Class      Images      Labels           P           R      mAP@.5  mAP@.5:.95: 100%|██████████| 156/156 [00:32<00:00,  4.86it/s]
             all        1866      108712       0.339       0.302       0.217      0.0765

 Epoch   gpu_mem       box       obj       cls     total    labels  img_size
11/999      5.6G   0.04246   0.04969  0.006769   0.09893        27       640: 100%|██████████| 3056/3056 [12:35<00:00,  4.04it/s]
           Class      Images      Labels           P           R      mAP@.5  mAP@.5:.95: 100%|██████████| 156/156 [00:31<00:00,  4.93it/s]
             all        1866      108712       0.349        0.28       0.205      0.0659

 Epoch   gpu_mem       box       obj       cls     total    labels  img_size
12/999     7.49G   0.04211   0.05023   0.00652   0.09885        12       640: 100%|██████████| 3056/3056 [12:32<00:00,  4.06it/s]
           Class      Images      Labels           P           R      mAP@.5  mAP@.5:.95: 100%|██████████| 156/156 [00:30<00:00,  5.13it/s]
             all        1866      108712       0.394       0.289       0.228      0.0768

 Epoch   gpu_mem       box       obj       cls     total    labels  img_size
13/999     10.2G    0.0418   0.05011  0.006518   0.09842        16       640: 100%|██████████| 3056/3056 [12:28<00:00,  4.08it/s]
           Class      Images      Labels           P           R      mAP@.5  mAP@.5:.95: 100%|██████████| 156/156 [00:30<00:00,  5.07it/s]
             all        1866      108712       0.364       0.297       0.223      0.0745

 Epoch   gpu_mem       box       obj       cls     total    labels  img_size
14/999     9.97G   0.04174   0.04989  0.006495   0.09812       148       640: 100%|██████████| 3056/3056 [12:48<00:00,  3.98it/s]
           Class      Images      Labels           P           R      mAP@.5  mAP@.5:.95: 100%|██████████| 156/156 [00:31<00:00,  4.95it/s]
             all        1866      108712       0.369         0.3       0.227      0.0763

 Epoch   gpu_mem       box       obj       cls     total    labels  img_size
15/999     7.21G   0.04143   0.04982  0.006139   0.09739        23       640: 100%|██████████| 3056/3056 [12:37<00:00,  4.03it/s]
           Class      Images      Labels           P           R      mAP@.5  mAP@.5:.95: 100%|██████████| 156/156 [00:31<00:00,  4.91it/s]
             all        1866      108712        0.36       0.312       0.237      0.0819

 Epoch   gpu_mem       box       obj       cls     total    labels  img_size
16/999     8.75G    0.0418   0.06775   0.00714    0.1167       444       640:  33%|███▎      | 1001/3056 [04:15<08:45,  3.91it/s]

Any suggestions would be appreciated!

MClarkTurner commented on Jan 02 '23

Also, I have iteratively lowered the batch size from 16 down to 4. The code can run for slightly more epochs each time, but this is noticeably slower and doesn't address the fundamental issue causing the problem: I still end up with a CUDA out of memory error.

MClarkTurner commented on Jan 02 '23

Hi,

adding garbage collection (gc.collect() together with torch.cuda.empty_cache()) has resolved the OOM issue for me.

Kranklyboy commented on Jan 03 '23

Hi Kranklyboy, thank you for your input. Unfortunately I am already trying this and have included garbage collection after the del ckpt commands on lines 228 and 479 of train.py. Did you place the garbage collection commands anywhere else?

MClarkTurner commented on Jan 03 '23

No, I did not add any additional GC commands.

As for your point 1.i., I am confident that the original full-size image is not kept in memory anywhere. My dataset uses large images as well, but in my case it is configured to use 640 px squares; if the original size were kept, an overflow would be guaranteed on my hardware.

Regarding 1.ii., my dataset of around 70k images contains about 220 with more than 50 detections; the image with the most detections has 180. Although the share of images with more than 50 detections is quite low, if that many detections were a problem, I think I would be getting the OOM error as well.

For reference, here is the log of the first 20 epochs:

 0/299     7.49G
 1/299     5.66G
 2/299      5.9G
 3/299     5.87G
 4/299      5.9G
 5/299     5.93G
 6/299     5.86G
 7/299     5.86G
 8/299     5.85G
 9/299     5.86G
10/299      5.9G
11/299     5.86G
12/299     5.86G
13/299     5.86G
14/299      5.9G
15/299      5.9G
16/299     5.87G
17/299     5.91G
18/299     5.87G
19/299     5.91G

Did you try training the model with 640 px?

EDIT: Your log says img_size = 640 but you specified 1280 earlier. Which one is it?

Kranklyboy commented on Jan 03 '23

I am training with an img_size of 640 but the images in my dataset are currently 1280 (but from what I gather from your response this is a non-issue).

I have increased my batch size back up to 16 and I am not seeing any noticeable change in behaviour (i.e. it is not crashing any sooner or later than before). I no longer consider the data size or quantity to be the issue here; it may instead be a problem with my YOLOv7 download or with my platform. I'll try tearing it all down and rebuilding everything to see if I can fix the issue.

MClarkTurner commented on Jan 04 '23

@MClarkTurner Good day! Did you figure out what the problem was? If yes, could you share details about it?

EvgenyUgolkov commented on Jun 27 '23

Hello, I am using a Google Colab Tesla T4 GPU with 15 GB of memory and training on 7k images for 50 epochs, but after 30 epochs it always fails with CUDA out of memory. Can anyone let me know what the issue is and how it can be resolved?

Sarthak2426 commented on Jul 18 '23

Hello, I am using a Google Colab Tesla T4 GPU with 15 GB of memory and training on 7k images for 50 epochs, but after 30 epochs it always fails with CUDA out of memory. Can anyone let me know what the issue is and how it can be resolved?

May I know what dataset you use? Did you use the COCO pretrained model? I also use a Colab T4 to train; it took 3 hours for the 1st epoch and 7 hours for the 2nd. I don't know why the 2nd epoch takes more time.

mrgreen3325 commented on Sep 13 '23

Hello, using the OTA loss increases GPU memory consumption and can cause runtime errors by exhausting CUDA memory. You should just set loss_ota to 0 (loss_ota: 0) in your hyperparameter .yaml file (hyp.scratch.custom.yaml if you use the default).
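If you prefer to make the change from a script rather than editing the file by hand, here is a minimal sketch. The path assumes the default custom hyp file; point it at whatever you pass via --hyp, and note that rewriting the file this way drops any comments in it:

```python
# Minimal sketch: set loss_ota to 0 in a YOLOv7 hyperparameter file.
# Assumes the default custom hyp file; adjust the path to the file you pass via --hyp.
import yaml

hyp_path = "data/hyp.scratch.custom.yaml"

with open(hyp_path) as f:
    hyp = yaml.safe_load(f)

hyp["loss_ota"] = 0  # disable the OTA loss branch; training should then fall back to the plain, cheaper loss

with open(hyp_path, "w") as f:
    yaml.safe_dump(hyp, f, sort_keys=False)
```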

YCAyca commented on Jan 22 '24