yolov7
Issues that may lead to CUDA out of memory after several epochs
Hello,
I am looking for input on factors that may lead to an out-of-memory error after several epochs of training. This, to me, is unusual, as it suggests there isn't an issue with the batch size (the most likely culprit) and the error is instead triggered by some other condition. I seem to be experiencing the issue between 15 and 50 epochs in, and my error log indicates the problem is in the loss computation, specifically line 732 of loss.py ("pair_wise_cls_loss = F.binary_cross_entropy_with_logits(...") called from compute_loss_ota (error below).
RuntimeError: CUDA out of memory. Tried to allocate 334.00 MiB (GPU 0; 11.78 GiB total capacity; 9.41 GiB already allocated; 152.06 MiB free; 10.45 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
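(As an aside, the allocator setting the error message points to can be experimented with via the PYTORCH_CUDA_ALLOC_CONF environment variable; a minimal sketch, where the 128 MB value is only an example to try, not a recommendation:)

```python
# Minimal sketch: try the allocator option the error message suggests.
# The variable must be set before the first CUDA allocation, so set it before
# any torch.cuda call (or export it in the shell before launching train.py).
# The 128 MB value is only an example to experiment with.
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch
if torch.cuda.is_available():
    torch.zeros(1, device="cuda")  # first CUDA allocation happens with the setting active
```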
The error, to me, suggests that one of two things is happening:

1. A stochastic bug brought about by a random conflux of factors, i.e. several images that shouldn't be batched together are, by random chance, grouped together, causing the data in memory to spike above the capacity of my GPUs. I am running a test with a fixed random seed to see if I can identify the factors that lead the system to break. If this is the case, I am curious whether the following factors can influence memory consumption and whether there are any steps to mitigate them:
   i. Image size - I believe images are being resized to 1280x1280 by the system, but would the full image be stored in memory anyway?
   ii. Number of detections - several of my images have many (read: >50) detections. Are all of these held in memory during training? Is there a way to limit the number of detections for a mosaic-ed image?

2. Some data is being kept in memory from iteration to iteration and accumulating over epochs until it overflows. I have aimed to mitigate this using gc.collect() and torch.cuda.empty_cache() after each call to del ckpt in train.py (see the sketch after this list), but this has not helped the situation. Is there something else I should consider?
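To be concrete, the cleanup mentioned in point 2 looks roughly like this; a minimal sketch of the pattern rather than the exact train.py diff (the tensor below is only a stand-in for the checkpoint dict so the snippet runs on its own):

```python
import gc
import torch

# Stand-in for the checkpoint dict that train.py builds each epoch
# (hypothetical contents, just so the sketch is runnable on its own).
ckpt = {"model": torch.randn(1000, 1000, device="cuda")} if torch.cuda.is_available() else {"model": None}

# ... in train.py the checkpoint is saved to disk at this point ...

del ckpt                  # drop the last Python reference to the checkpoint
gc.collect()              # let Python reclaim the now-unreferenced objects
torch.cuda.empty_cache()  # return cached, unused blocks to the CUDA driver
```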
For completeness, here is the log from YOLOv7 with the garbage collection mentioned in point 2. As can be seen, the GPU memory usage fluctuates over the course of the run, from 2.62G (epoch 0) all the way up to 10.2G (epoch 13).
0/999 2.62G 0.0723 0.05062 0.01789 0.1408 23 640: 100%|██████████| 3056/3056 [13:26<00:00, 3.79it/s] Class Images Labels P R mAP@.5 mAP@.5:.95: 100%|██████████| 156/156 [00:43<00:00, 3.58it/s]
Epoch gpu_mem box obj cls total labels img_size
all 1866 108712 0.839 0.0697 0.0438 0.0122
1/999 6.14G 0.05347 0.05427 0.01555 0.1233 24 640: 100%|██████████| 3056/3056 [12:39<00:00, 4.02it/s]
Class Images Labels P R mAP@.5 mAP@.5:.95: 100%|██████████| 156/156 [00:42<00:00, 3.68it/s]
Epoch gpu_mem box obj cls total labels img_size
all 1866 108712 0.861 0.086 0.0664 0.0215
2/999 10.1G 0.04922 0.05519 0.01405 0.1185 12 640: 100%|██████████| 3056/3056 [12:58<00:00, 3.93it/s]
Class Images Labels P R mAP@.5 mAP@.5:.95: 100%|██████████| 156/156 [00:42<00:00, 3.66it/s]
Epoch gpu_mem box obj cls total labels img_size
all 1866 108712 0.683 0.134 0.0998 0.033
3/999 8.52G 0.04699 0.05211 0.01198 0.1111 2 640: 100%|██████████| 3056/3056 [12:48<00:00, 3.98it/s]
Class Images Labels P R mAP@.5 mAP@.5:.95: 100%|██████████| 156/156 [00:38<00:00, 4.09it/s]
Epoch gpu_mem box obj cls total labels img_size
all 1866 108712 0.189 0.222 0.13 0.0435
4/999 8.1G 0.04564 0.05211 0.01069 0.1084 138 640: 100%|██████████| 3056/3056 [12:51<00:00, 3.96it/s]
Class Images Labels P R mAP@.5 mAP@.5:.95: 100%|██████████| 156/156 [00:34<00:00, 4.47it/s]
Epoch gpu_mem box obj cls total labels img_size
all 1866 108712 0.255 0.238 0.145 0.0475
5/999 10.5G 0.04494 0.05207 0.009186 0.1062 39 640: 100%|██████████| 3056/3056 [12:33<00:00, 4.05it/s]
Class Images Labels P R mAP@.5 mAP@.5:.95: 100%|██████████| 156/156 [00:34<00:00, 4.55it/s]
Epoch gpu_mem box obj cls total labels img_size
all 1866 108712 0.273 0.247 0.155 0.0513
6/999 8.05G 0.04418 0.05132 0.00865 0.1042 138 640: 100%|██████████| 3056/3056 [12:31<00:00, 4.07it/s]
Class Images Labels P R mAP@.5 mAP@.5:.95: 100%|██████████| 156/156 [00:34<00:00, 4.50it/s]
Epoch gpu_mem box obj cls total labels img_size
all 1866 108712 0.305 0.265 0.176 0.0583
7/999 8.85G 0.04366 0.05048 0.007873 0.102 8 640: 100%|██████████| 3056/3056 [12:38<00:00, 4.03it/s]
Class Images Labels P R mAP@.5 mAP@.5:.95: 100%|██████████| 156/156 [00:32<00:00, 4.77it/s]
Epoch gpu_mem box obj cls total labels img_size
all 1866 108712 0.321 0.284 0.197 0.0662
8/999 7.94G 0.04329 0.05041 0.007737 0.1014 57 640: 100%|██████████| 3056/3056 [12:43<00:00, 4.00it/s]
Class Images Labels P R mAP@.5 mAP@.5:.95: 100%|██████████| 156/156 [00:32<00:00, 4.82it/s]
Epoch gpu_mem box obj cls total labels img_size
all 1866 108712 0.321 0.278 0.192 0.0627
9/999 7.44G 0.04283 0.04934 0.007326 0.09949 38 640: 100%|██████████| 3056/3056 [12:34<00:00, 4.05it/s]
Class Images Labels P R mAP@.5 mAP@.5:.95: 100%|██████████| 156/156 [00:33<00:00, 4.70it/s]
Epoch gpu_mem box obj cls total labels img_size
all 1866 108712 0.342 0.278 0.204 0.0664
10/999 8.55G 0.04274 0.04987 0.006887 0.0995 83 640: 100%|██████████| 3056/3056 [12:33<00:00, 4.06it/s]
Class Images Labels P R mAP@.5 mAP@.5:.95: 100%|██████████| 156/156 [00:32<00:00, 4.86it/s]
Epoch gpu_mem box obj cls total labels img_size
all 1866 108712 0.339 0.302 0.217 0.0765
11/999 5.6G 0.04246 0.04969 0.006769 0.09893 27 640: 100%|██████████| 3056/3056 [12:35<00:00, 4.04it/s]
Class Images Labels P R mAP@.5 mAP@.5:.95: 100%|██████████| 156/156 [00:31<00:00, 4.93it/s]
Epoch gpu_mem box obj cls total labels img_size
all 1866 108712 0.349 0.28 0.205 0.0659
12/999 7.49G 0.04211 0.05023 0.00652 0.09885 12 640: 100%|██████████| 3056/3056 [12:32<00:00, 4.06it/s]
Class Images Labels P R mAP@.5 mAP@.5:.95: 100%|██████████| 156/156 [00:30<00:00, 5.13it/s]
Epoch gpu_mem box obj cls total labels img_size
all 1866 108712 0.394 0.289 0.228 0.0768
13/999 10.2G 0.0418 0.05011 0.006518 0.09842 16 640: 100%|██████████| 3056/3056 [12:28<00:00, 4.08it/s]
Class Images Labels P R mAP@.5 mAP@.5:.95: 100%|██████████| 156/156 [00:30<00:00, 5.07it/s]
Epoch gpu_mem box obj cls total labels img_size
all 1866 108712 0.364 0.297 0.223 0.0745
14/999 9.97G 0.04174 0.04989 0.006495 0.09812 148 640: 100%|██████████| 3056/3056 [12:48<00:00, 3.98it/s]
Class Images Labels P R mAP@.5 mAP@.5:.95: 100%|██████████| 156/156 [00:31<00:00, 4.95it/s]
Epoch gpu_mem box obj cls total labels img_size
all 1866 108712 0.369 0.3 0.227 0.0763
15/999 7.21G 0.04143 0.04982 0.006139 0.09739 23 640: 100%|██████████| 3056/3056 [12:37<00:00, 4.03it/s]
Class Images Labels P R mAP@.5 mAP@.5:.95: 100%|██████████| 156/156 [00:31<00:00, 4.91it/s]
Epoch gpu_mem box obj cls total labels img_size
all 1866 108712 0.36 0.312 0.237 0.0819
16/999 8.75G 0.0418 0.06775 0.00714 0.1167 444 640: 33%|███▎ | 1001/3056 [04:15<08:45, 3.91it/s]
Any suggestions would be appreciated!
Also, I have iteratively lowered the batch size from 16 down to 4. The code runs for slightly more epochs each time, but training is noticeably slower and it doesn't address the fundamental issue causing this problem: I still end up with a CUDA out of memory error.
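(In case it helps anyone debugging the same thing: one way to distinguish a genuine epoch-to-epoch leak from a one-off spike is to log the peak CUDA allocation after every epoch; a hedged sketch, with train_one_epoch standing in for the existing YOLOv7 loop body:)

```python
import torch

def log_epoch_memory(epoch: int) -> None:
    """Print peak and current CUDA memory for the finished epoch, then reset the peak counter."""
    peak = torch.cuda.max_memory_allocated() / 2**30
    current = torch.cuda.memory_allocated() / 2**30
    reserved = torch.cuda.memory_reserved() / 2**30
    print(f"epoch {epoch}: peak {peak:.2f} GiB, allocated {current:.2f} GiB, reserved {reserved:.2f} GiB")
    torch.cuda.reset_peak_memory_stats()

# Hypothetical usage inside the training loop:
# for epoch in range(epochs):
#     train_one_epoch(...)          # existing YOLOv7 loop body (placeholder name)
#     log_epoch_memory(epoch)       # "allocated" rising steadily between epochs points to a leak
```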
Hi Kranklyboy, thank you for your input. Unfortunately, I am already doing this and have included garbage collection after the del ckpt commands on lines 228 and 479 of train.py. Did you place the garbage collection commands anywhere else?
No, I did not add any additional GC commands.
As for your point 1.i., I am confident that the original full-size image is not stored in memory anywhere. My dataset uses large images as well, but in my case it is configured to use 640 px squares. If the original size were kept, an overflow would be guaranteed on my hardware.
Regarding 1.ii., my dataset of around 70k images contains about 220 that have more than 50 detections; the image with the most detections has 180. Although the percentage of images with more than 50 detections is quite low, if that many detections were a problem, I think I would get the OOM error as well.
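(If you want to double-check this on your own dataset, counting label lines per image is a quick way to see how many crowded images you have; a rough sketch assuming the usual YOLO one-line-per-object .txt labels and a hypothetical labels folder:)

```python
from pathlib import Path

# Hedged sketch: count objects per image by counting lines in YOLO-format
# label files (one "class x y w h" line per object). The directory below is
# an assumption; point it at your own labels folder.
label_dir = Path("dataset/labels/train")

counts = {p.name: sum(1 for line in p.read_text().splitlines() if line.strip())
          for p in label_dir.glob("*.txt")}

crowded = {name: n for name, n in counts.items() if n > 50}
print(f"{len(crowded)} of {len(counts)} images have more than 50 detections")
if counts:
    busiest = max(counts, key=counts.get)
    print(f"most crowded: {busiest} with {counts[busiest]} detections")
```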
For reference, here is the log of the first 20 epochs:
0/299 7.49G
1/299 5.66G
2/299 5.9G
3/299 5.87G
4/299 5.9G
5/299 5.93G
6/299 5.86G
7/299 5.86G
8/299 5.85G
9/299 5.86G
10/299 5.9G
11/299 5.86G
12/299 5.86G
13/299 5.86G
14/299 5.9G
15/299 5.9G
16/299 5.87G
17/299 5.91G
18/299 5.87G
19/299 5.91G
Did you try training the model with 640 px?
EDIT: Your log says img_size = 640, but you specified 1280 earlier. Which one is it?
I am training with an img_size of 640, but the images in my dataset are currently 1280 px (from what I gather from your response, this is a non-issue).
I have increased my batch size back up to 16 and I am not seeing any noticeable change in behaviour (i.e. it is not crashing any sooner or later than before). I no longer consider the data size or quantity to be the issue here; it may be a problem with my YOLOv7 download or with my platform. I'll try tearing it all down and rebuilding everything to see if I can fix the issue.
@MClarkTurner Good day! Did you figure out what the problem was? If yes, could you share details about it?
Hello, I am using a Google Colab Tesla T4 GPU with 15 GB of memory and training on 7k images for 50 epochs, but after 30 epochs it always says CUDA out of memory. Can anyone let me know what the issue is and how it can be resolved?
May I know what dataset you use? Did you use the COCO pretrained model? I also use a Colab T4 to train; it took 3 hrs for the 1st epoch and 7 hrs for the 2nd. I don't know why the 2nd epoch takes more time.
Hello, using the OTA loss increases GPU memory usage and can cause runtime errors by exhausting CUDA memory. You should just set loss_ota to 0 (loss_ota: 0) in your hyperparameter .yaml file (hyp.scratch.custom.yaml if you use the default).
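For anyone following this suggestion, the change is a single key in the hyperparameter file; a hedged sketch that patches it from Python (the path below assumes the default hyp.scratch.custom.yaml location, so adjust it to the file you pass with --hyp; note that round-tripping through yaml drops the file's comments, so editing the line by hand works just as well):

```python
import yaml

hyp_path = "data/hyp.scratch.custom.yaml"  # assumed location; use the file you pass to train.py with --hyp

with open(hyp_path) as f:
    hyp = yaml.safe_load(f)

hyp["loss_ota"] = 0  # disable the OTA loss branch, as suggested above

with open(hyp_path, "w") as f:
    yaml.safe_dump(hyp, f, sort_keys=False)
```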