Training aborts when saving checkpoint after epoch 1
Hi, I am currently trying to train the network on the S3DIS dataset using the td3d_is_s3dis-3d-5class config.
Training runs fine for all steps in epoch 1. At the end of the epoch, when the checkpoint is saved, GPU memory usage suddenly jumps from roughly 8-9 GB to 18 GB, and the run eventually fails when it reaches the 24 GB limit.
Is this a known issue?
2023-05-16 14:05:55,895 - mmdet - INFO - workflow: [('train', 1)], max: 33 epochs
2023-05-16 14:05:55,895 - mmdet - INFO - Checkpoints will be saved to /mmdetection3d/tools/work_dirs/td3d_is_s3dis-3d-5class by HardDiskBackend.
/usr/local/lib/python3.7/dist-packages/MinkowskiEngine/MinkowskiSparseTensor.py:298: UserWarning: coordinates implicitly converted to torch.IntTensor. To remove this warning, use `.int()` to convert the coords into an torch.IntTensor
+ "coords into an torch.IntTensor"
2023-05-16 14:06:42,654 - mmdet - INFO - Epoch [1][50/663] lr: 1.000e-03, eta: 5:40:12, time: 0.935, data_time: 0.290, memory: 7883, bbox_loss: 0.8248, cls_loss: 0.7217, inst_loss: 0.7607, loss: 2.3071, grad_norm: 1.5957
2023-05-16 14:07:15,048 - mmdet - INFO - Epoch [1][100/663] lr: 1.000e-03, eta: 4:47:17, time: 0.648, data_time: 0.016, memory: 7883, bbox_loss: 0.7341, cls_loss: 0.3994, inst_loss: 0.6373, loss: 1.7708, grad_norm: 0.9642
2023-05-16 14:07:50,540 - mmdet - INFO - Epoch [1][150/663] lr: 1.000e-03, eta: 4:36:46, time: 0.710, data_time: 0.035, memory: 8027, bbox_loss: 0.7062, cls_loss: 0.3581, inst_loss: 0.6261, loss: 1.6904, grad_norm: 1.1923
2023-05-16 14:08:25,190 - mmdet - INFO - Epoch [1][200/663] lr: 1.000e-03, eta: 4:29:42, time: 0.693, data_time: 0.014, memory: 8027, bbox_loss: 0.6692, cls_loss: 0.3358, inst_loss: 0.6145, loss: 1.6194, grad_norm: 1.0767
2023-05-16 14:09:01,773 - mmdet - INFO - Epoch [1][250/663] lr: 1.000e-03, eta: 4:28:00, time: 0.732, data_time: 0.023, memory: 8027, bbox_loss: 0.6513, cls_loss: 0.3226, inst_loss: 0.6042, loss: 1.5781, grad_norm: 1.2070
2023-05-16 14:09:39,756 - mmdet - INFO - Epoch [1][300/663] lr: 1.000e-03, eta: 4:28:21, time: 0.760, data_time: 0.015, memory: 8027, bbox_loss: 0.6300, cls_loss: 0.3100, inst_loss: 0.5524, loss: 1.4923, grad_norm: 1.2423
2023-05-16 14:10:18,196 - mmdet - INFO - Epoch [1][350/663] lr: 1.000e-03, eta: 4:28:53, time: 0.769, data_time: 0.015, memory: 8027, bbox_loss: 0.6168, cls_loss: 0.3033, inst_loss: 0.5165, loss: 1.4367, grad_norm: 1.2490
2023-05-16 14:11:00,874 - mmdet - INFO - Epoch [1][400/663] lr: 1.000e-03, eta: 4:32:56, time: 0.854, data_time: 0.056, memory: 8638, bbox_loss: 0.6106, cls_loss: 0.2944, inst_loss: 0.5128, loss: 1.4178, grad_norm: 1.3136
2023-05-16 14:11:40,923 - mmdet - INFO - Epoch [1][450/663] lr: 1.000e-03, eta: 4:33:49, time: 0.801, data_time: 0.017, memory: 8638, bbox_loss: 0.6041, cls_loss: 0.2857, inst_loss: 0.4876, loss: 1.3774, grad_norm: 1.3142
2023-05-16 14:12:23,333 - mmdet - INFO - Epoch [1][500/663] lr: 1.000e-03, eta: 4:36:05, time: 0.848, data_time: 0.021, memory: 8638, bbox_loss: 0.5784, cls_loss: 0.2747, inst_loss: 0.4711, loss: 1.3242, grad_norm: 1.2854
2023-05-16 14:13:04,558 - mmdet - INFO - Epoch [1][550/663] lr: 1.000e-03, eta: 4:37:03, time: 0.824, data_time: 0.014, memory: 8638, bbox_loss: 0.5704, cls_loss: 0.2632, inst_loss: 0.4488, loss: 1.2824, grad_norm: 1.2698
2023-05-16 14:13:47,635 - mmdet - INFO - Epoch [1][600/663] lr: 1.000e-03, eta: 4:38:49, time: 0.862, data_time: 0.025, memory: 8638, bbox_loss: 0.5713, cls_loss: 0.2618, inst_loss: 0.4385, loss: 1.2715, grad_norm: 1.3437
2023-05-16 14:14:31,222 - mmdet - INFO - Epoch [1][650/663] lr: 1.000e-03, eta: 4:40:30, time: 0.872, data_time: 0.055, memory: 8638, bbox_loss: 0.5565, cls_loss: 0.2557, inst_loss: 0.4479, loss: 1.2601, grad_norm: 1.3006
2023-05-16 14:14:41,589 - mmdet - INFO - Saving checkpoint at 1 epochs
[>>>>>> ] 9/68, 1.0 task/s, elapsed: 9s, ETA: 60s
Traceback (most recent call last):
File "train.py", line 263, in <module>
main()
File "train.py", line 259, in main
meta=meta)
File "/usr/local/lib/python3.7/dist-packages/mmdet3d/apis/train.py", line 351, in train_model
meta=meta)
File "/usr/local/lib/python3.7/dist-packages/mmdet3d/apis/train.py", line 319, in train_detector
runner.run(data_loaders, cfg.workflow)
File "/usr/local/lib/python3.7/dist-packages/mmcv/runner/epoch_based_runner.py", line 136, in run
epoch_runner(data_loaders[i], **kwargs)
File "/usr/local/lib/python3.7/dist-packages/mmcv/runner/epoch_based_runner.py", line 58, in train
self.call_hook('after_train_epoch')
File "/usr/local/lib/python3.7/dist-packages/mmcv/runner/base_runner.py", line 317, in call_hook
getattr(hook, fn_name)(self)
File "/usr/local/lib/python3.7/dist-packages/mmcv/runner/hooks/evaluation.py", line 271, in after_train_epoch
self._do_evaluate(runner)
File "/usr/local/lib/python3.7/dist-packages/mmdet/core/evaluation/eval_hooks.py", line 56, in _do_evaluate
results = single_gpu_test(runner.model, self.dataloader, show=False)
File "/usr/local/lib/python3.7/dist-packages/mmdet/apis/test.py", line 29, in single_gpu_test
result = model(return_loss=False, rescale=True, **data)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/mmcv/parallel/data_parallel.py", line 51, in forward
return super().forward(*inputs, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/parallel/data_parallel.py", line 166, in forward
return self.module(*inputs[0], **kwargs[0])
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/mmcv/runner/fp16_utils.py", line 116, in new_func
return old_func(*args, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/mmdet3d/models/detectors/base.py", line 62, in forward
return self.forward_test(**kwargs)
File "/usr/local/lib/python3.7/dist-packages/mmdet3d/models/detectors/base.py", line 43, in forward_test
return self.simple_test(points[0], img_metas[0], img[0], **kwargs)
File "/usr/local/lib/python3.7/dist-packages/mmdet3d/models/detectors/td3d_instance_segmentor.py", line 122, in simple_test
instances = self.head.forward_test(x, field, img_metas)
File "/usr/local/lib/python3.7/dist-packages/mmdet3d/models/decode_heads/td3d_instance_head.py", line 556, in forward_test
cls_preds, idxs, v2r, r2scene, rois, scores, labels = self._forward_second(x[0], src_idxs, bbox_list)
File "/usr/local/lib/python3.7/dist-packages/mmdet3d/models/decode_heads/td3d_instance_head.py", line 222, in _forward_second
preds = self.unet(feats).features
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/mmdet3d/models/backbones/mink_unet.py", line 225, in forward
out = self.conv0p1s1(x)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/MinkowskiEngine/MinkowskiConvolution.py", line 321, in forward
input._manager,
File "/usr/local/lib/python3.7/dist-packages/MinkowskiEngine/MinkowskiConvolution.py", line 84, in forward
coordinate_manager._manager,
MemoryError: std::bad_alloc: cudaErrorMemoryAllocation: out of memory
Sorry, I had rebuilt the container and forgot to change the pre_nms and other score settings before rerunning. With those changed, it now finishes the first epoch and starts training the second.
However, after 5-6 epochs the issue reappears, even with the changed parameters. Is this a known problem? Should I further reduce the number of NMS samples or increase the threshold?
[>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>] 68/68, 3.1 task/s, elapsed: 22s, ETA: 0s
2023-05-16 15:50:54,410 - mmdet - INFO -
+----------+---------+---------+--------+-----------+----------+
| classes | AP_0.25 | AP_0.50 | AP | Prec_0.50 | Rec_0.50 |
+----------+---------+---------+--------+-----------+----------+
| ceiling | 0.5899 | 0.5179 | 0.3701 | 0.9302 | 0.5263 |
| floor | 0.8962 | 0.8286 | 0.7023 | 0.8730 | 0.8088 |
| wall | 0.5721 | 0.4045 | 0.2014 | 0.5205 | 0.5190 |
| beam | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| column | 0.3088 | 0.2377 | 0.1446 | 0.6053 | 0.3108 |
| window | 0.1490 | 0.1490 | 0.0531 | 0.8889 | 0.1538 |
| door | 0.6984 | 0.6834 | 0.5071 | 0.9770 | 0.6693 |
| table | 0.4224 | 0.2576 | 0.1388 | 0.6857 | 0.3117 |
| chair | 0.9384 | 0.9210 | 0.8083 | 0.9717 | 0.9302 |
| sofa | 0.5455 | 0.5455 | 0.3061 | 1.0000 | 0.5455 |
| bookcase | 0.4565 | 0.2911 | 0.1721 | 0.5385 | 0.4194 |
| board | 0.5236 | 0.5093 | 0.4697 | 0.8148 | 0.5238 |
| clutter | 0.4568 | 0.3821 | 0.2337 | 0.7205 | 0.4311 |
+----------+---------+---------+--------+-----------+----------+
| Overall | 0.5044 | 0.4406 | 0.3160 | 0.7328 | 0.4730 |
+----------+---------+---------+--------+-----------+----------+
2023-05-16 15:50:54,413 - mmdet - INFO - Exp name: td3d_is_s3dis-3d-5class.py
2023-05-16 15:50:54,414 - mmdet - INFO - Epoch(val) [6][68] all_ap: 0.3160, all_ap_50%: 0.4406, all_ap_25%: 0.5044, all_prec_50%: 0.7328, all_rec_50%: 0.4730, classes: {'ceiling': {'ap': 0.37012966028620914, 'ap50%': 0.5178716581786341, 'ap25%': 0.589861934779915, 'prec50%': 0.9302325581395349, 'rec50%': 0.5263157894736842}, 'floor': {'ap': 0.7022695748658188, 'ap50%': 0.8285625926119786, 'ap25%': 0.8961839231506277, 'prec50%': 0.873015873015873, 'rec50%': 0.8088235294117647}, 'wall': {'ap': 0.20140254348893574, 'ap50%': 0.40445546441820884, 'ap25%': 0.5720736753000741, 'prec50%': 0.52046783625731, 'rec50%': 0.5189504373177842}, 'beam': {'ap': 0.0, 'ap50%': 0.0, 'ap25%': 0.0, 'prec50%': 0.0, 'rec50%': 0.0}, 'column': {'ap': 0.14461537125452123, 'ap50%': 0.23767443530830934, 'ap25%': 0.3087914307529978, 'prec50%': 0.6052631578947368, 'rec50%': 0.3108108108108108}, 'window': {'ap': 0.05314238230904897, 'ap50%': 0.14900030525030525, 'ap25%': 0.14900030525030525, 'prec50%': 0.8888888888888888, 'rec50%': 0.15384615384615385}, 'door': {'ap': 0.5070772356583534, 'ap50%': 0.6833775032505273, 'ap25%': 0.6983608153362725, 'prec50%': 0.9770114942528736, 'rec50%': 0.6692913385826772}, 'table': {'ap': 0.13884904515659613, 'ap50%': 0.25758335386982195, 'ap25%': 0.42238330807761115, 'prec50%': 0.6857142857142857, 'rec50%': 0.3116883116883117}, 'chair': {'ap': 0.8082939448763926, 'ap50%': 0.9209676907755906, 'ap25%': 0.9384037375013576, 'prec50%': 0.97165991902834, 'rec50%': 0.9302325581395349}, 'sofa': {'ap': 0.30606060606060603, 'ap50%': 0.5454545454545453, 'ap25%': 0.5454545454545453, 'prec50%': 1.0, 'rec50%': 0.5454545454545454}, 'bookcase': {'ap': 0.17214900191686572, 'ap50%': 0.2910532021030779, 'ap25%': 0.45651666158842336, 'prec50%': 0.5384615384615384, 'rec50%': 0.41935483870967744}, 'board': {'ap': 0.4696715600329424, 'ap50%': 0.5092907607216478, 'ap25%': 0.5235747815846161, 'prec50%': 0.8148148148148148, 'rec50%': 0.5238095238095238}, 'clutter': {'ap': 
0.23370086962040698, 'ap50%': 0.38210035927469743, 'ap25%': 0.45677949765744164, 'prec50%': 0.720508166969147, 'rec50%': 0.43105320304017375}}
2023-05-16 15:51:48,423 - mmdet - INFO - Epoch [7][50/663] lr: 1.000e-03, eta: 4:26:33, time: 1.080, data_time: 0.173, memory: 9835, bbox_loss: 0.3948, cls_loss: 0.1208, inst_loss: 0.2057, loss: 0.7212, grad_norm: 1.4133
2023-05-16 15:52:33,724 - mmdet - INFO - Epoch [7][100/663] lr: 1.000e-03, eta: 4:25:51, time: 0.906, data_time: 0.014, memory: 9835, bbox_loss: 0.3877, cls_loss: 0.1168, inst_loss: 0.2013, loss: 0.7058, grad_norm: 1.4559
2023-05-16 15:53:24,287 - mmdet - INFO - Epoch [7][150/663] lr: 1.000e-03, eta: 4:25:31, time: 1.011, data_time: 0.040, memory: 9835, bbox_loss: 0.3912, cls_loss: 0.1191, inst_loss: 0.2120, loss: 0.7223, grad_norm: 1.3516
Traceback (most recent call last):
File "train.py", line 263, in <module>
main()
File "train.py", line 259, in main
meta=meta)
File "/usr/local/lib/python3.7/dist-packages/mmdet3d/apis/train.py", line 351, in train_model
meta=meta)
File "/usr/local/lib/python3.7/dist-packages/mmdet3d/apis/train.py", line 319, in train_detector
runner.run(data_loaders, cfg.workflow)
File "/usr/local/lib/python3.7/dist-packages/mmcv/runner/epoch_based_runner.py", line 136, in run
epoch_runner(data_loaders[i], **kwargs)
File "/usr/local/lib/python3.7/dist-packages/mmcv/runner/epoch_based_runner.py", line 53, in train
self.run_iter(data_batch, train_mode=True, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/mmcv/runner/epoch_based_runner.py", line 32, in run_iter
**kwargs)
File "/usr/local/lib/python3.7/dist-packages/mmcv/parallel/data_parallel.py", line 77, in train_step
return self.module.train_step(*inputs[0], **kwargs[0])
File "/usr/local/lib/python3.7/dist-packages/mmdet/models/detectors/base.py", line 248, in train_step
losses = self(**data)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/mmcv/runner/fp16_utils.py", line 116, in new_func
return old_func(*args, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/mmdet3d/models/detectors/base.py", line 60, in forward
return self.forward_train(**kwargs)
File "/usr/local/lib/python3.7/dist-packages/mmdet3d/models/detectors/td3d_instance_segmentor.py", line 105, in forward_train
pts_semantic_mask, pts_instance_mask, img_metas)
File "/usr/local/lib/python3.7/dist-packages/mmdet3d/models/decode_heads/td3d_instance_head.py", line 427, in forward_train
cls_preds, targets, v2r, r2scene, rois, scores, gt_idxs = self._forward_second(x[0], targets, assigned_bbox_list)
File "/usr/local/lib/python3.7/dist-packages/mmdet3d/models/decode_heads/td3d_instance_head.py", line 222, in _forward_second
preds = self.unet(feats).features
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/mmdet3d/models/backbones/mink_unet.py", line 280, in forward
out = self.block8(out)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/container.py", line 141, in forward
input = module(input)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/MinkowskiEngine/modules/resnet_block.py", line 55, in forward
out = self.conv1(x)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/MinkowskiEngine/MinkowskiConvolution.py", line 321, in forward
input._manager,
File "/usr/local/lib/python3.7/dist-packages/MinkowskiEngine/MinkowskiConvolution.py", line 84, in forward
coordinate_manager._manager,
RuntimeError: CUDA out of memory. Tried to allocate 196.00 MiB (GPU 0; 23.65 GiB total capacity; 6.71 GiB already allocated; 193.31 MiB free; 7.24 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
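The numbers in this last error message can be read off with a little arithmetic. The sketch below parses the message and computes how much device memory is in use but not reserved by PyTorch (total minus free minus reserved). The function name `parse_oom` is hypothetical, and treating the remainder as memory held outside PyTorch's caching allocator (other processes, the CUDA context, or libraries that allocate device memory directly) is an interpretation, not something the log confirms.

```python
import re

# Conversion factors to GiB for the units that appear in the message.
_TO_GIB = {"MiB": 1.0 / 1024.0, "GiB": 1.0}

# The relevant part of the error message quoted above.
MESSAGE = (
    "CUDA out of memory. Tried to allocate 196.00 MiB "
    "(GPU 0; 23.65 GiB total capacity; 6.71 GiB already allocated; "
    "193.31 MiB free; 7.24 GiB reserved in total by PyTorch)"
)


def parse_oom(message):
    """Extract memory figures (in GiB) from a PyTorch OOM message.

    Hypothetical helper: it only understands the message layout shown
    above and raises ValueError for anything else.
    """
    pattern = (
        r"([\d.]+) (MiB|GiB) total capacity; "
        r"([\d.]+) (MiB|GiB) already allocated; "
        r"([\d.]+) (MiB|GiB) free; "
        r"([\d.]+) (MiB|GiB) reserved"
    )
    match = re.search(pattern, message)
    if match is None:
        raise ValueError("unrecognised OOM message format")
    groups = match.groups()
    total, allocated, free, reserved = (
        float(groups[i]) * _TO_GIB[groups[i + 1]] for i in range(0, 8, 2)
    )
    return {
        "total": total,
        "allocated": allocated,
        "free": free,
        "reserved": reserved,
        # In use on the device, but not reserved by PyTorch's allocator.
        "outside_pytorch": total - free - reserved,
    }


stats = parse_oom(MESSAGE)
print(f"outside PyTorch's allocator: {stats['outside_pytorch']:.2f} GiB")
```

If that remainder is large compared with the 7.24 GiB PyTorch has reserved, tuning `max_split_size_mb` alone is unlikely to help much, and it is worth checking what else is holding memory on the GPU.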
Hi, @meyerjo. This looks a little strange, but here are several recommendations that might help you avoid the problem:
- check that there are no other processes using GPU memory during training
- change batch size to `3` in the line: https://github.com/SamsungLabs/td3d/blob/fd4b4d4335353cead5287bb7a7c604c20602c543/configs/td3d_is/td3d_is_s3dis-3d-5class.py#L132
- change unet type to `MinkUNet14A` in the line: https://github.com/SamsungLabs/td3d/blob/fd4b4d4335353cead5287bb7a7c604c20602c543/configs/td3d_is/td3d_is_s3dis-3d-5class.py#L25

You can use them independently or together. The last two recommendations should reduce memory consumption, but may also slightly reduce the metrics.
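
The two config changes above could also be applied as overrides on a loaded config rather than by editing the file. This is a minimal sketch on a plain dictionary, not the actual TD3D config: the key paths `data.samples_per_gpu` and `model.head.unet.type` are assumptions based on the usual MMDetection config layout and should be checked against the linked lines, and the helper name `apply_low_memory_overrides` is made up for illustration.

```python
from copy import deepcopy


def apply_low_memory_overrides(cfg, batch_size=3, unet_type="MinkUNet14A"):
    """Return a copy of cfg with the two memory-saving changes applied.

    Both key paths below are assumed, not taken from the real config.
    """
    new_cfg = deepcopy(cfg)  # leave the caller's config untouched
    new_cfg["data"]["samples_per_gpu"] = batch_size      # assumed batch-size key
    new_cfg["model"]["head"]["unet"]["type"] = unet_type  # assumed U-Net key
    return new_cfg


# Toy stand-in for the real config; the starting values are placeholders,
# not the values in td3d_is_s3dis-3d-5class.py.
toy_cfg = {
    "data": {"samples_per_gpu": 0},
    "model": {"head": {"unet": {"type": "PlaceholderUNet"}}},
}

patched = apply_low_memory_overrides(toy_cfg)
print(patched["data"]["samples_per_gpu"], patched["model"]["head"]["unet"]["type"])
```

Either override can be applied alone by passing the other argument as its current value, which matches the note that the recommendations work independently or together.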