td3d Training aborts when saving checkpoint after epoch 1

Hi, I am currently trying to train the network on the S3DIS dataset using the td3d_is_s3dis-3d-5class config.

Training runs fine for all steps in epoch 1. At the end of the epoch, when the checkpoint is saved, GPU memory usage suddenly jumps from ~8-9 GB to 18 GB, and the run eventually fails when it reaches the 24 GB limit.

Is this a known issue?

2023-05-16 14:05:55,895 - mmdet - INFO - workflow: [('train', 1)], max: 33 epochs
2023-05-16 14:05:55,895 - mmdet - INFO - Checkpoints will be saved to /mmdetection3d/tools/work_dirs/td3d_is_s3dis-3d-5class by HardDiskBackend.
/usr/local/lib/python3.7/dist-packages/MinkowskiEngine/MinkowskiSparseTensor.py:298: UserWarning: coordinates implicitly converted to torch.IntTensor. To remove this warning, use `.int()` to convert the coords into an torch.IntTensor
  + "coords into an torch.IntTensor"
2023-05-16 14:06:42,654 - mmdet - INFO - Epoch [1][50/663]	lr: 1.000e-03, eta: 5:40:12, time: 0.935, data_time: 0.290, memory: 7883, bbox_loss: 0.8248, cls_loss: 0.7217, inst_loss: 0.7607, loss: 2.3071, grad_norm: 1.5957
2023-05-16 14:07:15,048 - mmdet - INFO - Epoch [1][100/663]	lr: 1.000e-03, eta: 4:47:17, time: 0.648, data_time: 0.016, memory: 7883, bbox_loss: 0.7341, cls_loss: 0.3994, inst_loss: 0.6373, loss: 1.7708, grad_norm: 0.9642
2023-05-16 14:07:50,540 - mmdet - INFO - Epoch [1][150/663]	lr: 1.000e-03, eta: 4:36:46, time: 0.710, data_time: 0.035, memory: 8027, bbox_loss: 0.7062, cls_loss: 0.3581, inst_loss: 0.6261, loss: 1.6904, grad_norm: 1.1923
2023-05-16 14:08:25,190 - mmdet - INFO - Epoch [1][200/663]	lr: 1.000e-03, eta: 4:29:42, time: 0.693, data_time: 0.014, memory: 8027, bbox_loss: 0.6692, cls_loss: 0.3358, inst_loss: 0.6145, loss: 1.6194, grad_norm: 1.0767
2023-05-16 14:09:01,773 - mmdet - INFO - Epoch [1][250/663]	lr: 1.000e-03, eta: 4:28:00, time: 0.732, data_time: 0.023, memory: 8027, bbox_loss: 0.6513, cls_loss: 0.3226, inst_loss: 0.6042, loss: 1.5781, grad_norm: 1.2070
2023-05-16 14:09:39,756 - mmdet - INFO - Epoch [1][300/663]	lr: 1.000e-03, eta: 4:28:21, time: 0.760, data_time: 0.015, memory: 8027, bbox_loss: 0.6300, cls_loss: 0.3100, inst_loss: 0.5524, loss: 1.4923, grad_norm: 1.2423
2023-05-16 14:10:18,196 - mmdet - INFO - Epoch [1][350/663]	lr: 1.000e-03, eta: 4:28:53, time: 0.769, data_time: 0.015, memory: 8027, bbox_loss: 0.6168, cls_loss: 0.3033, inst_loss: 0.5165, loss: 1.4367, grad_norm: 1.2490
2023-05-16 14:11:00,874 - mmdet - INFO - Epoch [1][400/663]	lr: 1.000e-03, eta: 4:32:56, time: 0.854, data_time: 0.056, memory: 8638, bbox_loss: 0.6106, cls_loss: 0.2944, inst_loss: 0.5128, loss: 1.4178, grad_norm: 1.3136
2023-05-16 14:11:40,923 - mmdet - INFO - Epoch [1][450/663]	lr: 1.000e-03, eta: 4:33:49, time: 0.801, data_time: 0.017, memory: 8638, bbox_loss: 0.6041, cls_loss: 0.2857, inst_loss: 0.4876, loss: 1.3774, grad_norm: 1.3142
2023-05-16 14:12:23,333 - mmdet - INFO - Epoch [1][500/663]	lr: 1.000e-03, eta: 4:36:05, time: 0.848, data_time: 0.021, memory: 8638, bbox_loss: 0.5784, cls_loss: 0.2747, inst_loss: 0.4711, loss: 1.3242, grad_norm: 1.2854
2023-05-16 14:13:04,558 - mmdet - INFO - Epoch [1][550/663]	lr: 1.000e-03, eta: 4:37:03, time: 0.824, data_time: 0.014, memory: 8638, bbox_loss: 0.5704, cls_loss: 0.2632, inst_loss: 0.4488, loss: 1.2824, grad_norm: 1.2698
2023-05-16 14:13:47,635 - mmdet - INFO - Epoch [1][600/663]	lr: 1.000e-03, eta: 4:38:49, time: 0.862, data_time: 0.025, memory: 8638, bbox_loss: 0.5713, cls_loss: 0.2618, inst_loss: 0.4385, loss: 1.2715, grad_norm: 1.3437
2023-05-16 14:14:31,222 - mmdet - INFO - Epoch [1][650/663]	lr: 1.000e-03, eta: 4:40:30, time: 0.872, data_time: 0.055, memory: 8638, bbox_loss: 0.5565, cls_loss: 0.2557, inst_loss: 0.4479, loss: 1.2601, grad_norm: 1.3006
2023-05-16 14:14:41,589 - mmdet - INFO - Saving checkpoint at 1 epochs
[>>>>>>                                            ] 9/68, 1.0 task/s, elapsed: 9s, ETA:    60s
Traceback (most recent call last):
  File "train.py", line 263, in <module>
    main()
  File "train.py", line 259, in main
    meta=meta)
  File "/usr/local/lib/python3.7/dist-packages/mmdet3d/apis/train.py", line 351, in train_model
    meta=meta)
  File "/usr/local/lib/python3.7/dist-packages/mmdet3d/apis/train.py", line 319, in train_detector
    runner.run(data_loaders, cfg.workflow)
  File "/usr/local/lib/python3.7/dist-packages/mmcv/runner/epoch_based_runner.py", line 136, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/mmcv/runner/epoch_based_runner.py", line 58, in train
    self.call_hook('after_train_epoch')
  File "/usr/local/lib/python3.7/dist-packages/mmcv/runner/base_runner.py", line 317, in call_hook
    getattr(hook, fn_name)(self)
  File "/usr/local/lib/python3.7/dist-packages/mmcv/runner/hooks/evaluation.py", line 271, in after_train_epoch
    self._do_evaluate(runner)
  File "/usr/local/lib/python3.7/dist-packages/mmdet/core/evaluation/eval_hooks.py", line 56, in _do_evaluate
    results = single_gpu_test(runner.model, self.dataloader, show=False)
  File "/usr/local/lib/python3.7/dist-packages/mmdet/apis/test.py", line 29, in single_gpu_test
    result = model(return_loss=False, rescale=True, **data)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/mmcv/parallel/data_parallel.py", line 51, in forward
    return super().forward(*inputs, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/parallel/data_parallel.py", line 166, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/mmcv/runner/fp16_utils.py", line 116, in new_func
    return old_func(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/mmdet3d/models/detectors/base.py", line 62, in forward
    return self.forward_test(**kwargs)
  File "/usr/local/lib/python3.7/dist-packages/mmdet3d/models/detectors/base.py", line 43, in forward_test
    return self.simple_test(points[0], img_metas[0], img[0], **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/mmdet3d/models/detectors/td3d_instance_segmentor.py", line 122, in simple_test
    instances = self.head.forward_test(x, field, img_metas)
  File "/usr/local/lib/python3.7/dist-packages/mmdet3d/models/decode_heads/td3d_instance_head.py", line 556, in forward_test
    cls_preds, idxs, v2r, r2scene, rois, scores, labels = self._forward_second(x[0], src_idxs, bbox_list)
  File "/usr/local/lib/python3.7/dist-packages/mmdet3d/models/decode_heads/td3d_instance_head.py", line 222, in _forward_second
    preds = self.unet(feats).features
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/mmdet3d/models/backbones/mink_unet.py", line 225, in forward
    out = self.conv0p1s1(x)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/MinkowskiEngine/MinkowskiConvolution.py", line 321, in forward
    input._manager,
  File "/usr/local/lib/python3.7/dist-packages/MinkowskiEngine/MinkowskiConvolution.py", line 84, in forward
    coordinate_manager._manager,
MemoryError: std::bad_alloc: cudaErrorMemoryAllocation: out of memory

May 16 '23 14:05 meyerjo

Sorry, I had rebuilt the container and forgot to change the pre_nms and other score settings before rerunning it. Now it finishes the first epoch and starts training the second.

May 16 '23 15:05 meyerjo

Well, after 5-6 epochs the issue reappears even with the changed parameters. Is this a known problem? Should one further reduce the number of NMS samples or increase the threshold?

[>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>] 68/68, 3.1 task/s, elapsed: 22s, ETA:     0s
2023-05-16 15:50:54,410 - mmdet - INFO - 
+----------+---------+---------+--------+-----------+----------+
| classes  | AP_0.25 | AP_0.50 | AP     | Prec_0.50 | Rec_0.50 |
+----------+---------+---------+--------+-----------+----------+
| ceiling  | 0.5899  | 0.5179  | 0.3701 | 0.9302    | 0.5263   |
| floor    | 0.8962  | 0.8286  | 0.7023 | 0.8730    | 0.8088   |
| wall     | 0.5721  | 0.4045  | 0.2014 | 0.5205    | 0.5190   |
| beam     | 0.0000  | 0.0000  | 0.0000 | 0.0000    | 0.0000   |
| column   | 0.3088  | 0.2377  | 0.1446 | 0.6053    | 0.3108   |
| window   | 0.1490  | 0.1490  | 0.0531 | 0.8889    | 0.1538   |
| door     | 0.6984  | 0.6834  | 0.5071 | 0.9770    | 0.6693   |
| table    | 0.4224  | 0.2576  | 0.1388 | 0.6857    | 0.3117   |
| chair    | 0.9384  | 0.9210  | 0.8083 | 0.9717    | 0.9302   |
| sofa     | 0.5455  | 0.5455  | 0.3061 | 1.0000    | 0.5455   |
| bookcase | 0.4565  | 0.2911  | 0.1721 | 0.5385    | 0.4194   |
| board    | 0.5236  | 0.5093  | 0.4697 | 0.8148    | 0.5238   |
| clutter  | 0.4568  | 0.3821  | 0.2337 | 0.7205    | 0.4311   |
+----------+---------+---------+--------+-----------+----------+
| Overall  | 0.5044  | 0.4406  | 0.3160 | 0.7328    | 0.4730   |
+----------+---------+---------+--------+-----------+----------+
2023-05-16 15:50:54,413 - mmdet - INFO - Exp name: td3d_is_s3dis-3d-5class.py
2023-05-16 15:50:54,414 - mmdet - INFO - Epoch(val) [6][68]	all_ap: 0.3160, all_ap_50%: 0.4406, all_ap_25%: 0.5044, all_prec_50%: 0.7328, all_rec_50%: 0.4730, classes: {'ceiling': {'ap': 0.37012966028620914, 'ap50%': 0.5178716581786341, 'ap25%': 0.589861934779915, 'prec50%': 0.9302325581395349, 'rec50%': 0.5263157894736842}, 'floor': {'ap': 0.7022695748658188, 'ap50%': 0.8285625926119786, 'ap25%': 0.8961839231506277, 'prec50%': 0.873015873015873, 'rec50%': 0.8088235294117647}, 'wall': {'ap': 0.20140254348893574, 'ap50%': 0.40445546441820884, 'ap25%': 0.5720736753000741, 'prec50%': 0.52046783625731, 'rec50%': 0.5189504373177842}, 'beam': {'ap': 0.0, 'ap50%': 0.0, 'ap25%': 0.0, 'prec50%': 0.0, 'rec50%': 0.0}, 'column': {'ap': 0.14461537125452123, 'ap50%': 0.23767443530830934, 'ap25%': 0.3087914307529978, 'prec50%': 0.6052631578947368, 'rec50%': 0.3108108108108108}, 'window': {'ap': 0.05314238230904897, 'ap50%': 0.14900030525030525, 'ap25%': 0.14900030525030525, 'prec50%': 0.8888888888888888, 'rec50%': 0.15384615384615385}, 'door': {'ap': 0.5070772356583534, 'ap50%': 0.6833775032505273, 'ap25%': 0.6983608153362725, 'prec50%': 0.9770114942528736, 'rec50%': 0.6692913385826772}, 'table': {'ap': 0.13884904515659613, 'ap50%': 0.25758335386982195, 'ap25%': 0.42238330807761115, 'prec50%': 0.6857142857142857, 'rec50%': 0.3116883116883117}, 'chair': {'ap': 0.8082939448763926, 'ap50%': 0.9209676907755906, 'ap25%': 0.9384037375013576, 'prec50%': 0.97165991902834, 'rec50%': 0.9302325581395349}, 'sofa': {'ap': 0.30606060606060603, 'ap50%': 0.5454545454545453, 'ap25%': 0.5454545454545453, 'prec50%': 1.0, 'rec50%': 0.5454545454545454}, 'bookcase': {'ap': 0.17214900191686572, 'ap50%': 0.2910532021030779, 'ap25%': 0.45651666158842336, 'prec50%': 0.5384615384615384, 'rec50%': 0.41935483870967744}, 'board': {'ap': 0.4696715600329424, 'ap50%': 0.5092907607216478, 'ap25%': 0.5235747815846161, 'prec50%': 0.8148148148148148, 'rec50%': 0.5238095238095238}, 'clutter': {'ap': 0.23370086962040698, 'ap50%': 0.38210035927469743, 'ap25%': 0.45677949765744164, 'prec50%': 0.720508166969147, 'rec50%': 0.43105320304017375}}
2023-05-16 15:51:48,423 - mmdet - INFO - Epoch [7][50/663]	lr: 1.000e-03, eta: 4:26:33, time: 1.080, data_time: 0.173, memory: 9835, bbox_loss: 0.3948, cls_loss: 0.1208, inst_loss: 0.2057, loss: 0.7212, grad_norm: 1.4133
2023-05-16 15:52:33,724 - mmdet - INFO - Epoch [7][100/663]	lr: 1.000e-03, eta: 4:25:51, time: 0.906, data_time: 0.014, memory: 9835, bbox_loss: 0.3877, cls_loss: 0.1168, inst_loss: 0.2013, loss: 0.7058, grad_norm: 1.4559
2023-05-16 15:53:24,287 - mmdet - INFO - Epoch [7][150/663]	lr: 1.000e-03, eta: 4:25:31, time: 1.011, data_time: 0.040, memory: 9835, bbox_loss: 0.3912, cls_loss: 0.1191, inst_loss: 0.2120, loss: 0.7223, grad_norm: 1.3516
Traceback (most recent call last):
  File "train.py", line 263, in <module>
    main()
  File "train.py", line 259, in main
    meta=meta)
  File "/usr/local/lib/python3.7/dist-packages/mmdet3d/apis/train.py", line 351, in train_model
    meta=meta)
  File "/usr/local/lib/python3.7/dist-packages/mmdet3d/apis/train.py", line 319, in train_detector
    runner.run(data_loaders, cfg.workflow)
  File "/usr/local/lib/python3.7/dist-packages/mmcv/runner/epoch_based_runner.py", line 136, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/mmcv/runner/epoch_based_runner.py", line 53, in train
    self.run_iter(data_batch, train_mode=True, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/mmcv/runner/epoch_based_runner.py", line 32, in run_iter
    **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/mmcv/parallel/data_parallel.py", line 77, in train_step
    return self.module.train_step(*inputs[0], **kwargs[0])
  File "/usr/local/lib/python3.7/dist-packages/mmdet/models/detectors/base.py", line 248, in train_step
    losses = self(**data)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/mmcv/runner/fp16_utils.py", line 116, in new_func
    return old_func(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/mmdet3d/models/detectors/base.py", line 60, in forward
    return self.forward_train(**kwargs)
  File "/usr/local/lib/python3.7/dist-packages/mmdet3d/models/detectors/td3d_instance_segmentor.py", line 105, in forward_train
    pts_semantic_mask, pts_instance_mask, img_metas)
  File "/usr/local/lib/python3.7/dist-packages/mmdet3d/models/decode_heads/td3d_instance_head.py", line 427, in forward_train
    cls_preds, targets, v2r, r2scene, rois, scores, gt_idxs = self._forward_second(x[0], targets, assigned_bbox_list)
  File "/usr/local/lib/python3.7/dist-packages/mmdet3d/models/decode_heads/td3d_instance_head.py", line 222, in _forward_second
    preds = self.unet(feats).features
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/mmdet3d/models/backbones/mink_unet.py", line 280, in forward
    out = self.block8(out)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/container.py", line 141, in forward
    input = module(input)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/MinkowskiEngine/modules/resnet_block.py", line 55, in forward
    out = self.conv1(x)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/MinkowskiEngine/MinkowskiConvolution.py", line 321, in forward
    input._manager,
  File "/usr/local/lib/python3.7/dist-packages/MinkowskiEngine/MinkowskiConvolution.py", line 84, in forward
    coordinate_manager._manager,
RuntimeError: CUDA out of memory. Tried to allocate 196.00 MiB (GPU 0; 23.65 GiB total capacity; 6.71 GiB already allocated; 193.31 MiB free; 7.24 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

May 16 '23 16:05 meyerjo

Hi, @meyerjo. This looks a bit strange. Here are several recommendations that might help you avoid the problem:

Check that no other processes are using GPU memory during training.
Change the batch size to 3 in this line: https://github.com/SamsungLabs/td3d/blob/fd4b4d4335353cead5287bb7a7c604c20602c543/configs/td3d_is/td3d_is_s3dis-3d-5class.py#L132
Change the unet type to MinkUNet14A in this line: https://github.com/SamsungLabs/td3d/blob/fd4b4d4335353cead5287bb7a7c604c20602c543/configs/td3d_is/td3d_is_s3dis-3d-5class.py#L25

You can apply these independently or together. The last two recommendations should reduce memory consumption, but may also slightly reduce the metrics.
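
The two config changes can be pictured as a pair of overrides. This is a minimal sketch, assuming the config follows the usual mmdet-style layout where the batch size lives at `data.samples_per_gpu` and the U-Net is configured under `model.head.unet`. Those key paths, the placeholder starting values, and the helper `apply_overrides` are assumptions for illustration, so check them against the linked lines before relying on them.

```python
import copy
from typing import Any, Dict


def apply_overrides(cfg: Dict[str, Any], overrides: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of cfg with dotted-path overrides applied."""
    result = copy.deepcopy(cfg)
    for dotted_key, value in overrides.items():
        *parents, leaf = dotted_key.split(".")
        node = result
        for name in parents:
            node = node[name]  # raises KeyError if the assumed path is wrong
        if leaf not in node:
            raise KeyError(f"{dotted_key} does not exist in the config")
        node[leaf] = value
    return result


# Stand-in for the relevant parts of the config. The key layout is assumed
# and the values are placeholders, not the repository defaults.
cfg = {
    "model": {"head": {"unet": {"type": "PLACEHOLDER_UNET"}}},
    "data": {"samples_per_gpu": None},
}

low_memory_cfg = apply_overrides(
    cfg,
    {
        "data.samples_per_gpu": 3,              # batch size suggested above
        "model.head.unet.type": "MinkUNet14A",  # U-Net variant suggested above
    },
)
print(low_memory_cfg)
```

In practice the same effect comes from editing the two linked lines of the config file directly; the sketch only makes explicit which two values change and that they can be changed independently.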

May 16 '23 19:05 col14m