
src/storage/./pooled_storage_manager.h:102: cudaMalloc failed: out of memory

Open wenhe-jia opened this issue 7 years ago • 12 comments

When generating RPN detections after training RPN1, the process crashed. The error message is shown below:

Traceback (most recent call last):
  File "train_alternate_mask_fpn.py", line 116, in <module>
    main()
  File "train_alternate_mask_fpn.py", line 113, in main
    args.rcnn_epoch, args.rcnn_lr, args.rcnn_lr_step)
  File "train_alternate_mask_fpn.py", line 39, in alternate_train
    vis=False, shuffle=False, thresh=0)
  File "/home/jiawenhe/workspace/mx-maskrcnn/rcnn/tools/test_rpn.py", line 60, in test_rpn
    arg_params=arg_params, aux_params=aux_params)
  File "/home/jiawenhe/workspace/mx-maskrcnn/rcnn/core/tester.py", line 22, in __init__
    self._mod.bind(provide_data, provide_label, for_training=False)
  File "/home/jiawenhe/workspace/mx-maskrcnn/rcnn/core/module.py", line 141, in bind
    force_rebind=False, shared_module=None)
  File "/usr/local/lib/python2.7/dist-packages/mxnet-0.12.0-py2.7.egg/mxnet/module/module.py", line 417, in bind
    state_names=self._state_names)
  File "/usr/local/lib/python2.7/dist-packages/mxnet-0.12.0-py2.7.egg/mxnet/module/executor_group.py", line 231, in __init__
    self.bind_exec(data_shapes, label_shapes, shared_group)
  File "/usr/local/lib/python2.7/dist-packages/mxnet-0.12.0-py2.7.egg/mxnet/module/executor_group.py", line 327, in bind_exec
    shared_group))
  File "/usr/local/lib/python2.7/dist-packages/mxnet-0.12.0-py2.7.egg/mxnet/module/executor_group.py", line 603, in _bind_ith_exec
    shared_buffer=shared_data_arrays, **input_shapes)
  File "/usr/local/lib/python2.7/dist-packages/mxnet-0.12.0-py2.7.egg/mxnet/symbol/symbol.py", line 1491, in simple_bind
    raise RuntimeError(error_msg)
RuntimeError: simple_bind error. Arguments:
data: (1, 3, 1024, 2048)
im_info: (1, 3L)
[21:01:05] src/storage/./pooled_storage_manager.h:102: cudaMalloc failed: out of memory

I am using 4 TITAN Xp GPUs with 1 image per GPU. I do not know where the problem is.
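
For scale, a rough sketch of the bound input blob size from the shapes in the traceback (assuming float32):

# Only the input tensor; the FPN feature maps and workspace allocated
# during bind are much larger than this.
n, c, h, w = 1, 3, 1024, 2048
print(n * c * h * w * 4 / 1024.0 ** 2, "MiB")   # ~24 MiB for the input alone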

wenhe-jia avatar Nov 17 '17 15:11 wenhe-jia

Hi @LeonJWH, have you tried resuming the training from this step?

Zehaos avatar Nov 18 '17 05:11 Zehaos

@Zehaos I took your advice and resumed training from this step, and it has been going well so far.
Did this problem also occur during your training? What causes it?

wenhe-jia avatar Nov 20 '17 02:11 wenhe-jia

@LeonJWH Hi, I encountered this problem during my training. How did you resume training from this step?

zpp13 avatar Nov 22 '17 06:11 zpp13

@LeonJWH Hi, I encountered this problem during my training. How did you resume training from this step?

chenmyzju avatar Nov 23 '17 08:11 chenmyzju

Hi @Zehaos, I've hit the same error. I tried reducing BATCH_ROIS from 128 to 64, but I still get the error.

chenmyzju avatar Nov 24 '17 06:11 chenmyzju

@zpp13 @chenmyzju I just commented out the code for training RPN1 and ran bash scripts/train_alternate.sh to resume training from the RPN detection generation step. You should also kill any processes still running on your GPU; sometimes GPU memory is not released after the RPN training process finishes.
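
A quick way to spot such leftover processes (a sketch assuming nvidia-smi is on PATH) is:

import subprocess

# List processes still holding GPU memory; kill any stale python ones
# (e.g. `kill -9 <pid>`) before rerunning scripts/train_alternate.sh.
out = subprocess.check_output(
    ["nvidia-smi", "--query-compute-apps=pid,process_name,used_memory",
     "--format=csv,noheader"])
print(out.decode())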

wenhe-jia avatar Nov 26 '17 02:11 wenhe-jia

Have you checked whether there is another, duplicated configuration outside of config.py?

kaiyuyue avatar Dec 01 '17 15:12 kaiyuyue

@KaiyuYue Yes, setting a smaller BATCH_ROIS reduces GPU memory usage when training the RCNN, but it also lowers the final performance. See the repo https://github.com/LeonJWH/mx-maskrcnn.
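
The change being discussed looks roughly like this (attribute names as in rcnn/config.py of this repo; verify against your checkout):

from rcnn.config import config

# A smaller BATCH_ROIS lowers GPU memory usage during RCNN training,
# at the cost of some final accuracy.
config.TRAIN.BATCH_ROIS = 64   # default is 128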

wenhe-jia avatar Jan 03 '18 07:01 wenhe-jia

I encountered the "out of memory" problem during "# TRAIN RCNN WITH IMAGENET INIT AND RPN DETECTION". The error message is:

DeprecationWarning: Numeric-style type codes are deprecated and will result in an error in the future.
  label.append(labels[self.label.index('rcnn_label_stride%s' % s)].asnumpy().reshape((-1,)).astype('Int32'))
Traceback (most recent call last):
  File "train_alternate_mask_fpn.py", line 163, in <module>
    main()
  File "train_alternate_mask_fpn.py", line 160, in main
    args.rcnn_epoch, args.rcnn_lr, args.rcnn_lr_step)
  File "train_alternate_mask_fpn.py", line 93, in alternate_train
    train_shared=False, lr=rcnn_lr, lr_step=rcnn_lr_step, proposal='rpn', maskrcnn_stage='rcnn1')
  File "/home/wp/maskrcnn/mx-maskrcnn-master/rcnn/tools/train_maskrcnn.py", line 208, in train_maskrcnn
    arg_params=arg_params, aux_params=aux_params, begin_epoch=begin_epoch, num_epoch=end_epoch)
  File "./incubator-mxnet/python/mxnet/module/base_module.py", line 496, in fit
    self.update_metric(eval_metric, data_batch.label)
  File "/home/wp/maskrcnn/mx-maskrcnn-master/rcnn/core/module.py", line 210, in update_metric
    self._curr_module.update_metric(eval_metric, labels)
  File "./incubator-mxnet/python/mxnet/module/module.py", line 749, in update_metric
    self.exec_group.update_metric(eval_metric, labels)
  File "./incubator-mxnet/python/mxnet/module/executor_group.py", line 616, in update_metric
    eval_metric.update_dict(labels, preds)
  File "./incubator-mxnet/python/mxnet/metric.py", line 304, in update_dict
    metric.update_dict(labels, preds)
  File "./incubator-mxnet/python/mxnet/metric.py", line 132, in update_dict
    self.update(label, pred)
  File "/home/wp/maskrcnn/mx-maskrcnn-master/rcnn/core/metric.py", line 73, in update
    pred_label = pred.asnumpy().reshape(-1, last_dim).argmax(axis=1).astype('int32')
  File "./incubator-mxnet/python/mxnet/ndarray/ndarray.py", line 1826, in asnumpy
    ctypes.c_size_t(data.size)))
  File "./incubator-mxnet/python/mxnet/base.py", line 149, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [10:47:51] src/storage/./pooled_storage_manager.h:108: cudaMalloc failed: out of memory

I changed BATCH_ROIS from 128 to 32, but it didn't help. Does anybody know how to deal with this?

zhuaa avatar Mar 28 '18 02:03 zhuaa

Solved the problem by killing some stopped Python processes and reducing BATCH_ROIS further.

zhuaa avatar Mar 28 '18 08:03 zhuaa

@zhuaa What value did you set BATCH_ROIS to? I have also hit this problem, and changing BATCH_ROIS to 64 did not solve it.

zzw1123 avatar Dec 20 '18 10:12 zzw1123

@zzw1123 Did you solve the problem? I changed TRAIN.BATCH_ROIS to 8 and it still didn't work.

thomasyue avatar Apr 30 '19 23:04 thomasyue