
src/storage/./pooled_storage_manager.h:102: cudaMalloc failed: out of memory

Open wenhe-jia opened this issue 7 years ago • 12 comments

When generating RPN detections after training RPN1, the process crashed. The error message is shown below:

Traceback (most recent call last):
  File "train_alternate_mask_fpn.py", line 116, in <module>
    main()
  File "train_alternate_mask_fpn.py", line 113, in main
    args.rcnn_epoch, args.rcnn_lr, args.rcnn_lr_step)
  File "train_alternate_mask_fpn.py", line 39, in alternate_train
    vis=False, shuffle=False, thresh=0)
  File "/home/jiawenhe/workspace/mx-maskrcnn/rcnn/tools/test_rpn.py", line 60, in test_rpn
    arg_params=arg_params, aux_params=aux_params)
  File "/home/jiawenhe/workspace/mx-maskrcnn/rcnn/core/tester.py", line 22, in __init__
    self._mod.bind(provide_data, provide_label, for_training=False)
  File "/home/jiawenhe/workspace/mx-maskrcnn/rcnn/core/module.py", line 141, in bind
    force_rebind=False, shared_module=None)
  File "/usr/local/lib/python2.7/dist-packages/mxnet-0.12.0-py2.7.egg/mxnet/module/module.py", line 417, in bind
    state_names=self._state_names)
  File "/usr/local/lib/python2.7/dist-packages/mxnet-0.12.0-py2.7.egg/mxnet/module/executor_group.py", line 231, in __init__
    self.bind_exec(data_shapes, label_shapes, shared_group)
  File "/usr/local/lib/python2.7/dist-packages/mxnet-0.12.0-py2.7.egg/mxnet/module/executor_group.py", line 327, in bind_exec
    shared_group))
  File "/usr/local/lib/python2.7/dist-packages/mxnet-0.12.0-py2.7.egg/mxnet/module/executor_group.py", line 603, in _bind_ith_exec
    shared_buffer=shared_data_arrays, **input_shapes)
  File "/usr/local/lib/python2.7/dist-packages/mxnet-0.12.0-py2.7.egg/mxnet/symbol/symbol.py", line 1491, in simple_bind
    raise RuntimeError(error_msg)
RuntimeError: simple_bind error. Arguments:
data: (1, 3, 1024, 2048)
im_info: (1, 3L)
[21:01:05] src/storage/./pooled_storage_manager.h:102: cudaMalloc failed: out of memory

I am using 4 TITAN Xp GPUs with 1 image per GPU. I do not know where the problem is.
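
For scale, a rough sketch of the bound input blob size from the shapes in the traceback (assuming float32):

# Only the input tensor; the FPN feature maps and workspace allocated
# during bind are much larger than this.
n, c, h, w = 1, 3, 1024, 2048
print(n * c * h * w * 4 / 1024.0 ** 2, "MiB")   # ~24 MiB for the input alone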

wenhe-jia avatar Nov 17 '17 15:11 wenhe-jia

Hi @LeonJWH, have you tried resuming the training from this step?

Zehaos avatar Nov 18 '17 05:11 Zehaos

@Zehaos I took your advice and resumed training from this step, and it has been going well so far.
Did this problem also occur during your training? What causes it?

wenhe-jia avatar Nov 20 '17 02:11 wenhe-jia

@LeonJWH Hi, I encountered this problem during my training. How did you resume training from this step?

zpp13 avatar Nov 22 '17 06:11 zpp13

@LeonJWH Hi, I encountered this problem during my training. How did you resume training from this step?

chenmyzju avatar Nov 23 '17 08:11 chenmyzju

Hi @Zehaos, I've hit the same error. I tried reducing BATCH_ROIS from 128 to 64, but I still get the error.

chenmyzju avatar Nov 24 '17 06:11 chenmyzju

@zpp13 @chenmyzju I just commented out the code for training RPN1 and ran bash scripts/train_alternate.sh to resume training from the RPN detection generation step. You should also kill any processes still running on your GPU; sometimes GPU memory is not released after the RPN training process finishes.
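
A quick way to spot such leftover processes (a sketch assuming nvidia-smi is on PATH) is:

import subprocess

# List processes still holding GPU memory; kill any stale python ones
# (e.g. `kill -9 <pid>`) before rerunning scripts/train_alternate.sh.
out = subprocess.check_output(
    ["nvidia-smi", "--query-compute-apps=pid,process_name,used_memory",
     "--format=csv,noheader"])
print(out.decode())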

wenhe-jia avatar Nov 26 '17 02:11 wenhe-jia

Have you checked whether there is another, duplicated configuration outside of config.py?

kaiyuyue avatar Dec 01 '17 15:12 kaiyuyue

@KaiyuYue Yes, setting a smaller BATCH_ROIS reduces GPU memory usage when training the RCNN, but it also lowers the final performance. See the repo https://github.com/LeonJWH/mx-maskrcnn.
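
The change being discussed looks roughly like this (attribute names as in rcnn/config.py of this repo; verify against your checkout):

from rcnn.config import config

# A smaller BATCH_ROIS lowers GPU memory usage during RCNN training,
# at the cost of some final accuracy.
config.TRAIN.BATCH_ROIS = 64   # default is 128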

wenhe-jia avatar Jan 03 '18 07:01 wenhe-jia

I encountered the "out of memory" problem during "# TRAIN RCNN WITH IMAGENET INIT AND RPN DETECTION". The error message is:

DeprecationWarning: Numeric-style type codes are deprecated and will result in an error in the future.
  label.append(labels[self.label.index('rcnn_label_stride%s' % s)].asnumpy().reshape((-1,)).astype('Int32'))
Traceback (most recent call last):
  File "train_alternate_mask_fpn.py", line 163, in <module>
    main()
  File "train_alternate_mask_fpn.py", line 160, in main
    args.rcnn_epoch, args.rcnn_lr, args.rcnn_lr_step)
  File "train_alternate_mask_fpn.py", line 93, in alternate_train
    train_shared=False, lr=rcnn_lr, lr_step=rcnn_lr_step, proposal='rpn', maskrcnn_stage='rcnn1')
  File "/home/wp/maskrcnn/mx-maskrcnn-master/rcnn/tools/train_maskrcnn.py", line 208, in train_maskrcnn
    arg_params=arg_params, aux_params=aux_params, begin_epoch=begin_epoch, num_epoch=end_epoch)
  File "./incubator-mxnet/python/mxnet/module/base_module.py", line 496, in fit
    self.update_metric(eval_metric, data_batch.label)
  File "/home/wp/maskrcnn/mx-maskrcnn-master/rcnn/core/module.py", line 210, in update_metric
    self._curr_module.update_metric(eval_metric, labels)
  File "./incubator-mxnet/python/mxnet/module/module.py", line 749, in update_metric
    self.exec_group.update_metric(eval_metric, labels)
  File "./incubator-mxnet/python/mxnet/module/executor_group.py", line 616, in update_metric
    eval_metric.update_dict(labels, preds)
  File "./incubator-mxnet/python/mxnet/metric.py", line 304, in update_dict
    metric.update_dict(labels, preds)
  File "./incubator-mxnet/python/mxnet/metric.py", line 132, in update_dict
    self.update(label, pred)
  File "/home/wp/maskrcnn/mx-maskrcnn-master/rcnn/core/metric.py", line 73, in update
    pred_label = pred.asnumpy().reshape(-1, last_dim).argmax(axis=1).astype('int32')
  File "./incubator-mxnet/python/mxnet/ndarray/ndarray.py", line 1826, in asnumpy
    ctypes.c_size_t(data.size)))
  File "./incubator-mxnet/python/mxnet/base.py", line 149, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [10:47:51] src/storage/./pooled_storage_manager.h:108: cudaMalloc failed: out of memory

I changed BATCH_ROIS from 128 to 32, but it didn't help. Does anybody know how to deal with this?

zhuaa avatar Mar 28 '18 02:03 zhuaa

Solved the problem by killing some stopped Python processes and reducing BATCH_ROIS further.

zhuaa avatar Mar 28 '18 08:03 zhuaa

@zhuaa What value did you set BATCH_ROIS to? I have also hit this problem, and changing BATCH_ROIS to 64 did not solve it.

zzw1123 avatar Dec 20 '18 10:12 zzw1123

@zzw1123 Did you solve the problem? I changed TRAIN.BATCH_ROIS to 8 and it still didn't work.

thomasyue avatar Apr 30 '19 23:04 thomasyue