mx-maskrcnn
src/storage/./pooled_storage_manager.h:102: cudaMalloc failed: out of memory
When generating RPN detections after training RPN1, the process crashed. The error message is shown below:
Traceback (most recent call last):
File "train_alternate_mask_fpn.py", line 116, in <module>
main()
File "train_alternate_mask_fpn.py", line 113, in main
args.rcnn_epoch, args.rcnn_lr, args.rcnn_lr_step)
File "train_alternate_mask_fpn.py", line 39, in alternate_train
vis=False, shuffle=False, thresh=0)
File "/home/jiawenhe/workspace/mx-maskrcnn/rcnn/tools/test_rpn.py", line 60, in test_rpn
arg_params=arg_params, aux_params=aux_params)
File "/home/jiawenhe/workspace/mx-maskrcnn/rcnn/core/tester.py", line 22, in __init__
self._mod.bind(provide_data, provide_label, for_training=False)
File "/home/jiawenhe/workspace/mx-maskrcnn/rcnn/core/module.py", line 141, in bind
force_rebind=False, shared_module=None)
File "/usr/local/lib/python2.7/dist-packages/mxnet-0.12.0-py2.7.egg/mxnet/module/module.py", line 417, in bind
state_names=self._state_names)
File "/usr/local/lib/python2.7/dist-packages/mxnet-0.12.0-py2.7.egg/mxnet/module/executor_group.py", line 231, in __init__
self.bind_exec(data_shapes, label_shapes, shared_group)
File "/usr/local/lib/python2.7/dist-packages/mxnet-0.12.0-py2.7.egg/mxnet/module/executor_group.py", line 327, in bind_exec
shared_group))
File "/usr/local/lib/python2.7/dist-packages/mxnet-0.12.0-py2.7.egg/mxnet/module/executor_group.py", line 603, in _bind_ith_exec
shared_buffer=shared_data_arrays, **input_shapes)
File "/usr/local/lib/python2.7/dist-packages/mxnet-0.12.0-py2.7.egg/mxnet/symbol/symbol.py", line 1491, in simple_bind
raise RuntimeError(error_msg)
RuntimeError: simple_bind error. Arguments:
data: (1, 3, 1024, 2048)
im_info: (1, 3L)
[21:01:05] src/storage/./pooled_storage_manager.h:102: cudaMalloc failed: out of memory
I am using 4 TITAN Xp GPUs with 1 image per GPU. I do not know where the problem is.
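For context on why binding at full Cityscapes resolution (1, 3, 1024, 2048) can tip even a 12 GB card over the edge, here is a rough, illustrative float32 activation estimate; the layer widths are assumptions for a ResNet-style backbone, not the exact mx-maskrcnn network:

```python
# Rough, illustrative arithmetic only: float32 activation sizes for a single
# 1024x2048 image. Channel counts are assumed (ResNet-style stem), not taken
# from the actual mx-maskrcnn symbol.
BYTES_PER_FLOAT = 4

def feature_map_mb(channels, height, width):
    """Size of one float32 feature map in MiB."""
    return channels * height * width * BYTES_PER_FLOAT / 1024.0 ** 2

print(feature_map_mb(64, 512, 1024))   # stride-2 stem output: ~128 MiB
print(feature_map_mb(256, 256, 512))   # stride-4 stage output: ~128 MiB
# Dozens of maps like these, plus cuDNN workspace, accumulate quickly; binding
# a second executor while the RPN training executors still occupy the GPU can
# therefore run out of memory even with one image per GPU.
```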
Hi, @LeonJWH Have you tried to resume the training from this step?
@Zehaos I took your advice and tried resuming training from this step, and it is going well so far.
Did this problem also occur during your training? What causes it?
@LeonJWH Hi, I encountered this problem during my training. How did you resume training from this step?
Hi @Zehaos, I've hit the same error. I tried reducing BATCH_ROIS from 128 to 64, but the error persists.
@zpp13 @chenmyzju I just commented out the code for training RPN1 and ran bash scripts/train_alternate.sh to resume training from the RPN-detection generation step. You should also kill any leftover processes on your GPU; sometimes the GPU memory is not released after the RPN training process finishes (see the sketch below for checking this).
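A minimal sketch for checking whether stale workers are still holding GPU memory, assuming nvidia-smi is on PATH and supports the --query-compute-apps interface; any lingering python PIDs can then be killed by hand before resuming:

```python
# Minimal sketch: list compute processes that still hold GPU memory.
# Assumes nvidia-smi is on PATH and supports --query-compute-apps.
import subprocess

out = subprocess.check_output(
    ["nvidia-smi",
     "--query-compute-apps=pid,process_name,used_gpu_memory",
     "--format=csv,noheader"])
for line in out.decode().splitlines():
    print(line)  # e.g. "12345, python, 10843 MiB" -> kill -9 12345 if stale
```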
Have you guys checked this other duplicated configuration outside of config.py?
@KaiyuYue Yes, setting a smaller BATCH_ROIS reduces GPU memory usage when training the RCNN, but it also lowers the final performance. See my fork at https://github.com/LeonJWH/mx-maskrcnn.
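For reference, the knob discussed here is TRAIN.BATCH_ROIS in the repo's config. A hedged sketch of lowering it programmatically, assuming the config object is importable as rcnn.config.config as in related mx-rcnn code bases (editing rcnn/config.py directly has the same effect):

```python
# Hedged sketch: trade RCNN accuracy for GPU memory by sampling fewer ROIs.
# Assumes the repo exposes an edict-style config at rcnn/config.py, as in
# related mx-rcnn code bases; otherwise edit BATCH_ROIS in config.py directly.
from rcnn.config import config

config.TRAIN.BATCH_ROIS = 64  # thread default is 128; smaller values cut the
                              # RCNN head's memory but also its final accuracy
```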
I encountered the "out of memory" problem during the "# TRAIN RCNN WITH IMAGENET INIT AND RPN DETECTION" step. The error message is:
DeprecationWarning: Numeric-style type codes are deprecated and will result in an error in the future.
label.append(labels[self.label.index('rcnn_label_stride%s' % s)].asnumpy().reshape((-1,)).astype('Int32'))
Traceback (most recent call last):
File "train_alternate_mask_fpn.py", line 163, in
I changed BATCH_ROIS from 128 to 32, but it did not help. Does anybody know how to deal with it?
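As an aside, the DeprecationWarning in that log is unrelated to the out-of-memory error: NumPy is flagging the Numeric-style dtype string 'Int32' in the quoted loader line, and the lower-case spelling avoids it. A small illustration (the loader file itself is not shown in the log):

```python
import numpy as np

a = np.arange(4)
print(a.astype('int32'))  # standard lower-case dtype string: no warning
# a.astype('Int32')       # Numeric-style type code: DeprecationWarning on the
#                         # NumPy version in the log, an error on newer ones
```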
Solved the problem by killing some "stopped" python processes and changing the ROIs setting to a smaller value.
@zhuaa What ROI value did you use after modifying it? I have also hit this problem, and changing BATCH_ROIS to 64 did not solve it.
@zzw1123 Did you solve the problem? I changed TRAIN.BATCH_ROIS to 8 and it still didn't work.