
some error for retina

Open Tveek opened this issue 5 years ago • 8 comments

Running

  python3 detection_train.py --config config/NASFPN/retina_r50v1b_nasfpn_640_7@256_25epoch.py

gives a strange error:

  File "detection_train.py", line 278, in <module>
    train_net(parse_args())
  File "detection_train.py", line 260, in train_net
    profile=profile
  File "/simpledet/core/detection_module.py", line 1009, in fit
    self.update_metric(eval_metric, data_batch.label)
  File "/simpledet/core/detection_module.py", line 789, in update_metric
    self.exec_group.update_metric(eval_metric, labels, pre_sliced)
  File "/usr/local/lib/python3.5/dist-packages/mxnet-1.5.0-py3.5.egg/mxnet/module/executor_group.py", line 640, in update_metric
    eval_metric.update_dict(labels, preds)
  File "/usr/local/lib/python3.5/dist-packages/mxnet-1.5.0-py3.5.egg/mxnet/metric.py", line 350, in update_dict
    metric.update_dict(labels, preds)
  File "/usr/local/lib/python3.5/dist-packages/mxnet-1.5.0-py3.5.egg/mxnet/metric.py", line 133, in update_dict
    self.update(label, pred)
  File "*/simpledet/models/retinanet/metric.py", line 34, in update
    pred_label = pred_label.asnumpy().astype('int32')
  File "/usr/local/lib/python3.5/dist-packages/mxnet-1.5.0-py3.5.egg/mxnet/ndarray/ndarray.py", line 1996, in asnumpy
    ctypes.c_size_t(data.size)))
  File "/usr/local/lib/python3.5/dist-packages/mxnet-1.5.0-py3.5.egg/mxnet/base.py", line 253, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [09:54:30] src/operator/contrib/./focal_loss-inl.h:158: Check failed: allocated_bytes < WORKSPACE_LIMIT (1627797600 vs. 1572864000) : Allocating more memory than workspace limit

Tveek, Aug 16 '19 09:08

We'll fix this later by allocating memory dynamically. As a temporary workaround, you can replace 1500 with a larger number, maybe 1800, in models/retinanet/builder.py line 314.
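The two numbers in the error message can be used to sanity-check a candidate value. The following is a minimal sketch, assuming the workspace value in builder.py is expressed in MiB (the reported limit of 1572864000 bytes equals 1500 × 1024 × 1024, which is consistent with that assumption). The helper name `min_workspace_mib` is hypothetical and not part of simpledet or MXNet.

```python
import re

MIB = 1024 * 1024  # bytes per MiB; assumed unit of the workspace setting


def min_workspace_mib(error_message):
    """Return (smallest passing workspace in MiB, current limit in MiB).

    Parses the "(allocated vs. limit)" pair from an MXNet check failure
    such as "Check failed: allocated_bytes < WORKSPACE_LIMIT (A vs. B)".
    """
    match = re.search(r"\((\d+) vs\. (\d+)\)", error_message)
    if match is None:
        raise ValueError("no '(allocated vs. limit)' pair found")
    allocated, limit = int(match.group(1)), int(match.group(2))
    # The check is strict (allocated < limit), so round down and add one MiB.
    needed = allocated // MIB + 1
    return needed, limit // MIB


message = (
    "Check failed: allocated_bytes < WORKSPACE_LIMIT "
    "(1627797600 vs. 1572864000) : Allocating more memory than workspace limit"
)
needed, current = min_workspace_mib(message)
print("current limit:", current, "MiB; minimum to pass the check:", needed, "MiB")
```

For the error in this issue the sketch reports a current limit of 1500 MiB and a minimum of 1553 MiB, so 1800 clears this particular check. It says nothing about whether the GPU can actually provide that much memory.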

xchani, Aug 16 '19 11:08

@xchani That is not a good solution; it introduces new errors. The workspace cannot be more than 2000, otherwise I get the error below:

mxnet.base.MXNetError: [13:57:18] pathto/mxnet/3rdparty/mshadow/mshadow/././././cuda/tensor_gpu-inl.cuh:110: Check failed: err == cudaSuccess (2 vs. 0) : Name: MapPlanKernel ErrStr:out of memory

But I have 16 GB of memory per GPU, which cannot be fully used.

Tveek, Aug 16 '19 14:08

Hi @Tveek, could you please share more information about your software environment, such as how you installed MXNet?
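One way to gather this kind of information in a single step is sketched below. It assumes `nvcc` and `nvidia-smi` are on the PATH when CUDA and the driver are installed; the function name `collect_env` is hypothetical, and the script only reports what it can find.

```python
import importlib
import platform
import shutil
import subprocess


def collect_env():
    """Collect Python, MXNet, CUDA toolkit and driver details as strings."""
    info = {"python": platform.python_version()}

    # MXNet exposes its version as mxnet.__version__.
    try:
        info["mxnet"] = importlib.import_module("mxnet").__version__
    except Exception:  # not installed, or its native library failed to load
        info["mxnet"] = "not available"

    tools = {
        "nvcc": ["--version"],
        "nvidia-smi": [
            "--query-gpu=name,driver_version,memory.total",
            "--format=csv,noheader",
        ],
    }
    for tool, args in tools.items():
        path = shutil.which(tool)
        if path is None:
            info[tool] = "not found"
            continue
        try:
            result = subprocess.run(
                [path] + args,
                stdout=subprocess.PIPE,
                stderr=subprocess.STDOUT,
                universal_newlines=True,
                timeout=30,
            )
            info[tool] = result.stdout.strip()
        except (OSError, subprocess.SubprocessError) as exc:
            info[tool] = "failed: {}".format(exc)
    return info


if __name__ == "__main__":
    for key, value in collect_env().items():
        print("{}: {}".format(key, value))
```

Pasting the printed output into a reply covers the MXNet version, CUDA toolkit version, driver version, and per-GPU memory in one go.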

RogerChern, Aug 17 '19 02:08

@RogerChern I followed "Install MXNet from Scratch" in the simpledet install guide. Software versions: mxnet=1.5.0, CUDA=8.0.61, nvidia-driver=375.26. However, other networks (such as dcn, efficientnet, faster, tridentnet) run normally. My dataset is not COCO (it has 200+ classes).

Tveek, Aug 17 '19 08:08

@Tveek Well, this seems to be a bug in upstream MXNet caused by dropping support for the old runtime. Currently, upgrading the CUDA version seems to be the only solution.

RogerChern, Aug 17 '19 10:08

@RogerChern What is the minimum MXNet version required for simpledet?

Tveek, Aug 17 '19 12:08

It is probably not a problem with the MXNet version. Using the registry.cn-beijing.aliyuncs.com/rogerchen/simpledet:cuda10 image also raises the above problem.

Tveek, Aug 17 '19 14:08

Bug confirmed. It seems that if we allocate more than 2000M of workspace, MXNet always raises OOM. @xchani

RogerChern, Aug 18 '19 06:08