Debug info: NaN at Train-LogLoss and Train-BBOX_MSE
Hi Seanlinx,
I followed your instructions to generate training data for the three CNN networks (P_Net, R_Net, O_Net) from the WIDER_FACE training dataset. Everything seemed to work well since no errors appeared.
However, when I use demo.py (with my own trained model) to run a test on some photos, an error appears:
Called with argument:
Namespace(batch_size=[2048, 256, 16], epoch=[16, 16, 16], gpu_id=-1, min_face=40, prefix=['model/pnet', 'model/rnet', 'model/onet'], slide_window=False, stride=2, thresh=[0.5, 0.5, 0.7])
/Users/dhuynh/Documents/TestCode/mtcnn/core/MtcnnDetector.py:357: RuntimeWarning: invalid value encountered in greater
keep_inds = np.where(cls_scores > self.thresh[1])[0]
Traceback (most recent call last):
File "demo.py", line 94, in <module>
args.stride, args.slide_window)
File "demo.py", line 48, in test_net
boxes, boxes_c = mtcnn_detector.detect_onet(img, boxes_c)
File "/Users/dhuynh/Documents/TestCode/mtcnn/core/MtcnnDetector.py", line 391, in detect_onet
dets = self.convert_to_square(dets)
File "/Users/dhuynh/Documents/TestCode/mtcnn/core/MtcnnDetector.py", line 47, in convert_to_square
square_bbox = bbox.copy()
AttributeError: 'NoneType' object has no attribute 'copy'
I checked the debug info from the training phase and found that Train-LogLoss=nan and Train-BBOX_MSE=nan all the time:
INFO:root:Epoch[2] Batch [1600] Speed: 4544.95 samples/sec Train-Accuracy=0.813565
INFO:root:Epoch[2] Batch [1600] Speed: 4544.95 samples/sec Train-LogLoss=nan
INFO:root:Epoch[2] Batch [1600] Speed: 4544.95 samples/sec Train-BBOX_MSE=nan
Your trained model still works perfectly, so it seems the error stems from my training phase, but I cannot figure out what I did wrong. Could you please help me out? Thanks.
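(A side note on how the two are probably connected: with NaN scores, the comparison at MtcnnDetector.py:357 triggers that RuntimeWarning and keeps no boxes, so the next stage apparently receives None and convert_to_square crashes. A minimal numpy illustration with made-up scores:)

import numpy as np

cls_scores = np.array([np.nan, np.nan])       # the kind of scores a NaN-trained net produces
keep_inds = np.where(cls_scores > 0.5)[0]     # may warn: invalid value encountered in greater
print(keep_inds)                              # -> empty: every box is dropped, later stages get nothing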
@huynhthedang I got the same problem. The cls_prob and pred_delta become very large (~10^24) after two data batches.
Did you make any changes to the code?
I only changed the path to my data, and I modified mxnet/src/regression_output-inl.h according to mxnet_diff.patch.
@Seanlinx I tried to inspect every op's input and output, but I haven't found a way to get into an op's forward function. Do you know how to do it?
Usually the nan problem occurs when the initial weights of some layers are set too large, and I've tested that the current initialization is OK. Are you using the latest version of mxnet? You can try downgrading it to 0.7 to see if this error still exists. @peyer You can just print the input you want in an op's forward function; if you're using a GPU, you have to copy it to the CPU before you can print it.
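If editing the C++ ops is inconvenient, another option is MXNet's Monitor, which dumps a statistic for every op's output each batch (a sketch only, assuming training goes through the Module API; the variable names here are illustrative):

import mxnet as mx

# print the largest absolute value of every intermediate array each batch,
# which makes it easy to see which layer first produces NaN or overflow
mon = mx.mon.Monitor(
    interval=1,                                   # check every batch
    stat_func=lambda x: mx.nd.max(mx.nd.abs(x)),  # statistic reported per array
    pattern='.*')                                 # match all arrays

# then pass it to the training call, e.g.
#   mod.fit(train_iter, num_epoch=16, monitor=mon, ...)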
I am actually using the latest version of mxnet; I will download v0.7 and try again. I know how to use print in the negativemining op, but for conv/relu/softmaxoutput/linearRegression I don't know where the op's Python script is.
@Seanlinx I still have a very confusing problem. In the LinearRegression op, does it compute (conv4_2 - bbox_target)^2 = bbox_loss? If so, then in the negativemining op the input is bbox_loss, but you compute (bbox_loss - bbox_target)^2 as the bbox_loss?
There are no Python scripts for these ops; you can find the C++ sources in mxnet/src/operator. The output of the linear regression op is identical to its input.
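A tiny check of that last point, with throwaway shapes, just to make the behaviour concrete:

import mxnet as mx
import numpy as np

# LinearRegressionOutput passes its input through unchanged in the forward pass;
# the squared loss only enters via the backward gradient, proportional to (output - label)
data = mx.sym.Variable('data')
label = mx.sym.Variable('label')
out = mx.sym.LinearRegressionOutput(data=data, label=label)

exe = out.simple_bind(mx.cpu(), data=(1, 4), label=(1, 4))
exe.arg_dict['data'][:] = np.array([[0.1, 0.2, 0.3, 0.4]])
exe.arg_dict['label'][:] = 0
print(exe.forward()[0].asnumpy())   # prints the same values as 'data'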
@Seanlinx Thanks a lot. I think I know how to continue now.
Try a small lr: 0.00001 @peyer
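For example (a sketch only; how the repo's train scripts actually expose the learning rate may differ, and the momentum/wd values here are just placeholders):

import mxnet as mx

# a much smaller SGD step than usual
optimizer = mx.optimizer.SGD(learning_rate=1e-5, momentum=0.9, wd=0.00001)

# with the Module API this would then be passed to fit(), e.g.
#   mod.fit(train_iter, optimizer=optimizer, num_epoch=16, ...)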
@PierreHao thank you, lr=0.00001 works for training P_Net and O_Net.
However, it still doesn't work for R_Net. Do you have any idea?
Try a bigger batch size; if that doesn't work, I think there is something wrong with your data or net. Without more detailed info it's hard to tell. @huynhthedang
I've tested the code on Ubuntu 14.04 + mxnet v0.9.3 and didn't encounter the nan problem you mentioned. @peyer @huynhthedang
For training P_Net, I set pos : part : neg = 1 : 2 : 5 while keeping the other training params, and then I encountered the nan problem.
@Seanlinx With your gen_pnet_data.py the ratio is pos : part : neg = 2 : 3 : 6, and train_P_net.py works. I only changed some of the generating params in gen_pnet_data.py, like randint(), so the final ratio became 1 : 2 : 5; with the learning params you set left unchanged, that leads to nan. I cannot understand why. Can you explain the reason?
@Seanlinx I downgraded mxnet to version 0.7 and found that mxnet_diff.patch was not used. I face the same problem of loss = nan. @peyer I want to know whether you have solved the training-loss nan problem; I hope you can help.