Debug info: NaN at Train-LogLoss and Train-BBOX_MSE
Hi Seanlinx,
I followed your instructions to generate training data for the three CNN networks (P_Net, R_Net, O_Net) from the WIDER_FACE training dataset. Everything seemed to work well since no errors appeared.
However, when I use demo.py (with my own trained model) to run a test on some photos, an error appears:
Called with argument:
Namespace(batch_size=[2048, 256, 16], epoch=[16, 16, 16], gpu_id=-1, min_face=40, prefix=['model/pnet', 'model/rnet', 'model/onet'], slide_window=False, stride=2, thresh=[0.5, 0.5, 0.7])
/Users/dhuynh/Documents/TestCode/mtcnn/core/MtcnnDetector.py:357: RuntimeWarning: invalid value encountered in greater
keep_inds = np.where(cls_scores > self.thresh[1])[0]
Traceback (most recent call last):
File "demo.py", line 94, in <module>
args.stride, args.slide_window)
File "demo.py", line 48, in test_net
boxes, boxes_c = mtcnn_detector.detect_onet(img, boxes_c)
File "/Users/dhuynh/Documents/TestCode/mtcnn/core/MtcnnDetector.py", line 391, in detect_onet
dets = self.convert_to_square(dets)
File "/Users/dhuynh/Documents/TestCode/mtcnn/core/MtcnnDetector.py", line 47, in convert_to_square
square_bbox = bbox.copy()
AttributeError: 'NoneType' object has no attribute 'copy'
I checked the debug info from the training phase and found that Train-LogLoss=nan and Train-BBOX_MSE=nan all the time:
INFO:root:Epoch[2] Batch [1600] Speed: 4544.95 samples/sec Train-Accuracy=0.813565
INFO:root:Epoch[2] Batch [1600] Speed: 4544.95 samples/sec Train-LogLoss=nan
INFO:root:Epoch[2] Batch [1600] Speed: 4544.95 samples/sec Train-BBOX_MSE=nan
Your trained model still works perfectly, so it seems the error stems from my training phase, but I cannot figure out what I did wrong. Could you please help me out? Thanks.
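(A side note on how the two are probably connected: with NaN scores, the comparison at MtcnnDetector.py:357 triggers that RuntimeWarning and keeps no boxes, so the next stage apparently receives None and convert_to_square crashes. A minimal numpy illustration with made-up scores:)

import numpy as np

cls_scores = np.array([np.nan, np.nan])       # the kind of scores a NaN-trained net produces
keep_inds = np.where(cls_scores > 0.5)[0]     # may warn: invalid value encountered in greater
print(keep_inds)                              # -> empty: every box is dropped, later stages get nothing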
@huynhthedang I got the same problem. The cls_prob and pred_delta become very large (~10^24) after two data batches.
Did you make any changes to the code?
I only changed the path to my data, and I modified mxnet/src/regression_output-inl.h according to mxnet_diff.patch.
@Seanlinx I tried to inspect every op's input and output, but I haven't found a way to get into an op's forward function. Do you know how to do it?
Usually the nan problem occurs when the initial weights of some layers are set too large, and I've tested that the current initialization is OK. Are you using the latest version of mxnet? You can try downgrading it to 0.7 to see if this error still exists. @peyer You can just print the input you want in an op's forward function; if you're using a GPU, you have to copy it to the CPU before you can print it.
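If editing the C++ ops is inconvenient, another option is MXNet's Monitor, which dumps a statistic for every op's output each batch (a sketch only, assuming training goes through the Module API; the variable names here are illustrative):

import mxnet as mx

# print the largest absolute value of every intermediate array each batch,
# which makes it easy to see which layer first produces NaN or overflow
mon = mx.mon.Monitor(
    interval=1,                                   # check every batch
    stat_func=lambda x: mx.nd.max(mx.nd.abs(x)),  # statistic reported per array
    pattern='.*')                                 # match all arrays

# then pass it to the training call, e.g.
#   mod.fit(train_iter, num_epoch=16, monitor=mon, ...)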
I am actually using the latest version of mxnet; I will download v0.7 and try again. I know how to use print in the negativemining op, but for conv/relu/softmaxoutput/linearRegression I don't know where the op's Python script is.
@Seanlinx I still have a very confusing problem. In the LinearRegression op, does it compute (conv4_2 - bbox_target)^2 = bbox_loss? If so, then in the negativemining op the input is bbox_loss, but you compute (bbox_loss - bbox_target)^2 as the bbox_loss?
There are no Python scripts for these ops; you can find the C++ sources in mxnet/src/operator. The output of the linear regression op is identical to its input.
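A tiny check of that last point, with throwaway shapes, just to make the behaviour concrete:

import mxnet as mx
import numpy as np

# LinearRegressionOutput passes its input through unchanged in the forward pass;
# the squared loss only enters via the backward gradient, proportional to (output - label)
data = mx.sym.Variable('data')
label = mx.sym.Variable('label')
out = mx.sym.LinearRegressionOutput(data=data, label=label)

exe = out.simple_bind(mx.cpu(), data=(1, 4), label=(1, 4))
exe.arg_dict['data'][:] = np.array([[0.1, 0.2, 0.3, 0.4]])
exe.arg_dict['label'][:] = 0
print(exe.forward()[0].asnumpy())   # prints the same values as 'data'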
@Seanlinx Thanks a lot. I think I know how to continue now.
Try a small lr: 0.00001 @peyer
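For example (a sketch only; how the repo's train scripts actually expose the learning rate may differ, and the momentum/wd values here are just placeholders):

import mxnet as mx

# a much smaller SGD step than usual
optimizer = mx.optimizer.SGD(learning_rate=1e-5, momentum=0.9, wd=0.00001)

# with the Module API this would then be passed to fit(), e.g.
#   mod.fit(train_iter, optimizer=optimizer, num_epoch=16, ...)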
@PierreHao thank you, lr=0.00001 works for training P_Net and O_Net.
However, it still doesn't work for R_Net. Do you have any idea?
Try a bigger batch size; if that doesn't work, I think there is something wrong with your data or net. Without more detailed info it's hard to tell. @huynhthedang
I've tested the code on Ubuntu 14.04 + mxnet v0.9.3 and didn't encounter the nan problem you mentioned. @peyer @huynhthedang
For training P_Net, I set pos : part : neg = 1 : 2 : 5 while keeping the other training params, and then I encountered the nan problem.
@Seanlinx With your gen_pnet_data.py the ratio is pos : part : neg = 2 : 3 : 6, and train_P_net.py works. I only changed some of the generating params in gen_pnet_data.py, like randint(), so the final ratio became 1 : 2 : 5; with the learning params you set left unchanged, that leads to nan. I cannot understand why. Can you explain the reason?
@Seanlinx I downgraded mxnet to version 0.7 and found that mxnet_diff.patch was not used. I face the same problem of loss = nan. @peyer I want to know whether you have solved the training-loss nan problem; I hope you can help.