Relation-Networks-for-Object-Detection

one problem in the network training

Open WBinke opened this issue 7 years ago • 2 comments

When I train on COCO as described in the README, I hit the problem shown in the log below: NMSLoss_pos and NMSLoss_neg become nan. Has anyone met the same problem, and can you give me some help?


('lr', 0.0005, 'lr_epoch_diff', [5.33], 'lr_iters', [625027])
Epoch[0] Batch [100] Speed: 5.08 samples/sec Train-RPNAcc=0.847250, RPNLogLoss=0.376764, RPNL1Loss=0.187504, RCNNAcc=0.801361, RCNNLogLoss=1.674762, RCNNL1Loss=0.311297, NMSLoss_pos=0.035744, NMSLoss_neg=0.016391, NMSAcc_pos=0.000000, NMSAcc_neg=1.000000,
Epoch[0] Batch [200] Speed: 5.10 samples/sec Train-RPNAcc=0.865089, RPNLogLoss=0.328289, RPNL1Loss=0.176516, RCNNAcc=0.811237, RCNNLogLoss=1.380794, RCNNL1Loss=0.316205, NMSLoss_pos=0.048681, NMSLoss_neg=0.013534, NMSAcc_pos=0.000000, NMSAcc_neg=1.000000,
Epoch[0] Batch [300] Speed: 5.11 samples/sec Train-RPNAcc=0.874916, RPNLogLoss=0.302038, RPNL1Loss=0.159570, RCNNAcc=0.802546, RCNNLogLoss=1.319950, RCNNL1Loss=0.352934, NMSLoss_pos=0.057433, NMSLoss_neg=0.013499, NMSAcc_pos=0.000000, NMSAcc_neg=1.000000,
experiments/relation_rcnn/../../relation_rcnn/../lib/bbox/bbox_transform.py:128: RuntimeWarning: overflow encountered in exp
  pred_w = np.exp(dw) * widths[:, np.newaxis]
experiments/relation_rcnn/../../relation_rcnn/../lib/bbox/bbox_transform.py:129: RuntimeWarning: overflow encountered in exp
  pred_h = np.exp(dh) * heights[:, np.newaxis]
experiments/relation_rcnn/../../relation_rcnn/../lib/bbox/bbox_transform.py:133: RuntimeWarning: invalid value encountered in subtract
  pred_boxes[:, 0::4] = pred_ctr_x - 0.5 * (pred_w - 1.0)
experiments/relation_rcnn/../../relation_rcnn/../lib/bbox/bbox_transform.py:135: RuntimeWarning: invalid value encountered in subtract
  pred_boxes[:, 1::4] = pred_ctr_y - 0.5 * (pred_h - 1.0)
experiments/relation_rcnn/../../relation_rcnn/../lib/bbox/bbox_transform.py:137: RuntimeWarning: invalid value encountered in add
  pred_boxes[:, 2::4] = pred_ctr_x + 0.5 * (pred_w - 1.0)
experiments/relation_rcnn/../../relation_rcnn/../lib/bbox/bbox_transform.py:139: RuntimeWarning: invalid value encountered in add
  pred_boxes[:, 3::4] = pred_ctr_y + 0.5 * (pred_h - 1.0)
experiments/relation_rcnn/../../relation_rcnn/operator_py/proposal.py:180: RuntimeWarning: invalid value encountered in greater_equal
  keep = np.where((ws >= min_size) & (hs >= min_size))[0]
Epoch[0] Batch [400] Speed: 5.02 samples/sec Train-RPNAcc=0.871289, RPNLogLoss=nan, RPNL1Loss=nan, RCNNAcc=0.810123, RCNNLogLoss=1.576645, RCNNL1Loss=0.334166, NMSLoss_pos=0.054120, NMSLoss_neg=nan, NMSAcc_pos=0.000000, NMSAcc_neg=0.999650,
Epoch[0] Batch [500] Speed: 4.91 samples/sec Train-RPNAcc=0.859804, RPNLogLoss=nan, RPNL1Loss=nan, RCNNAcc=0.836702, RCNNLogLoss=1.888214, RCNNL1Loss=0.267614, NMSLoss_pos=nan, NMSLoss_neg=nan, NMSAcc_pos=0.000000, NMSAcc_neg=0.999720,
Epoch[0] Batch [600] Speed: 4.99 samples/sec Train-RPNAcc=0.850682, RPNLogLoss=nan, RPNL1Loss=nan, RCNNAcc=0.853031, RCNNLogLoss=1.725999, RCNNL1Loss=0.223882, NMSLoss_pos=nan, NMSLoss_neg=nan, NMSAcc_pos=0.000000, NMSAcc_neg=0.999767,
Epoch[0] Batch [700] Speed: 4.98 samples/sec Train-RPNAcc=0.844466, RPNLogLoss=nan, RPNL1Loss=nan, RCNNAcc=0.865544, RCNNLogLoss=1.547918, RCNNL1Loss=0.192278, NMSLoss_pos=nan, NMSLoss_neg=nan, NMSAcc_pos=0.000000, NMSAcc_neg=0.999800,

WBinke, Jul 02 '18 11:07

If you encounter NaN, please rerun training a few times until it no longer appears; some random initializations can cause divergence. If the problem persists, the base lr may be too large for your task. In that case, please use a smaller base lr.

ancientmooner, Sep 10 '18 05:09
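
For readers wondering how divergence turns into the warnings in the log, here is a minimal NumPy sketch. It is not the repository's code: `decode_width` is a hypothetical helper that only mirrors the `pred_w = np.exp(dw) * widths[:, np.newaxis]` line quoted in the log, and the delta values are made up for illustration. The assumption is that a diverged network emits very large `dw`, which overflows `exp` to `inf`; later arithmetic on infinities (for example `inf - inf`) then produces `nan`.

```python
import numpy as np

# Hypothetical anchor width, chosen only for illustration.
widths = np.array([16.0], dtype=np.float32)


def decode_width(dw):
    """Mirror of the pred_w line from the log: exp(dw) scaled by the anchor width."""
    return np.exp(dw) * widths[:, np.newaxis]


# Silence the RuntimeWarnings here so the demo runs quietly;
# in the training log they are exactly the warnings that were printed.
with np.errstate(over="ignore", invalid="ignore"):
    # A sane regression delta gives a finite width.
    normal = decode_width(np.array([[0.1]], dtype=np.float32))
    # A huge delta (made-up value) overflows float32 exp and gives inf.
    diverged = decode_width(np.array([[100.0]], dtype=np.float32))
    # Arithmetic between infinities is undefined and yields nan.
    poisoned = diverged - diverged

print(np.isfinite(normal).all(), np.isinf(diverged).all(), np.isnan(poisoned).all())
```

Once such values reach the loss, every later batch reports nan, which would be consistent with the log above. A smaller base lr makes it less likely that the deltas grow this large in the first place.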

Seconded. Either your data layer is incorrect or you need to alter the learning policy (use smaller base lr, try warmup, ...)

yafz, Apr 18 '19 08:04
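
As a rough illustration of the warmup idea mentioned above, here is a minimal sketch of a linear warmup schedule in plain Python. It is an assumption-laden example, not the repository's config or scheduler: the function name `warmup_lr`, the starting rate, and the number of warmup steps are all hypothetical; only the base lr of 0.0005 is taken from the log in the original post.

```python
def warmup_lr(step, base_lr=0.0005, warmup_start_lr=0.00005, warmup_steps=1000):
    """Return the learning rate for a given training step.

    Ramps linearly from warmup_start_lr to base_lr over warmup_steps,
    then stays at base_lr. All defaults except base_lr are made-up values.
    """
    if step >= warmup_steps:
        return base_lr
    frac = step / float(warmup_steps)
    return warmup_start_lr + frac * (base_lr - warmup_start_lr)


# Print the schedule at a few steps to see the ramp.
for s in (0, 250, 500, 1000, 5000):
    print(s, warmup_lr(s))
```

The point of the ramp is that the earliest updates, when the weights are still near their random initialization, are taken with a small step size. Whether this alone removes the NaN depends on the setup; lowering `base_lr` itself is the other option suggested in this thread.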