Deformable-ConvNets

RPNL1Loss=nan

mursalal opened this issue 8 years ago • 15 comments

What does this mean? As far as I know, it is not correct. Early in Epoch[0], RPNL1Loss = 0.403792; after that it is always nan.

Epoch[0] Batch [300] Speed: 1.15 samples/sec Train-RPNAcc=0.812539, RPNLogLoss=0.570043, RPNL1Loss=nan, RCNNAcc=0.767390, RCNNLogLoss=3.596807, RCNNL1Loss=0.010698
Epoch[0] Batch [400] Speed: 1.16 samples/sec Train-RPNAcc=0.821910, RPNLogLoss=0.599778, RPNL1Loss=nan, RCNNAcc=0.818267, RCNNLogLoss=3.577709, RCNNL1Loss=0.008031
Epoch[0] Batch [500] Speed: 1.14 samples/sec Train-RPNAcc=0.828243, RPNLogLoss=0.617132, RPNL1Loss=nan, RCNNAcc=0.848381, RCNNLogLoss=3.396140, RCNNL1Loss=0.006434
Epoch[0] Batch [600] Speed: 1.15 samples/sec Train-RPNAcc=0.832391, RPNLogLoss=0.628403, RPNL1Loss=nan, RCNNAcc=0.869345, RCNNLogLoss=2.870057, RCNNL1Loss=0.005387
Epoch[0] Batch [700] Speed: 1.16 samples/sec Train-RPNAcc=0.834384, RPNLogLoss=0.636100, RPNL1Loss=nan, RCNNAcc=0.883437, RCNNLogLoss=2.499374, RCNNL1Loss=0.004622
Epoch[0] Batch [800] Speed: 1.16 samples/sec Train-RPNAcc=0.836050, RPNLogLoss=0.641726, RPNL1Loss=nan, RCNNAcc=0.894673, RCNNLogLoss=2.215910, RCNNL1Loss=0.004046
Epoch[0] Batch [900] Speed: 1.14 samples/sec Train-RPNAcc=0.837489, RPNLogLoss=0.645880, RPNL1Loss=nan, RCNNAcc=0.903519, RCNNLogLoss=1.994231, RCNNL1Loss=0.003597
Epoch[0] Batch [1000] Speed: 1.16 samples/sec Train-RPNAcc=0.838438, RPNLogLoss=0.649027, RPNL1Loss=nan, RCNNAcc=0.910550, RCNNLogLoss=1.816863, RCNNL1Loss=0.003238
Epoch[0] Batch [1100] Speed: 1.16 samples/sec Train-RPNAcc=0.839179, RPNLogLoss=0.650772, RPNL1Loss=nan, RCNNAcc=0.915993, RCNNLogLoss=1.673953, RCNNL1Loss=0.003987
Epoch[0] Batch [1200] Speed: 1.15 samples/sec Train-RPNAcc=0.839684, RPNLogLoss=0.650882, RPNL1Loss=nan, RCNNAcc=0.920535, RCNNLogLoss=1.553478, RCNNL1Loss=0.003658
Epoch[0] Batch [1300] Speed: 1.15 samples/sec Train-RPNAcc=0.840592, RPNLogLoss=0.649718, RPNL1Loss=nan, RCNNAcc=0.924115, RCNNLogLoss=1.452725, RCNNL1Loss=0.003903
Epoch[0] Batch [1400] Speed: 1.14 samples/sec Train-RPNAcc=0.841459, RPNLogLoss=0.647637, RPNL1Loss=nan, RCNNAcc=0.927468, RCNNLogLoss=1.363495, RCNNL1Loss=0.003740
Epoch[0] Batch [1500] Speed: 1.16 samples/sec Train-RPNAcc=0.841013, RPNLogLoss=0.645282, RPNL1Loss=nan, RCNNAcc=0.930463, RCNNLogLoss=1.284670, RCNNL1Loss=0.003492
Epoch[0] Batch [1600] Speed: 1.16 samples/sec Train-RPNAcc=0.841707, RPNLogLoss=0.642184, RPNL1Loss=nan, RCNNAcc=0.932781, RCNNLogLoss=1.216500, RCNNL1Loss=0.003344
Epoch[0] Batch [1700] Speed: 1.16 samples/sec Train-RPNAcc=0.841856, RPNLogLoss=0.638847, RPNL1Loss=nan, RCNNAcc=0.935341, RCNNLogLoss=1.151333, RCNNL1Loss=0.003149
Epoch[0] Batch [1800] Speed: 1.16 samples/sec Train-RPNAcc=0.842007, RPNLogLoss=0.635248, RPNL1Loss=nan, RCNNAcc=0.937574, RCNNLogLoss=1.095160, RCNNL1Loss=0.004580
Epoch[0] Batch [1900] Speed: 1.17 samples/sec Train-RPNAcc=0.842244, RPNLogLoss=0.631594, RPNL1Loss=nan, RCNNAcc=0.939325, RCNNLogLoss=1.043886, RCNNL1Loss=0.004343
Epoch[0] Batch [2000] Speed: 1.17 samples/sec Train-RPNAcc=0.842616, RPNLogLoss=0.627715, RPNL1Loss=nan, RCNNAcc=0.941225, RCNNLogLoss=0.996324, RCNNL1Loss=0.004129
Epoch[0] Batch [2100] Speed: 1.18 samples/sec Train-RPNAcc=0.843182, RPNLogLoss=0.623727, RPNL1Loss=nan, RCNNAcc=0.942862, RCNNLogLoss=0.953685, RCNNL1Loss=0.003934
Epoch[0] Batch [2200] Speed: 1.18 samples/sec Train-RPNAcc=0.843663, RPNLogLoss=0.619788, RPNL1Loss=nan, RCNNAcc=0.944095, RCNNLogLoss=0.915521, RCNNL1Loss=0.003757
Epoch[0] Batch [2300] Speed: 1.17 samples/sec Train-RPNAcc=0.844234, RPNLogLoss=0.615760, RPNL1Loss=nan, RCNNAcc=0.945496, RCNNLogLoss=0.879742, RCNNL1Loss=0.003595
Epoch[0] Batch [2400] Speed: 1.18 samples/sec Train-RPNAcc=0.844505, RPNLogLoss=0.611821, RPNL1Loss=nan, RCNNAcc=0.947011, RCNNLogLoss=0.846207, RCNNL1Loss=0.003446
Epoch[0] Batch [2500] Speed: 1.16 samples/sec Train-RPNAcc=0.844367, RPNLogLoss=0.608176, RPNL1Loss=nan, RCNNAcc=0.947858, RCNNLogLoss=0.819818, RCNNL1Loss=0.004606
Epoch[0] Batch [2600] Speed: 1.18 samples/sec Train-RPNAcc=0.844443, RPNLogLoss=0.604457, RPNL1Loss=nan, RCNNAcc=0.948941, RCNNLogLoss=0.791787, RCNNL1Loss=0.004434

mursalal avatar Jun 02 '17 14:06 mursalal

I got the following:

  • Train-RPNAcc=0.840520
  • RPNLogLoss=0.726791
  • RPNL1Loss=27352.047164
  • RCNNAcc=0.905266
  • RCNNLogLoss=nan
  • RCNNL1Loss=nan

hzh8311 avatar Jun 05 '17 03:06 hzh8311

And in the first 200 iterations, some values of ex_widths and ex_heights are 0, which causes an overflow error. I added 1e-14 to ex_widths and ex_heights; training then continues with a warning, but nonlinear_pred generates very small dw and dh, and the loss is clearly abnormal.
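For reference, a minimal sketch of the epsilon guard described above, assuming the regression-target code in lib/bbox/bbox_transform.py looks like the released version (names such as nonlinear_transform, ex_widths, and ex_heights are taken from that file; treat this as an illustration, not the exact upstream source):

import numpy as np

def nonlinear_transform(ex_rois, gt_rois, eps=1e-14):
    # widths/heights can be 0 for degenerate boxes; eps guards the division and log
    ex_widths = ex_rois[:, 2] - ex_rois[:, 0] + 1.0
    ex_heights = ex_rois[:, 3] - ex_rois[:, 1] + 1.0
    ex_ctr_x = ex_rois[:, 0] + 0.5 * (ex_widths - 1.0)
    ex_ctr_y = ex_rois[:, 1] + 0.5 * (ex_heights - 1.0)
    gt_widths = gt_rois[:, 2] - gt_rois[:, 0] + 1.0
    gt_heights = gt_rois[:, 3] - gt_rois[:, 1] + 1.0
    gt_ctr_x = gt_rois[:, 0] + 0.5 * (gt_widths - 1.0)
    gt_ctr_y = gt_rois[:, 1] + 0.5 * (gt_heights - 1.0)
    targets_dx = (gt_ctr_x - ex_ctr_x) / (ex_widths + eps)
    targets_dy = (gt_ctr_y - ex_ctr_y) / (ex_heights + eps)
    targets_dw = np.log(gt_widths / (ex_widths + eps))
    targets_dh = np.log(gt_heights / (ex_heights + eps))
    return np.vstack((targets_dx, targets_dy, targets_dw, targets_dh)).transpose()

Note that the epsilon only masks the symptom: a zero-width or zero-height box usually means the underlying annotations are bad, which is why the loss still looks abnormal afterwards.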

hzh8311 avatar Jun 05 '17 03:06 hzh8311

Try using a smaller learning rate. From my understanding, too large a learning rate is the most common cause of the L1 loss becoming NaN.
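For reference, the arithmetic used later in this thread scales the learning rate linearly with the number of GPUs (the 0.0005-for-4-GPUs default is quoted in a comment below; treat the numbers as an example, not a universal rule):

# linear scaling: keep lr per GPU constant (assumed default: 0.0005 for 4 GPUs)
default_lr, default_gpus = 0.0005, 4
my_gpus = 1
lr = default_lr * my_gpus / default_gpus
print(lr)  # 0.000125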

YuwenXiong avatar Jun 23 '17 08:06 YuwenXiong

@hzh8311 Have you solved this problem yet? I encountered this problem too. @YuwenXiong Is there any method other than using a smaller learning rate?

franciszzj avatar Aug 14 '17 08:08 franciszzj

@mursalal @hzh8311 I have solved this problem. I think you should check your training data again and again, and write some code to guard against extreme values. Besides, you can check your loss settings, which may be unbalanced.

franciszzj avatar Aug 19 '17 06:08 franciszzj

Check which data type you use for the box coordinates. Try changing it to int16; an unsigned type goes wrong with negative values.
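A quick toy example of why an unsigned coordinate type breaks here (plain numpy, not code from the repo):

import numpy as np

# with an unsigned dtype, 0 - 1 wraps around instead of going negative
boxes = np.array([[0, 5, 10, 15]], dtype=np.uint16)
boxes[:, 0] -= 1
print(boxes[0, 0])  # 65535

# with a signed dtype the underflow at least stays visible
boxes = np.array([[0, 5, 10, 15]], dtype=np.int16)
boxes[:, 0] -= 1
print(boxes[0, 0])  # -1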

Godricly avatar Sep 08 '17 12:09 Godricly

@mursalal I have tried using a smaller learning rate, but it does not work. Do you have any other advice?

YaraDuan avatar Sep 10 '17 14:09 YaraDuan

@AthenaAlala I don't, I am sorry. Try @Godricly's advice.

mursalal avatar Sep 13 '17 12:09 mursalal

I am using my own dataset for an experiment. I changed "uint16" to "int16", but I still meet this problem. How do I deal with my dataset to avoid extreme values?

changzhonghan avatar Oct 26 '17 04:10 changzhonghan

@Franciszzj Can you be more specific about the extreme values in the training data? For example, what issues specifically did you encounter in your training data? Besides, for the loss setting, I used the default, which I thought should be right. How can I check whether it is unbalanced?

yian2271368 avatar Jan 12 '18 02:01 yian2271368

@AthenaAlala you need to change the learning rate, see #146

lxyyang avatar Mar 27 '18 06:03 lxyyang

Make sure the ground-truth coordinates of your data are in VOC format (i.e., not less than 1).

travelerxd avatar May 14 '18 12:05 travelerxd

I solved the problem (at least it worked in my case) by changing the source code in two places:

  1. In lib/dataset/pascal_voc.py, around lines 175~178, comment out the - 1, just like below:

x1 = float(bbox.find('xmin').text)  # - 1
y1 = float(bbox.find('ymin').text)  # - 1
x2 = float(bbox.find('xmax').text)  # - 1
y2 = float(bbox.find('ymax').text)  # - 1

  2. In lib/dataset/imdb.py, around line 210, add the code below:

for b in range(len(boxes)):
    if boxes[b][2] < boxes[b][0]:  # x2 < x1 means x1 wrapped around; reset it
        boxes[b][0] = 0
    if boxes[b][3] < boxes[b][1]:  # the same wrap-around can hit y1
        boxes[b][1] = 0

Because the loader makes the pixel indexes 0-based: VOC annotations are 1-based, so the code subtracts 1. If you do not convert your data accordingly and a coordinate is already 0, then 0 minus 1 wraps around to 65535 (the boxes are stored as uint16), which makes the training loss NaN. You can add print boxes before assert (boxes[:, 2] >= boxes[:, 0]).all() to see the wrong coords.

Hope it helps.
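To make the suggested check concrete, here is a small standalone sketch for scanning ground-truth boxes for inverted or wrapped coordinates before the assert fires (find_bad_boxes and the 60000 threshold are illustrative choices, not repo code; the [x1, y1, x2, y2] layout matches the snippet above):

import numpy as np

def find_bad_boxes(boxes):
    # boxes: (N, 4) array of [x1, y1, x2, y2]
    boxes = np.asarray(boxes)
    inverted = np.where((boxes[:, 2] < boxes[:, 0]) | (boxes[:, 3] < boxes[:, 1]))[0]
    for b in inverted:
        print('inverted box %d: %s' % (b, boxes[b]))
    # coordinates near 65535 are the telltale sign of 0 - 1 in uint16
    wrapped = np.where((boxes > 60000).any(axis=1))[0]
    for b in wrapped:
        print('wrapped box %d: %s' % (b, boxes[b]))
    return inverted, wrapped

Calling this on each roidb entry's boxes right before the assert in imdb.py should point you at the offending annotations.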

maozezhong avatar Jun 22 '18 08:06 maozezhong

I encountered the same problem. I'm training faster-rcnn_dcn on my own dataset. As suggested above, I used lr = 0.000125 with 1 GPU (the default is lr = 0.0005 with 4 GPUs) and followed @maozezhong's instructions, but it didn't work. After that I checked my own dataset's bbox labels <xmin, ymin, xmax, ymax>; maybe a box label had an invalid value, or xmin = 0 or ymin = 0. To avoid xmin or ymin being equal to 0, I added these lines in mydata2voc.py, and it works:

xmin = 1 if xmin == 0 else xmin
ymin = 1 if ymin == 0 else ymin


weiiLu avatar Oct 22 '18 13:10 weiiLu

Hi, I encountered the same problem and solved it. Following @Franciszzj, I checked the dataset again; however, RPNL1Loss still equaled nan. After rm ./data/cache/xxx_gt_roidb.pkl, the problem was solved. It was caused by the stale cache from a previous bad run, so you should clean it before retraining.
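For convenience, a small sketch of that cleanup (the xxx in the path above is a placeholder for the dataset name, so a glob is used here; adjust the path to your setup):

import glob
import os

# remove stale roidb caches so they are rebuilt from the corrected annotations
for f in glob.glob('./data/cache/*_gt_roidb.pkl'):
    os.remove(f)
    print('removed stale roidb cache:', f)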

weixia1 avatar May 19 '19 08:05 weixia1