Deformable-ConvNets
FPN training: divide by zero, RPNL1Loss explodes
Hi, could you please assist? I'm training FPN on COCO as per the instructions and get a large RPNL1Loss. It is coming down very, very slowly, and I suspect training may not work, or at least be delayed a lot.
Any assistance appreciated! Thanks, Stephen log-error.txt
The same thing happens to me when I use FPN+ResNet101+DCN to train on my own dataset, but the same data works fine with ResNet101+DCN. The bad log looks like this:
Epoch[0] Batch [100] Speed: 3.10 samples/sec Train-RPNAcc=0.714563, RPNLogLoss=0.677545, RPNL1Loss=0.119950, Proposal FG Fraction=0.008675, R-CNN FG Accuracy=0.034800, RCNNAcc=0.956340, RCNNLogLoss=1.054744, RCNNL1Loss=191189370617.151367,
Epoch[0] Batch [200] Speed: 3.12 samples/sec Train-RPNAcc=0.720455, RPNLogLoss=0.663638, RPNL1Loss=0.113055, Proposal FG Fraction=0.008540, R-CNN FG Accuracy=0.033646, RCNNAcc=0.954282, RCNNLogLoss=1.296537, RCNNL1Loss=257069246790015490457600.000000,
Epoch[0] Batch [300] Speed: 3.08 samples/sec Train-RPNAcc=0.721229, RPNLogLoss=0.648105, RPNL1Loss=0.111896, Proposal FG Fraction=0.008614, R-CNN FG Accuracy=0.038954, RCNNAcc=0.953531, RCNNLogLoss=nan, RCNNL1Loss=nan,
I ran into the same issue. Have you solved it? @smorrel1
Did you use the default learning rate (0.01)? If you train with only one GPU, try setting lr = 0.00125.
@Puzer Thanks, that solved it! Yes, I used the default lr=0.01 with 2 GPUs (I now have 4, and 0.005 works). Maybe we should use lr = 0.00125 * number of GPUs?
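The linear scaling rule suggested here can be sketched as a tiny helper. Note that `scaled_lr` is a hypothetical name and the per-GPU base rate of 0.00125 comes from this thread, not from the repo's config:

```python
# Hypothetical helper sketching the thread's linear LR scaling rule.
# The per-GPU base rate of 0.00125 is from this discussion, not repo code.
def scaled_lr(num_gpus, base_lr_per_gpu=0.00125):
    """Return the learning rate scaled linearly with GPU count."""
    return base_lr_per_gpu * num_gpus
```

With 8 GPUs this recovers the default 0.01, with 4 GPUs the 0.005 that worked above, and with 1 GPU the suggested 0.00125.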
I hit the same problem. I changed the learning rate to 1e-5, but the error was still raised.
I solved the problem (at least it worked in my case) by changing the source code:
- In lib/dataset/pascal_voc.py, around lines 175~178, comment out the - 1, like below:
x1 = float(bbox.find('xmin').text) #- 1
y1 = float(bbox.find('ymin').text) #- 1
x2 = float(bbox.find('xmax').text) #- 1
y2 = float(bbox.find('ymax').text) #- 1
- In lib/dataset/imdb.py, around line 210, add the code below:
for b in range(len(boxes)):
    if boxes[b][2] < boxes[b][0]:
        boxes[b][0] = 0
The VOC format uses 1-based pixel indexes, so the loader subtracts 1 from every coordinate. If your own annotations are already 0-based and you do not adjust them accordingly, 0 minus 1 underflows the uint16 box array to 65535, which makes the training loss go NaN. You can add print boxes
before the assert (boxes[:, 2] >= boxes[:, 0]).all()
to see the wrong coords.
Hope it helps.
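To see why the corrupt coords appear, here is a minimal sketch of the uint16 underflow described above (assuming the box array dtype is uint16, as in the py-faster-rcnn-style loaders this repo is based on):

```python
import numpy as np

# Boxes are stored as uint16; subtracting 1 from a 0-based coordinate of 0
# wraps around instead of going negative.
boxes = np.array([[0, 0, 10, 10]], dtype=np.uint16)
shifted = boxes - 1  # the loader's "- 1" adjustment
print(shifted[0, 0])  # wraps to 65535, later tripping the x2 >= x1 assert
```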
I found it to be a combination of the <1 box edges and the higher learning rate with fewer GPUs.
- Make sure your dataset loaders, when making the boxes, do something like:
boxes[ix, :] = [max(x1,1), max(y1,1), x2, y2]
For COCO I also changed it to:
x1 = np.max((1, x))
y1 = np.max((1, y))
x2 = np.min((width - 1, x1 + np.max((1, w - 1))))
y2 = np.min((height - 1, y1 + np.max((1, h - 1))))
if obj['area'] > 0 and x2 > x1 and y2 > y1:
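The COCO clamping above can be wrapped into a small self-contained function. `clip_coco_box` is a hypothetical name, and the 1-based clamping mirrors the snippet rather than the repo's exact code:

```python
import numpy as np

def clip_coco_box(x, y, w, h, width, height):
    """Clamp a COCO [x, y, w, h] box to 1-based corner coords inside the
    image, mirroring the snippet above. Returns None for degenerate boxes."""
    x1 = np.max((1, x))
    y1 = np.max((1, y))
    x2 = np.min((width - 1, x1 + np.max((1, w - 1))))
    y2 = np.min((height - 1, y1 + np.max((1, h - 1))))
    if x2 > x1 and y2 > y1:
        return x1, y1, x2, y2
    return None  # drop boxes that collapse after clamping
```

For example, a box starting at the image origin gets pushed to (1, 1), and a box hugging the right edge that collapses to zero width is dropped.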
- In the imdb.py file, where it makes the flipped boxes, add some code before the assert (boxes[:, 2] >= boxes[:, 0]).all()
as suggested above; I did:
boxes[:, 0] = roi_rec['width'] - oldx2  # - 1
boxes[:, 2] = roi_rec['width'] - oldx1  # - 1
boxes[boxes < 1] = 1  # used to ensure flipped boxes are also 1+ in coords
for b in range(len(boxes)):
    if boxes[b][2] <= boxes[b][0]:
        boxes[b][2] = boxes[b][0] + 1
assert (boxes[:, 2] > boxes[:, 0]).all()
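Putting the flip fix together as a self-contained sketch (`flip_boxes` is a hypothetical helper, not the repo's function; it assumes 1-based [x1, y1, x2, y2] boxes):

```python
import numpy as np

def flip_boxes(boxes, width):
    """Horizontally flip [x1, y1, x2, y2] boxes and sanitize the result so
    x2 > x1 always holds, as in the imdb.py patch above."""
    boxes = boxes.astype(np.float32).copy()
    oldx1 = boxes[:, 0].copy()
    oldx2 = boxes[:, 2].copy()
    boxes[:, 0] = width - oldx2  # no "- 1" here, matching the fix above
    boxes[:, 2] = width - oldx1
    boxes[boxes < 1] = 1  # keep flipped coords 1-based
    bad = boxes[:, 2] <= boxes[:, 0]
    boxes[bad, 2] = boxes[bad, 0] + 1  # force a valid 1-px-wide box
    assert (boxes[:, 2] > boxes[:, 0]).all()
    return boxes
```

For example, flipping [10, 5, 30, 40] in a 100-px-wide image gives [70, 5, 90, 40], and any box that would collapse or go below coordinate 1 is repaired instead of crashing the assert.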
- Use a learning rate of 0.00125 * num_gpus