Deformable-ConvNets

FPN training: divide by zero, RPNL1Loss explodes

Open smorrel1 opened this issue 6 years ago • 8 comments

Hi, please could you assist. I'm training FPN on COCO as per the instructions and get a large RPNL1Loss. It is coming down very, very slowly, and I suspect training may not work, or at least will be delayed a lot.

Any assistance appreciated! Thanks, Stephen

Attached: log-error.txt

smorrel1 avatar Jan 15 '18 20:01 smorrel1

The same thing happens to me when I use FPN + ResNet-101 + DCN to train on my own dataset, but the same data works fine with ResNet-101 + DCN.

The bad log looks like the following:

Epoch[0] Batch [100] Speed: 3.10 samples/sec Train-RPNAcc=0.714563, RPNLogLoss=0.677545, RPNL1Loss=0.119950, Proposal FG Fraction=0.008675, R-CNN FG Accuracy=0.034800, RCNNAcc=0.956340, RCNNLogLoss=1.054744, RCNNL1Loss=191189370617.151367,

Epoch[0] Batch [200] Speed: 3.12 samples/sec Train-RPNAcc=0.720455, RPNLogLoss=0.663638, RPNL1Loss=0.113055, Proposal FG Fraction=0.008540, R-CNN FG Accuracy=0.033646, RCNNAcc=0.954282, RCNNLogLoss=1.296537, RCNNL1Loss=257069246790015490457600.000000,

Epoch[0] Batch [300] Speed: 3.08 samples/sec Train-RPNAcc=0.721229, RPNLogLoss=0.648105, RPNL1Loss=0.111896, Proposal FG Fraction=0.008614, R-CNN FG Accuracy=0.038954, RCNNAcc=0.953531, RCNNLogLoss=nan, RCNNL1Loss=nan,

fighting-liu avatar Jan 16 '18 06:01 fighting-liu

I encountered the same problem as you. Have you solved it? @smorrel1

LiangSiyuan21 avatar Jan 17 '18 02:01 LiangSiyuan21

Did you use the default learning rate (0.01)? If you use only one GPU for training, try setting lr = 0.00125.

Puzer avatar Feb 10 '18 21:02 Puzer

@Puzer Thanks, that solved it! Yes, I used the default lr = 0.01 with 2 GPUs (I now have 4, and 0.005 works). Maybe we should use lr = 0.00125 * number of GPUs?
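A minimal sketch of that linear scaling rule in Python (the helper name is my own; the 0.00125-per-GPU base value is the one suggested above, not an official constant):

def scaled_lr(num_gpus, base_lr_per_gpu=0.00125):
    """Scale the learning rate linearly with the number of GPUs."""
    return base_lr_per_gpu * num_gpus

print(scaled_lr(1))  # 0.00125  (single GPU, as suggested above)
print(scaled_lr(4))  # 0.005    (matches what worked with 4 GPUs)
print(scaled_lr(8))  # 0.01     (the default lr, i.e. it corresponds to 8 GPUs under this rule)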

smorrel1 avatar Feb 10 '18 22:02 smorrel1

I ran into the same problem as you (see the attached image).

hedes1992 avatar Mar 02 '18 08:03 hedes1992

I changed the learning rate to 1e-5, but the error was still raised.

Kongsea avatar Jun 22 '18 02:06 Kongsea

I solved the problem (at least it worked in my case) by changing the source code in two places:

1. In lib/dataset/pascal_voc.py, around lines 175-178, comment out the "- 1" as below:

   x1 = float(bbox.find('xmin').text)  # - 1
   y1 = float(bbox.find('ymin').text)  # - 1
   x2 = float(bbox.find('xmax').text)  # - 1
   y2 = float(bbox.find('ymax').text)  # - 1

2. In lib/dataset/imdb.py, around line 210, add the code below:

   for b in range(len(boxes)):
       if boxes[b][2] < boxes[b][0]:
           boxes[b][0] = 0

The loader assumes VOC-style 1-based pixel indexes and subtracts 1 to make them 0-based. If your annotations are already 0-based and you do not convert them accordingly, 0 minus 1 underflows to 65535 (the boxes are stored as unsigned 16-bit integers), which makes the training loss NaN. You can add print boxes before assert (boxes[:, 2] >= boxes[:, 0]).all() to see the wrong coordinates.
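A minimal sketch of the wrap-around (assuming the roidb boxes use np.uint16, which is what the 65535 value indicates):

import numpy as np

# A box whose xmin/ymin are already 0-based.
boxes = np.array([[0, 0, 50, 60]], dtype=np.uint16)

# Subtracting 1 wraps around instead of going negative for unsigned ints.
print(boxes - 1)  # [[65535 65535    49    59]] -> x1 > x2, the assert fails and the loss blows up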

Hope it helps.

maozezhong avatar Jun 22 '18 08:06 maozezhong

I found it to be a combination of the < 1 box edges and the higher learning rate with fewer GPUs (a combined sanity-check sketch follows the steps below).

1. Make sure that when your dataset loaders build the boxes you have something like: boxes[ix, :] = [max(x1, 1), max(y1, 1), x2, y2]

   For COCO I also changed it to:

   x1 = np.max((1, x))
   y1 = np.max((1, y))
   x2 = np.min((width - 1, x1 + np.max((1, w - 1))))
   y2 = np.min((height - 1, y1 + np.max((1, h - 1))))
   if obj['area'] > 0 and x2 > x1 and y2 > y1:

2. In imdb.py, where the flipped boxes are made, add some code before the assert (boxes[:, 2] >= boxes[:, 0]).all() as suggested above, or do what I did:

   boxes[:, 0] = roi_rec['width'] - oldx2  # - 1
   boxes[:, 2] = roi_rec['width'] - oldx1  # - 1
   boxes[boxes < 1] = 1  # ensure flipped boxes also have coords >= 1
   for b in range(len(boxes)):
       if boxes[b][2] <= boxes[b][0]:
           boxes[b][2] = boxes[b][0] + 1
   assert (boxes[:, 2] > boxes[:, 0]).all()

3. Use a learning rate of 0.00125 * num_gpus.
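Putting the box-related fixes together, here is a minimal stand-alone sketch of a sanity check you could run over your boxes before training; the function name and the exact clamping thresholds are my own, not from the repo:

import numpy as np

def sanitize_boxes(boxes, width, height):
    """Clamp [x1, y1, x2, y2] boxes into the image and force x2 > x1, y2 > y1."""
    boxes = boxes.astype(np.float32).copy()
    boxes[:, 0] = np.clip(boxes[:, 0], 1, width - 2)   # x1
    boxes[:, 1] = np.clip(boxes[:, 1], 1, height - 2)  # y1
    boxes[:, 2] = np.clip(boxes[:, 2], 2, width - 1)   # x2
    boxes[:, 3] = np.clip(boxes[:, 3], 2, height - 1)  # y2
    # Guarantee a strictly positive width and height for every box.
    boxes[:, 2] = np.maximum(boxes[:, 2], boxes[:, 0] + 1)
    boxes[:, 3] = np.maximum(boxes[:, 3], boxes[:, 1] + 1)
    assert (boxes[:, 2] > boxes[:, 0]).all() and (boxes[:, 3] > boxes[:, 1]).all()
    return boxes

# Example: a degenerate 0-based box gets pulled back into a valid range.
print(sanitize_boxes(np.array([[0, 0, 0, 0]]), width=640, height=480))
# [[1. 1. 2. 2.]]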

HaydenFaulkner avatar Sep 25 '18 10:09 HaydenFaulkner