Detectron.pytorch

Loss keeps increasing

Open SingL3 opened this issue 5 years ago • 2 comments

Hello! I am trying to train on my own dataset, and I am getting losses like the ones below:

[Dec21-15-15-54_user-SYS-7048GR-TR_step][FPN_SE_ARP.yml][Step 201 / 150000] loss: 0.041566, lr: 0.002250 time: 1.166229, eta: 2 days, 0:31:41 accuracy_cls: 0.997662 loss_cls: 0.016658, loss_bbox: 0.002761 loss_rpn_cls: 0.013647, loss_rpn_bbox: 0.002336 loss_rpn_cls_fpn2: 0.000000, loss_rpn_cls_fpn3: 0.006667, loss_rpn_cls_fpn4: 0.003009, loss_rpn_cls_fpn5: 0.000830 loss_rpn_bbox_fpn2: 0.000000, loss_rpn_bbox_fpn3: 0.000291, loss_rpn_bbox_fpn4: 0.000090, loss_rpn_bbox_fpn5: 0.000000
[Dec21-15-15-54_user-SYS-7048GR-TR_step][FPN_SE_ARP.yml][Step 221 / 150000] loss: 0.102637, lr: 0.002350 time: 1.162983, eta: 2 days, 0:23:11 accuracy_cls: 0.992798 loss_cls: 0.048543, loss_bbox: 0.021517 loss_rpn_cls: 0.021783, loss_rpn_bbox: 0.003471 loss_rpn_cls_fpn2: 0.000000, loss_rpn_cls_fpn3: 0.009707, loss_rpn_cls_fpn4: 0.008061, loss_rpn_cls_fpn5: 0.000740 loss_rpn_bbox_fpn2: 0.000000, loss_rpn_bbox_fpn3: 0.001668, loss_rpn_bbox_fpn4: 0.000856, loss_rpn_bbox_fpn5: 0.000000
[Dec21-15-15-54_user-SYS-7048GR-TR_step][FPN_SE_ARP.yml][Step 241 / 150000] loss: 1.025869, lr: 0.002450 time: 1.151779, eta: 1 day, 23:54:50 accuracy_cls: 0.996862 loss_cls: 0.030038, loss_bbox: 0.010628 loss_rpn_cls: 0.356613, loss_rpn_bbox: 0.303246 loss_rpn_cls_fpn2: 0.000000, loss_rpn_cls_fpn3: 0.022924, loss_rpn_cls_fpn4: 0.019702, loss_rpn_cls_fpn5: 0.001183 loss_rpn_bbox_fpn2: 0.000000, loss_rpn_bbox_fpn3: 0.001148, loss_rpn_bbox_fpn4: 0.005298, loss_rpn_bbox_fpn5: 0.000000
[Dec21-15-15-54_user-SYS-7048GR-TR_step][FPN_SE_ARP.yml][Step 261 / 150000] loss: 75.125702, lr: 0.002550 time: 1.126814, eta: 1 day, 22:52:09 accuracy_cls: 0.969669 loss_cls: 0.141362, loss_bbox: 0.050388 loss_rpn_cls: 22.343872, loss_rpn_bbox: 37.826141 loss_rpn_cls_fpn2: 0.000000, loss_rpn_cls_fpn3: 9.277060, loss_rpn_cls_fpn4: 10.976149, loss_rpn_cls_fpn5: 0.000000 loss_rpn_bbox_fpn2: 0.000000, loss_rpn_bbox_fpn3: 22.297930, loss_rpn_bbox_fpn4: 18.230148, loss_rpn_bbox_fpn5: 0.000000
[Dec21-15-15-54_user-SYS-7048GR-TR_step][FPN_SE_ARP.yml][Step 281 / 150000] loss: 4103.495605, lr: 0.002650 time: 1.105512, eta: 1 day, 21:58:37 accuracy_cls: 0.971781 loss_cls: 0.153655, loss_bbox: 0.010857 loss_rpn_cls: 1187.903809, loss_rpn_bbox: 912.912476 loss_rpn_cls_fpn2: 0.000000, loss_rpn_cls_fpn3: 236.092331, loss_rpn_cls_fpn4: 391.779114, loss_rpn_cls_fpn5: 0.000000 loss_rpn_bbox_fpn2: 0.000000, loss_rpn_bbox_fpn3: 667.822815, loss_rpn_bbox_fpn4: 86.375725, loss_rpn_bbox_fpn5: 0.000000

and the loss just keeps increasing to an unexpectedly big number. Previously, I got a warning like this:

lib/utils/boxes.py:66: RuntimeWarning: Negative areas founds: 2
  warnings.warn("Negative areas founds: %d" % neg_area_idx.size, RuntimeWarning)

so I tried the fix by Ross in https://github.com/facebookresearch/Detectron/commit/47e457a581c2623aeaf18156ad3c0b0eb56c9cd8.
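Since that warning points at degenerate ground-truth boxes, one quick sanity check on the dataset itself is to scan the annotations for boxes with non-positive width or height. A minimal sketch (not this repo's code), assuming COCO-style annotations where "bbox" is [x, y, width, height]; the annotation file path is hypothetical:

    # Sketch: flag ground-truth boxes whose width or height is <= 0, which is
    # what produces the "Negative areas" warning once boxes are converted to
    # (x1, y1, x2, y2). Assumes COCO-style JSON; the path below is hypothetical.
    import json

    with open("annotations/instances_train.json") as f:  # hypothetical path
        coco = json.load(f)

    bad = []
    for ann in coco["annotations"]:
        x, y, w, h = ann["bbox"]
        if w <= 0 or h <= 0:
            bad.append(ann["id"])

    print("degenerate boxes: %d" % len(bad), bad[:10])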

And now I get this loss. Do you know why this happens?

SingL3 avatar Dec 21 '18 07:12 SingL3

Same as me! But after some time of training, the loss started to decrease.

hzhang33BEI avatar Dec 27 '18 08:12 hzhang33BEI

Nope. My loss just increased to something like 10 million and never decreased, which is not reasonable. Actually, I am trying to train a model that can accept images without objects, so I have changed some code, and I am not sure if that is the reason.
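Whatever the root cause turns out to be, one generic safeguard against this kind of divergence is gradient clipping. A minimal, self-contained sketch (not this repo's training loop; the tiny model, data, and max_norm threshold are placeholders):

    # Sketch: clip the global gradient norm before each optimizer step so a
    # few bad batches cannot blow the weights up. Placeholders stand in for
    # the real detector, data loader, and learning-rate schedule.
    import torch
    import torch.nn as nn

    model = nn.Linear(4, 2)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.0025, momentum=0.9)

    x = torch.randn(8, 4)
    target = torch.randint(0, 2, (8,))

    optimizer.zero_grad()
    loss = nn.functional.cross_entropy(model(x), target)
    loss.backward()
    # Cap the global gradient norm; 10.0 is an arbitrary example threshold.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=10.0)
    optimizer.step()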

SingL3 avatar Dec 27 '18 08:12 SingL3