faster-rcnn.pytorch

Loss is always nan

Open CXY573 opened this issue 7 years ago • 8 comments

Hi, I'm trying to use the code to train on my own data, but it doesn't work.

Sometimes in the first print_log, loss is not nan:

[session 1][epoch 1][iter 0/ 300] loss: 4.0862, lr: 1.00e-02 fg/bg=(58/4038), time cost: 10.137277 rpn_cls: 0.7353, rpn_box: 1.1108, rcnn_cls: 2.2178, rcnn_box 0.0223

but it always changes to nan by the next print_log:

[session 1][epoch 1][iter 100/ 300] loss: nan, lr: 1.00e-02 fg/bg=(4096/0), time cost: 162.149158 rpn_cls: nan, rpn_box: nan, rcnn_cls: nan, rcnn_box nan

And sometimes the loss is nan already in the first print_log:

[session 1][epoch 1][iter 0/ 300] loss: nan, lr: 1.00e-03 fg/bg=(11/2037), time cost: 9.682561 rpn_cls: 0.7243, rpn_box: nan, rcnn_cls: 2.3749, rcnn_box 0.0000

I have also changed the lr to 1e-10, but it didn't help.

I don't know how to solve the problem.

I have checked the output of rpn_bbox_inside_weights and rpn_bbox_outside_weights; both are just full of zeros.
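
One way to find which operation first produces the nan is PyTorch's autograd anomaly detection (assuming a PyTorch version that has it, 1.0 or later; it slows training, so enable it only while debugging):

```python
import torch

# Anomaly detection re-runs the backward pass with extra checks and reports
# the forward op that produced the first nan/inf in the gradients.
torch.autograd.set_detect_anomaly(True)
```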

CXY573 avatar Oct 13 '18 03:10 CXY573

I have the same issue. The loss is not nan on the first iteration, but it becomes nan from the second iteration onwards.

DebasmitaGhose avatar Oct 16 '18 04:10 DebasmitaGhose

Maybe your x, y coordinates contain -1 values.
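
A minimal sketch of how to catch that, assuming VOC-style XML annotations; the path and the "-1" offset mirror what pascal_voc-style loaders do, so adjust both for your dataset:

```python
import glob
import xml.etree.ElementTree as ET

for xml_path in glob.glob("data/VOCdevkit2007/VOC2007/Annotations/*.xml"):
    for obj in ET.parse(xml_path).findall("object"):
        bb = obj.find("bndbox")
        # Loaders often subtract 1 to make coordinates 0-based, so an
        # annotated xmin/ymin of 0 becomes -1 and later breaks the transforms.
        x1 = float(bb.find("xmin").text) - 1
        y1 = float(bb.find("ymin").text) - 1
        if x1 < 0 or y1 < 0:
            print(f"{xml_path}: coordinate goes negative ({x1}, {y1})")
```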

Bigwode avatar Oct 17 '18 13:10 Bigwode

@DebasmitaGhose Have you solved it? I met the same problem. The first and second epochs are not nan, but it becomes nan from the third epoch.

Tianlock avatar Dec 24 '18 13:12 Tianlock

Yes, I did. A couple of things: check the function bbox_transform_batch(ex_rois, gt_rois) in lib/model/rpn/bbox_transform.py. It uses a logarithm to calculate targets_dw and targets_dh, and the log blows up (nan or -inf) as soon as the width or height it is given is non-positive. Check whether that happens; if it does, look through your annotations for boxes where x1 is greater than x2 or y1 is greater than y2 and fix them in your code. In general, check whether your annotations are in the form x1,y1,x2,y2 or x1,y1,w,h. This code expects annotations in the form x1,y1,x2,y2, but some datasets provide them in other forms, so keep an eye out for that.
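
A minimal sketch of that check, following the logic of bbox_transform_batch but simplified to unbatched numpy arrays (variable names are illustrative, not the repo's exact code):

```python
import numpy as np

def check_regression_targets(ex_rois, gt_rois):
    # Boxes are (x1, y1, x2, y2); widths/heights use the +1 convention
    # common in py-faster-rcnn-style code.
    ex_w = ex_rois[:, 2] - ex_rois[:, 0] + 1.0
    ex_h = ex_rois[:, 3] - ex_rois[:, 1] + 1.0
    gt_w = gt_rois[:, 2] - gt_rois[:, 0] + 1.0
    gt_h = gt_rois[:, 3] - gt_rois[:, 1] + 1.0
    # log of a non-positive width/height yields nan or -inf, which then
    # poisons every loss term downstream.
    if (gt_w <= 0).any() or (gt_h <= 0).any():
        raise ValueError("ground-truth box with x2 < x1 or y2 < y1")
    targets_dw = np.log(gt_w / ex_w)
    targets_dh = np.log(gt_h / ex_h)
    return targets_dw, targets_dh
```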

DebasmitaGhose avatar Dec 24 '18 18:12 DebasmitaGhose

Found a similar issue, check #594.

marcunzueta avatar Aug 04 '19 15:08 marcunzueta

I ran into the same problem and solved it by fixing the x, y coordinates. I also found that I needed to delete the cache in the data directory, otherwise the changes do not take effect.
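
For reference, a minimal sketch of clearing those cached roidb pickles, assuming the usual data/cache layout of py-faster-rcnn-style repos (adjust the path if yours differs):

```python
from pathlib import Path

# The dataset loader pickles the parsed roidb; if the old cache survives,
# the fixed annotations are never re-read.
for pkl in Path("data/cache").glob("*.pkl"):
    print(f"removing {pkl}")
    pkl.unlink()
```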

wanghaijie2017 avatar Sep 29 '19 13:09 wanghaijie2017

(quotes the original post)

Have you solved it? Is it normal that the loss becomes nan at the second iteration?

SunLeL avatar Feb 28 '20 07:02 SunLeL

I ran into the same problem and solved it by fixing the x, y coordinates. I also found that I needed to delete the cache in the data directory, otherwise the changes do not take effect.

love you

heypaprika avatar Apr 29 '21 12:04 heypaprika