
total_loss: nan?

dvlee1024 opened this issue 6 years ago · 16 comments

=> STEP  748   lr: 0.000598   giou_loss: 2.10   conf_loss: 6.18   prob_loss: 0.03   total_loss: 8.31
=> STEP  749   lr: 0.000599   giou_loss: 2.54   conf_loss: 6.51   prob_loss: 0.02   total_loss: 9.07
=> STEP  750   lr: 0.000600   giou_loss:  nan   conf_loss: 10.89   prob_loss: 0.06   total_loss:  nan
=> STEP  751   lr: 0.000601   giou_loss:  nan   conf_loss:  nan   prob_loss:  nan   total_loss:  nan
=> STEP  752   lr: 0.000602   giou_loss:  nan   conf_loss:  nan   prob_loss:  nan   total_loss:  nan

dvlee1024 avatar Jul 23 '19 03:07 dvlee1024

It looks like the NaN is caused by the learning rate rising the whole time. You can try lowering the learning rate. By the way, which dataset are you training on?

YunYang1994 avatar Jul 23 '19 03:07 YunYang1994

It looks like the NaN is caused by the learning rate rising the whole time. You can try lowering the learning rate. By the way, which dataset are you training on?

Faces, WIDER FACE. Isn't the learning rate supposed to keep decreasing? @YunYang1994

dvlee1024 avatar Jul 23 '19 04:07 dvlee1024

I see now. My dataset is large: steps_per_epoch is 1250, so with a warmup of 10 epochs, warmup_steps is 12500. My global_steps has stayed below warmup_steps the whole time, so the lr has been in the rising phase throughout.

steps_per_epoch = len(trainset)
warmup_steps = cfg.TRAIN.WARMUP_EPOCHS * steps_per_epoch
total_steps = cfg.TRAIN.EPOCHS * steps_per_epoch
if global_steps < warmup_steps:
    # Linear warmup: lr climbs from 0 up to LR_INIT.
    lr = global_steps / warmup_steps * cfg.TRAIN.LR_INIT
else:
    # Cosine decay: lr falls from LR_INIT down to LR_END.
    lr = cfg.TRAIN.LR_END + 0.5 * (cfg.TRAIN.LR_INIT - cfg.TRAIN.LR_END) * (
        1 + tf.cos((global_steps - warmup_steps) / (total_steps - warmup_steps) * np.pi)
    )

dvlee1024 avatar Jul 23 '19 05:07 dvlee1024
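The warmup-then-cosine schedule above can be sketched as a standalone function to see the effect dvlee1024 describes. This is a minimal sketch: the `cfg` values are replaced by plain arguments, and the defaults below are illustrative, not necessarily the repo's.

```python
import numpy as np

def get_lr(global_step, steps_per_epoch, warmup_epochs=2, total_epochs=30,
           lr_init=1e-3, lr_end=1e-6):
    """Linear warmup followed by cosine decay, mirroring the snippet above."""
    warmup_steps = warmup_epochs * steps_per_epoch
    total_steps = total_epochs * steps_per_epoch
    if global_step < warmup_steps:
        # Linear ramp from 0 up to lr_init during warmup.
        return global_step / warmup_steps * lr_init
    # Cosine decay from lr_init down to lr_end afterwards.
    progress = (global_step - warmup_steps) / (total_steps - warmup_steps)
    return lr_end + 0.5 * (lr_init - lr_end) * (1 + np.cos(progress * np.pi))

# With steps_per_epoch=1250 and warmup_epochs=10 (the setting in this thread),
# step 750 is still deep inside the 12500-step warmup, so the lr is still rising:
print(get_lr(750, steps_per_epoch=1250, warmup_epochs=10, lr_init=1e-3))  # 6e-05
```

This makes the diagnosis concrete: a large dataset plus a long warmup means the lr keeps climbing for thousands of steps, and if it climbs too high the loss can diverge to NaN.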

Just open TensorBoard and you'll see.

YunYang1994 avatar Jul 23 '19 05:07 YunYang1994

__C.TRAIN.LR_INIT             = 1e-4
__C.TRAIN.LR_END              = 1e-6
__C.TRAIN.WARMUP_EPOCHS       = 4

Give these a try?

YunYang1994 avatar Jul 23 '19 05:07 YunYang1994

__C.TRAIN.LR_INIT             = 1e-4
__C.TRAIN.LR_END              = 1e-6
__C.TRAIN.WARMUP_EPOCHS       = 4

Give these a try?

Actually, what does warmup even do? I was planning to set it to 0.

dvlee1024 avatar Jul 23 '19 05:07 dvlee1024

Seriously? What's it for? See for yourself: https://arxiv.org/pdf/1812.01187.pdf

YunYang1994 avatar Jul 23 '19 07:07 YunYang1994

If I restore the last weights and continue training, do I still need warmup? I'm a beginner at this; I really should find time to read some books 😂

dvlee1024 avatar Jul 23 '19 09:07 dvlee1024

If the loss doesn't turn NaN, you don't need warmup.

YunYang1994 avatar Jul 23 '19 13:07 YunYang1994

I'm having the same issue. Could I please get an english explanation?

SinclairHudson avatar Aug 04 '19 18:08 SinclairHudson

@YunYang1994 could I get a quick english translation please?

SinclairHudson avatar Aug 14 '19 13:08 SinclairHudson

Any update on this?

aHandToHelp avatar Oct 09 '19 21:10 aHandToHelp

I am facing same problem, any updates on this?

mkarlan avatar Feb 13 '20 13:02 mkarlan

I solved the issue by reducing the learning rate and using warmup epochs. The learning rate slowly increases and then decreases, and never gets too high. This will prevent the model from diverging (NaN loss). Hope this helps!

SinclairHudson avatar Feb 14 '20 12:02 SinclairHudson

Any update on this?

If your giou_loss is the first to turn NaN, there is likely something wrong in the defined giou function. In my experiment I found union_area = 0, so the IoU became infinity. You can debug this by editing the giou function. My workaround (since I haven't found the root cause of this bug) is to add a small enough number at this place:

union_area = boxes1_area + boxes2_area - inter_area + 1e-10

forever208 avatar Jun 07 '20 03:06 forever208
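The epsilon guard above can be shown in a self-contained IoU helper. This is a sketch, not the repo's actual `bbox_giou` function: the `[x1, y1, x2, y2]` box layout and the `bbox_iou_safe` name are assumptions for illustration.

```python
def bbox_iou_safe(box1, box2, eps=1e-10):
    """IoU of two boxes in [x1, y1, x2, y2] form, with an epsilon added to the
    union (the fix from the comment above) so a zero union cannot yield inf/NaN."""
    area1 = max(box1[2] - box1[0], 0) * max(box1[3] - box1[1], 0)
    area2 = max(box2[2] - box2[0], 0) * max(box2[3] - box2[1], 0)
    # Intersection rectangle; an empty overlap gives zero width or height.
    iw = max(min(box1[2], box2[2]) - max(box1[0], box2[0]), 0)
    ih = max(min(box1[3], box2[3]) - max(box1[1], box2[1]), 0)
    inter_area = iw * ih
    union_area = area1 + area2 - inter_area + eps
    return inter_area / union_area

# A degenerate pair: both boxes have zero area, so the union would be 0.
print(bbox_iou_safe([0, 0, 0, 0], [1, 1, 1, 1]))  # 0.0 instead of a NaN/inf
```

Zero-area boxes can appear in practice when an annotation is degenerate or an augmentation crops a box down to nothing, which is one plausible way union_area ends up 0.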

Any update on this?

If your giou_loss is the first to turn NaN, there is likely something wrong in the defined giou function. In my experiment I found union_area = 0, so the IoU became infinity. You can debug this by editing the giou function. My workaround (since I haven't found the root cause of this bug) is to add a small enough number at this place:

union_area = boxes1_area + boxes2_area - inter_area + 1e-10

Already tried this, and it seems to work fine. Thanks!

IqbalLx avatar Sep 01 '20 08:09 IqbalLx