
total_loss: nan?

dvlee1024 opened this issue 6 years ago · 16 comments

=> STEP  748   lr: 0.000598   giou_loss: 2.10   conf_loss: 6.18   prob_loss: 0.03   total_loss: 8.31
=> STEP  749   lr: 0.000599   giou_loss: 2.54   conf_loss: 6.51   prob_loss: 0.02   total_loss: 9.07
=> STEP  750   lr: 0.000600   giou_loss:  nan   conf_loss: 10.89   prob_loss: 0.06   total_loss:  nan
=> STEP  751   lr: 0.000601   giou_loss:  nan   conf_loss:  nan   prob_loss:  nan   total_loss:  nan
=> STEP  752   lr: 0.000602   giou_loss:  nan   conf_loss:  nan   prob_loss:  nan   total_loss:  nan

dvlee1024 avatar Jul 23 '19 03:07 dvlee1024

It looks like the NaN is caused by the learning rate rising the whole time. You can try lowering the learning rate. By the way, which dataset are you training on?

YunYang1994 avatar Jul 23 '19 03:07 YunYang1994

It looks like the NaN is caused by the learning rate rising the whole time. You can try lowering the learning rate. By the way, which dataset are you training on?

Faces, WIDER FACE. Isn't the learning rate supposed to keep decreasing? @YunYang1994

dvlee1024 avatar Jul 23 '19 04:07 dvlee1024

I see now. My dataset is large: steps_per_epoch is 1250, so with a warmup of 10 epochs, warmup_steps is 12500. My global_steps has stayed below warmup_steps the whole time, so the lr has been in the rising phase throughout.

steps_per_epoch = len(trainset)
warmup_steps = cfg.TRAIN.WARMUP_EPOCHS * steps_per_epoch
total_steps = cfg.TRAIN.EPOCHS * steps_per_epoch
if global_steps < warmup_steps:
    # Linear warmup: lr climbs from 0 up to LR_INIT.
    lr = global_steps / warmup_steps * cfg.TRAIN.LR_INIT
else:
    # Cosine decay: lr falls from LR_INIT down to LR_END.
    lr = cfg.TRAIN.LR_END + 0.5 * (cfg.TRAIN.LR_INIT - cfg.TRAIN.LR_END) * (
        1 + tf.cos((global_steps - warmup_steps) / (total_steps - warmup_steps) * np.pi)
    )

dvlee1024 avatar Jul 23 '19 05:07 dvlee1024
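The warmup-then-cosine schedule above can be sketched as a standalone function to see the effect dvlee1024 describes. This is a minimal sketch: the `cfg` values are replaced by plain arguments, and the defaults below are illustrative, not necessarily the repo's.

```python
import numpy as np

def get_lr(global_step, steps_per_epoch, warmup_epochs=2, total_epochs=30,
           lr_init=1e-3, lr_end=1e-6):
    """Linear warmup followed by cosine decay, mirroring the snippet above."""
    warmup_steps = warmup_epochs * steps_per_epoch
    total_steps = total_epochs * steps_per_epoch
    if global_step < warmup_steps:
        # Linear ramp from 0 up to lr_init during warmup.
        return global_step / warmup_steps * lr_init
    # Cosine decay from lr_init down to lr_end afterwards.
    progress = (global_step - warmup_steps) / (total_steps - warmup_steps)
    return lr_end + 0.5 * (lr_init - lr_end) * (1 + np.cos(progress * np.pi))

# With steps_per_epoch=1250 and warmup_epochs=10 (the setting in this thread),
# step 750 is still deep inside the 12500-step warmup, so the lr is still rising:
print(get_lr(750, steps_per_epoch=1250, warmup_epochs=10, lr_init=1e-3))  # 6e-05
```

This makes the diagnosis concrete: a large dataset plus a long warmup means the lr keeps climbing for thousands of steps, and if it climbs too high the loss can diverge to NaN.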

Just open TensorBoard and you'll see.

YunYang1994 avatar Jul 23 '19 05:07 YunYang1994

__C.TRAIN.LR_INIT             = 1e-4
__C.TRAIN.LR_END              = 1e-6
__C.TRAIN.WARMUP_EPOCHS       = 4

Give these a try?

YunYang1994 avatar Jul 23 '19 05:07 YunYang1994

__C.TRAIN.LR_INIT             = 1e-4
__C.TRAIN.LR_END              = 1e-6
__C.TRAIN.WARMUP_EPOCHS       = 4

Give these a try?

Actually, what does warmup even do? I was planning to set it to 0.

dvlee1024 avatar Jul 23 '19 05:07 dvlee1024

Seriously? What's it for? See for yourself: https://arxiv.org/pdf/1812.01187.pdf

YunYang1994 avatar Jul 23 '19 07:07 YunYang1994

If I restore the last weights and continue training, do I still need warmup? I'm a beginner at this; I really should find time to read some books 😂

dvlee1024 avatar Jul 23 '19 09:07 dvlee1024

If the loss doesn't turn NaN, you don't need warmup.

YunYang1994 avatar Jul 23 '19 13:07 YunYang1994

I'm having the same issue. Could I please get an english explanation?

SinclairHudson avatar Aug 04 '19 18:08 SinclairHudson

@YunYang1994 could I get a quick english translation please?

SinclairHudson avatar Aug 14 '19 13:08 SinclairHudson

Any update on this?

aHandToHelp avatar Oct 09 '19 21:10 aHandToHelp

I am facing same problem, any updates on this?

mkarlan avatar Feb 13 '20 13:02 mkarlan

I solved the issue by reducing the learning rate and using warmup epochs. The learning rate slowly increases and then decreases, and never gets too high. This will prevent the model from diverging (NaN loss). Hope this helps!

SinclairHudson avatar Feb 14 '20 12:02 SinclairHudson

Any update on this?

If your giou_loss is the first to turn NaN, there is likely something wrong in the defined giou function. In my experiment I found union_area = 0, so the IoU became infinity. You can debug this by editing the giou function. My workaround (since I haven't found the root cause of this bug) is to add a small enough number at this place:

union_area = boxes1_area + boxes2_area - inter_area + 1e-10

forever208 avatar Jun 07 '20 03:06 forever208
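The epsilon guard above can be shown in a self-contained IoU helper. This is a sketch, not the repo's actual `bbox_giou` function: the `[x1, y1, x2, y2]` box layout and the `bbox_iou_safe` name are assumptions for illustration.

```python
def bbox_iou_safe(box1, box2, eps=1e-10):
    """IoU of two boxes in [x1, y1, x2, y2] form, with an epsilon added to the
    union (the fix from the comment above) so a zero union cannot yield inf/NaN."""
    area1 = max(box1[2] - box1[0], 0) * max(box1[3] - box1[1], 0)
    area2 = max(box2[2] - box2[0], 0) * max(box2[3] - box2[1], 0)
    # Intersection rectangle; an empty overlap gives zero width or height.
    iw = max(min(box1[2], box2[2]) - max(box1[0], box2[0]), 0)
    ih = max(min(box1[3], box2[3]) - max(box1[1], box2[1]), 0)
    inter_area = iw * ih
    union_area = area1 + area2 - inter_area + eps
    return inter_area / union_area

# A degenerate pair: both boxes have zero area, so the union would be 0.
print(bbox_iou_safe([0, 0, 0, 0], [1, 1, 1, 1]))  # 0.0 instead of a NaN/inf
```

Zero-area boxes can appear in practice when an annotation is degenerate or an augmentation crops a box down to nothing, which is one plausible way union_area ends up 0.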

Any update on this?

If your giou_loss is the first to turn NaN, there is likely something wrong in the defined giou function. In my experiment I found union_area = 0, so the IoU became infinity. You can debug this by editing the giou function. My workaround (since I haven't found the root cause of this bug) is to add a small enough number at this place:

union_area = boxes1_area + boxes2_area - inter_area + 1e-10

Already tried this, and it seems to work fine. Thanks!

IqbalLx avatar Sep 01 '20 08:09 IqbalLx