TensorFlow2.0-Examples
total_loss: nan?
=> STEP 748 lr: 0.000598 giou_loss: 2.10 conf_loss: 6.18 prob_loss: 0.03 total_loss: 8.31
=> STEP 749 lr: 0.000599 giou_loss: 2.54 conf_loss: 6.51 prob_loss: 0.02 total_loss: 9.07
=> STEP 750 lr: 0.000600 giou_loss: nan conf_loss: 10.89 prob_loss: 0.06 total_loss: nan
=> STEP 751 lr: 0.000601 giou_loss: nan conf_loss: nan prob_loss: nan total_loss: nan
=> STEP 752 lr: 0.000602 giou_loss: nan conf_loss: nan prob_loss: nan total_loss: nan
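It looks like the NaN is caused by the learning rate rising continuously. Try lowering the learning rate. By the way, which dataset are you training on?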
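A face dataset, WIDER FACE. Shouldn't the learning rate be decreasing the whole time? @YunYang1994
I figured it out. My dataset is large: steps_per_epoch is 1250, so with a warmup of 10 epochs, warmup_steps is 12500. My global_steps has stayed below warmup_steps, so lr is still in the ramp-up phase.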
steps_per_epoch = len(trainset)
warmup_steps = cfg.TRAIN.WARMUP_EPOCHS * steps_per_epoch
total_steps = cfg.TRAIN.EPOCHS * steps_per_epoch
if global_steps < warmup_steps:
    # Linear warmup: lr rises from 0 toward LR_INIT.
    lr = global_steps / warmup_steps * cfg.TRAIN.LR_INIT
else:
    # Cosine decay: lr falls from LR_INIT to LR_END.
    lr = cfg.TRAIN.LR_END + 0.5 * (cfg.TRAIN.LR_INIT - cfg.TRAIN.LR_END) * (
        (1 + tf.cos((global_steps - warmup_steps) / (total_steps - warmup_steps) * np.pi))
    )
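To check where a run sits on this schedule without opening TensorBoard, the same formula can be evaluated in plain Python. This is only a sketch: the function name `warmup_cosine_lr` is made up for illustration, and the total of 50000 steps (40 epochs of 1250 steps) is an assumed value, not something stated in this thread. The 12500 warmup steps and the 1e-4 / 1e-6 rates are the numbers quoted in this thread.

```python
import math


def warmup_cosine_lr(global_step, warmup_steps, total_steps, lr_init, lr_end):
    """Linear warmup to lr_init, then cosine decay to lr_end.

    Hypothetical helper mirroring the snippet above.
    Assumes total_steps > warmup_steps.
    """
    if global_step < warmup_steps:
        # Still warming up: lr grows linearly with the step count.
        return global_step / warmup_steps * lr_init
    # Fraction of the decay phase completed, in [0, 1].
    progress = (global_step - warmup_steps) / (total_steps - warmup_steps)
    return lr_end + 0.5 * (lr_init - lr_end) * (1 + math.cos(progress * math.pi))


if __name__ == "__main__":
    warmup_steps = 10 * 1250   # 10 warmup epochs x 1250 steps per epoch
    total_steps = 40 * 1250    # assumed total, for illustration only
    for step in (0, 750, 12500, 30000, 50000):
        lr = warmup_cosine_lr(step, warmup_steps, total_steps, 1e-4, 1e-6)
        print(f"step {step:>6}: lr = {lr:.3e}")
```

With a warmup this long, step 750 is only 6% of the way up the ramp, which is why the logged lr is still climbing when the loss diverges.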
Just open TensorBoard and you'll see.
__C.TRAIN.LR_INIT = 1e-4
__C.TRAIN.LR_END = 1e-6
__C.TRAIN.WARMUP_EPOCHS = 4
Give these a try?
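What is warmup actually good for? I was planning to set it to 0.
Seriously, what is it good for? Read this yourself: https://arxiv.org/pdf/1812.01187.pdf
If I restore the weights from the last run and continue training, do I still need warmup? I'm a beginner coming from outside the field, so I really should find time to read up on this 😂
If the loss doesn't become NaN, you don't need warmup.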
I'm having the same issue. Could I please get an English explanation?
@YunYang1994 could I get a quick English translation please?
Any update on this?
I am facing the same problem. Any updates on this?
I solved the issue by reducing the learning rate and using warmup epochs. The learning rate slowly increases and then decreases, and never gets too high. This will prevent the model from diverging (NaN loss). Hope this helps!
If giou_loss is the first value to become NaN, there is likely something wrong in the giou function. In my experiment, I found that union_area was 0, so the IoU computation divided by zero. You can debug this by editing the giou function. My workaround, which does not address the root cause (I haven't found it yet), is to add a very small number at the end of this line:
union_area = boxes1_area + boxes2_area - inter_area + 1e-10
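Why the small constant helps can be seen in a standalone NumPy sketch. This is not the repository's giou function: the name `bbox_iou`, the `eps` parameter, and the (x_min, y_min, x_max, y_max) box layout are assumptions made for illustration. Only the `+ 1e-10` on `union_area` comes from the comment above.

```python
import numpy as np


def bbox_iou(boxes1, boxes2, eps=1e-10):
    """IoU of boxes given as (x_min, y_min, x_max, y_max).

    Hypothetical helper; eps guards the division when both boxes
    have zero area, which would otherwise give 0 / 0 = NaN.
    """
    boxes1 = np.asarray(boxes1, dtype=np.float64)
    boxes2 = np.asarray(boxes2, dtype=np.float64)

    boxes1_area = (boxes1[..., 2] - boxes1[..., 0]) * (boxes1[..., 3] - boxes1[..., 1])
    boxes2_area = (boxes2[..., 2] - boxes2[..., 0]) * (boxes2[..., 3] - boxes2[..., 1])

    # Corners of the overlap rectangle; clamp negative extents to zero.
    left_up = np.maximum(boxes1[..., :2], boxes2[..., :2])
    right_down = np.minimum(boxes1[..., 2:], boxes2[..., 2:])
    inter = np.maximum(right_down - left_up, 0.0)
    inter_area = inter[..., 0] * inter[..., 1]

    union_area = boxes1_area + boxes2_area - inter_area + eps
    return inter_area / union_area


if __name__ == "__main__":
    degenerate = [0.0, 0.0, 0.0, 0.0]  # zero-width, zero-height box
    print("with eps:   ", bbox_iou(degenerate, degenerate))
    with np.errstate(invalid="ignore", divide="ignore"):
        print("without eps:", bbox_iou(degenerate, degenerate, eps=0.0))
```

The guard only hides the symptom. If degenerate boxes are reaching the loss, it may still be worth checking the annotations or the predicted widths and heights to see where zero-area boxes come from.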
I already tried this, and it seems to be working fine. Thanks!