lffd-pytorch

train loss change from normal to NAN

Open dtiny opened this issue 5 years ago • 15 comments

Provided code: python configuration_10_320_20L_5scales_v2.py. Provided data: widerface_train_data_gt_8.pkl. At the beginning, the training loss converges normally, but at around iteration 3400 it diverges to NaN.
How can this problem be solved?

dtiny avatar Oct 29 '19 07:10 dtiny
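A minimal guard like the following, added right after the loss computation, would catch the divergence on the exact batch where it happens. This helper is hypothetical and not part of lffd-pytorch; all names are illustrative:

```python
import torch

def check_loss_finite(loss: torch.Tensor, iteration: int) -> None:
    """Raise as soon as the training loss stops being finite.

    Hypothetical helper (not part of lffd-pytorch); call it right after
    the loss is computed so the offending batch can still be inspected.
    """
    if not torch.isfinite(loss):
        raise RuntimeError(
            f"loss diverged to {loss.item()} at iteration {iteration}")
```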

I ran into the same problem.

coderhss avatar Dec 01 '19 11:12 coderhss

same problem +1

xinyikb avatar Dec 16 '19 13:12 xinyikb

Have you found the inference code?

Brain-Lee avatar Dec 17 '19 06:12 Brain-Lee

The code has problems:

  1. The loss is written incorrectly, in the hard example mining part.
  2. The gray regions are also not used in the loss.

Fix 1; if that is not enough, also lower the initial learning rate. Fixing 2 is optional.

120276215 avatar Dec 17 '19 08:12 120276215
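To make the diagnosis above concrete: below is a hedged sketch of online hard example mining over an (N, 2, H, W) softmax score map, with gray regions masked out of the loss. The function name, mask names, shapes, and the 10:1 negative ratio are assumptions for illustration, not the actual lffd-pytorch code:

```python
import torch

def ohem_softmax_loss(pred_score_softmax: torch.Tensor,
                      pos_mask: torch.Tensor,
                      neg_mask: torch.Tensor,
                      gray_mask: torch.Tensor,
                      neg_ratio: int = 10) -> torch.Tensor:
    """Online hard example mining over an (N, 2, H, W) softmax score map.

    Channel 0 is background, channel 1 is face; pos/neg/gray masks are bool
    tensors of shape (N, H, W). Gray regions are excluded from the loss.
    Illustrative only, not the actual lffd-pytorch implementation.
    """
    eps = 1e-8
    # Per-location negative loss: -log P(background).
    neg_loss = -torch.log(pred_score_softmax[:, 0, :, :] + eps)
    # Zero out everything that is not a clean negative, so ranking by
    # loss value can never select positives or gray regions.
    neg_loss = torch.where(neg_mask & ~gray_mask, neg_loss,
                           torch.zeros_like(neg_loss))
    num_pos = int(pos_mask.sum().item())
    k = min(max(neg_ratio * num_pos, 1), neg_loss.numel())
    hard_neg_loss, _ = neg_loss.flatten().topk(k)  # hardest negatives
    # Positive loss: -log P(face) at labelled positives.
    pos_loss = -torch.log(pred_score_softmax[:, 1, :, :] + eps)[pos_mask]
    return (pos_loss.sum() + hard_neg_loss.sum()) / max(num_pos, 1)
```

Hard negatives here are the background locations where the network most confidently predicts a face; keeping them, and only them, in the negative term is what point 1 above is about.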

Same problem here. +1

suyue6 avatar Dec 18 '19 07:12 suyue6

> The code has problems:
>
>   1. The loss is written incorrectly, in the hard example mining part.
>   2. The gray regions are also not used in the loss.
>
> Fix 1; if that is not enough, also lower the initial learning rate. Fixing 2 is optional.

Hi, could you explain specifically how to change it? Thanks!

suyue6 avatar Dec 18 '19 07:12 suyue6

Has anyone solved this problem?

Jialeen avatar Dec 20 '19 08:12 Jialeen

> The code has problems:
>
>   1. The loss is written incorrectly, in the hard example mining part.
>   2. The gray regions are also not used in the loss.
>
> Fix 1; if that is not enough, also lower the initial learning rate. Fixing 2 is optional.
>
> Hi, could you explain specifically how to change it? Thanks!

https://github.com/becauseofAI/lffd-pytorch/blob/f7da857f7ea939665b81d7bfedb98d02f4147723/ChasingTrainFramework_GeneralOneClassDetection/loss_layer_farm/loss.py#L112

Change it to: torch.ones_like(pred_score_softmax[:, 1, :, :]).add(1))

120276215 avatar Dec 20 '19 10:12 120276215
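The idea behind that one-line change, as far as the thread reveals it: when ranking negatives by predicted background probability, every non-negative location needs a sentinel value that no softmax output can reach, so it can never be selected among the "hardest" (lowest-probability) negatives. A hedged, self-contained illustration of the sentinel trick follows; the masks and the choice of topk are assumptions, not the repo's exact code:

```python
import torch

# Toy score map: softmax over 2 classes (background, face) per location.
pred_score_softmax = torch.softmax(torch.randn(1, 2, 4, 4), dim=1)
neg_mask = torch.rand(1, 4, 4) > 0.5        # illustrative negative-sample mask
bg_prob = pred_score_softmax[:, 0, :, :]    # P(background) per location

# Non-negative locations get the sentinel 2.0 (ones_like(...).add(1)),
# which no softmax probability can reach, so topk(largest=False) can
# never pick them when selecting the hardest (lowest P(background)) negatives.
ranked = torch.where(neg_mask, bg_prob,
                     torch.ones_like(pred_score_softmax[:, 1, :, :]).add(1))
k = min(5, int(neg_mask.sum().item()))      # never ask for more than exist
hardest, _ = ranked.flatten().topk(k, largest=False)
```

If the filler value can be confused with a real probability, non-negative locations can leak into the top-k and corrupt the loss, which would be one plausible route to the NaN divergence reported here.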

> The code has problems:
>
>   1. The loss is written incorrectly, in the hard example mining part.
>   2. The gray regions are also not used in the loss.
>
> Fix 1; if that is not enough, also lower the initial learning rate. Fixing 2 is optional.
>
> Hi, could you explain specifically how to change it? Thanks!
>
> https://github.com/becauseofAI/lffd-pytorch/blob/f7da857f7ea939665b81d7bfedb98d02f4147723/ChasingTrainFramework_GeneralOneClassDetection/loss_layer_farm/loss.py#L112
>
> Change it to: torch.ones_like(pred_score_softmax[:, 1, :, :]).add(1))

After making this change, the loss still diverges to NaN.

Jialeen avatar Dec 23 '19 01:12 Jialeen

The same problem when training.

chenjun2hao avatar Jan 09 '20 06:01 chenjun2hao

@becauseofAI Any suggestions?

deep-practice avatar Feb 16 '20 11:02 deep-practice

Did anyone find any solution?

Manideep08 avatar Sep 08 '20 12:09 Manideep08

Anyone found the solution to this problem?

junaiddk avatar Nov 13 '20 04:11 junaiddk

I'm at a loss for words; this code seems to have been released just to trip people up.

afterimagex avatar Dec 25 '20 07:12 afterimagex

Try reducing the learning rate (variable name param_learning_rate) to 0.01 in the configuration file. If you are using V2, that is configuration_10_320_20L_5scales_v2.py. This let me train for 2,000,000 iterations. EDIT: I see that user 120276215 already gave the same advice, so credit to them.

CodexForster avatar Mar 23 '21 15:03 CodexForster
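For completeness, the workaround above amounts to a one-line edit in the configuration file. A hypothetical excerpt; only param_learning_rate and the file name come from this thread, the comment is illustrative:

```python
# configuration_10_320_20L_5scales_v2.py (excerpt)
param_learning_rate = 0.01  # lowered initial learning rate to avoid NaN loss
```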