
reward turns to nan

iriscxy opened this issue on Mar 22 '18 · 8 comments

As training moves on, the reward and the loss all become 'nan'. Did this problem occur with your data?

A -> B
('[s]', 'Old power means the fossil ##AT##-##AT## nuclear energies : oil , natural gas , coal and uranium exploited in centralised , monopolistic energy systems supported by short ##AT##-##AT## term thinking politics .')
('[smid]', ' Interaktion Fachkompetenz Fachkompetenz Fachkompetenz ... (the token Fachkompetenz repeats ~50 times) ... Schecks')
r1= nan r2= nan rk= nan fw_loss= nan bw_loss= nan
A loss = nan B loss = nan

iriscxy · Mar 22 '18

Do the reward and loss become 'nan' every time? At which step? The two pre-trained NMT models affect the result a lot. How about pre-training them longer, or trying other data, and seeing what happens?

JCly-rikiu · Mar 26 '18

I met the same problem... I pre-trained on 20% of the data for 100 epochs, and also tried a better dataset, but it still failed. My tutor told me to lower the learning rate from 1e-3 to 1e-5; it then ran longer than before but still failed after about 200 steps, and now 1e-6 is running... So is a lower learning rate really useful? It seems 1e-3 fails very quickly (about 50 steps), and lowering the lr only postpones the failure. And what does the nan mean? Does the language model fail? Does the NMT fail? Thanks for your patience :)

wky9710 · Apr 09 '18

I finally found that I hit the same problem as you: when it tries to generate words in beam(), new_hyp_scores turns to nan at about 1000 steps. Then I lowered the learning rate from 1e-3 to 1e-5 as suggested above, and it ran longer than before. I think this result shows the NMT model must be trained more. As a next step I want to change the optimizer, e.g. to Adam. If you find a useful method, please tell me how you did it. Thank you :)
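
For anyone debugging this, here is a minimal sketch of a guard, assuming PyTorch; the helper name check_finite and its placement inside beam() are hypothetical, not this repo's actual code:

```python
import torch

def check_finite(scores: torch.Tensor, step: int) -> None:
    """Raise as soon as beam scores go non-finite, so the failing step
    is visible instead of nan silently propagating through training."""
    if not torch.isfinite(scores).all():
        raise RuntimeError(f'non-finite beam scores at step {step}: {scores}')

# Hypothetical usage inside beam(), right after new_hyp_scores is computed:
# check_finite(new_hyp_scores, step)
```

Enabling torch.autograd.set_detect_anomaly(True) during a debug run can also point at the op that first produced the nan, at the cost of much slower training.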

yangkexin · Apr 10 '18

We tried Adam before, but the result was bad. We think the reason may be that Adam changes the learning rate constantly, while the translation loss is not smooth; that makes the training process go out of control, so the loss can't decrease.

JCly-rikiu · Apr 16 '18

After several steps (about 20) with a learning rate of 1e-6 (which should be small enough... 1e-3 was also tried, and the loss turned to nan after 2 steps, even before saving a model...), the loss turns to nan again... I've tried to retrain the NMT model for about 1,000,000 iterations, reaching a BLEU of about 33.7 for model A and 15.5 for model B, but it just won't work... Does this mean that whether the method works depends heavily on the data? Or on the NMT model?

wky9710 · May 05 '18

  1. Yes, this method depends heavily on the data. We have read a review that mentions this.
  2. We think the nan loss is probably due to exploding gradients, but we no longer saw nan losses after we changed the optimizer to SGD (a sketch of that setup follows below).
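
A minimal sketch of that setup, assuming PyTorch; model, training_step, and the clipping threshold of 5.0 are placeholders, not this repo's actual names or values:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 10)  # stand-in for the NMT model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

def training_step(x, y, loss_fn=nn.MSELoss()):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    # Clip the gradient norm before the update: the usual guard
    # against the exploding gradients suspected above.
    nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
    optimizer.step()
    return loss.item()
```

Gradient clipping works with Adam as well, so it may be worth trying before giving up on adaptive optimizers entirely.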

JCly-rikiu · May 08 '18

> After several steps (about 20) with a learning rate of 1e-6, the loss turns to nan again... Does this mean that whether the method works depends heavily on the data? Or on the NMT model?

I think the nan problem comes from the reward calculation: the reward is divided by its std, but the std can be zero, so changing the reward formula may solve the problem.
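
A minimal sketch of that fix, assuming PyTorch; standardize_rewards and the eps value are hypothetical, not this repo's actual reward code:

```python
import torch

def standardize_rewards(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Standardize a batch of rewards. The eps term keeps the division
    finite when every sample in the batch receives the same reward
    (std == 0), which is exactly the case that produces nan."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# A constant-reward batch no longer produces nan:
print(standardize_rewards(torch.tensor([0.5, 0.5, 0.5])))  # tensor([0., 0., 0.])
```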

oceanypt · Jan 08 '19

I also met this problem. Has anyone found a method to solve it?

fuzihaofzh · Jul 25 '19