
Loss became 'nan' after 10000~20000 epochs

Open · LJSthu opened this issue on Jul 16, 2018 · 7 comments

LJSthu · Jul 16 '18 12:07

Hello, thanks for the issue. I am not able to run that many epochs on my personal computer, so I can't reproduce it. However, since the loss involves the log of a probability, it can produce NaN when that probability gets very close to zero.

One solution is to use log(P + epsilon), with epsilon a small positive constant. I think I already do that, but maybe you can increase epsilon.
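A minimal sketch of that epsilon trick, assuming PyTorch tensors (the names safe_log and pi_probs are placeholders for illustration, not the actual variables in the repository):

```python
import torch

def safe_log(p, epsilon=1e-5):
    # Offset the probability before taking the log so that values near zero
    # do not produce -inf (and NaN gradients on the backward pass).
    return torch.log(p + epsilon)

# Hypothetical usage inside the reconstruction loss; increase epsilon if NaN
# still appears late in training.
# loss = -safe_log(pi_probs).mean()
```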

Anyway, you may not need such a huge number of epochs; the model already generates very acceptable drawings after 4000 epochs.

alexis-jacq · Jul 17 '18 10:07

Thanks for your reply. I am training the network on the kanji dataset, and the results are still not acceptable even after 20000 epochs. Have you ever run into this problem?

LJSthu · Jul 17 '18 12:07

I did not try that dataset. It's a harder task, since if you miss a segment or get an angle wrong, you miss the character, but I think David Ha successfully trained sketch-rnn on kanji. Another thing you can try is to change the dropout factor (for the cat drawings I used p=0.9, but it should be much smaller, like 0.1). Let me know.
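As an illustration only (the hyperparameter class and field names below are placeholders, not the ones used in this repository), lowering the dropout factor could look like this:

```python
import torch.nn as nn

class HParams:
    # Placeholder hyperparameters; the repository keeps these in its own
    # settings object and the exact names may differ.
    enc_hidden_size = 256
    dropout = 0.1  # lowered from the 0.9 used for the cat drawings

hp = HParams()

# With a single-layer LSTM, PyTorch only applies the `dropout` argument
# between stacked layers, so an explicit Dropout module on the LSTM outputs
# is one way to make the smaller factor actually take effect.
encoder_rnn = nn.LSTM(input_size=5, hidden_size=hp.enc_hidden_size, bidirectional=True)
output_dropout = nn.Dropout(p=hp.dropout)
```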

PS: can you show an example of a non-acceptable sample and the corresponding target?

alexis-jacq · Jul 17 '18 17:07

I ran your code and the author's code on the kanji dataset, but I found that the KL loss in your code was increasing while it was decreasing with the author's code.

LJSthu · Aug 15 '18 13:08

"but I found that the KL loss in your code was increasing"

Interesting, did you observe this after a high number of epochs, or right from the beginning? (In which case it could be a simple sign mistake, but it would be strange that I could sample nice cats while the KL loss increases.)

alexis-jacq · Aug 15 '18 20:08

It was increasing from the beginning. I think you did a good job and the code is quite clear. I think the way you calculated the KL loss was right and consistent with the paper, but it is still weird. Maybe you could sample nice cats because you were effectively only optimizing the reconstruction loss? By the way, I think there is something wrong with self.eta_step = 1-(1-hp.eta_min)*hp.R. I think it should be self.eta_step = 1-(1-hp.eta_min)*hp.R**step, i.e. hp.R raised to the power of the current training step.
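A minimal sketch of the corrected annealing schedule (the numeric values are placeholders in the spirit of the sketch-rnn paper; only the update rule matters):

```python
class HParams:
    # Placeholder values: eta_min is the starting KL weight,
    # R a decay constant just below 1.
    eta_min = 0.01
    R = 0.99995

hp = HParams()

def eta_step(step):
    # The KL weight grows from eta_min toward 1 as training progresses;
    # with a constant expression (no dependence on step) it would never change.
    return 1 - (1 - hp.eta_min) * hp.R ** step

# Example: the KL weight after 0, 10000 and 100000 training steps.
for s in (0, 10000, 100000):
    print(s, round(eta_step(s), 4))
```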

LJSthu · Aug 16 '18 01:08

@LJSthu I agree, I think there's a minor error in the code. Since both hp.eta_min and hp.R are constants, eta_step is itself a constant and never gets updated during training.

billstark · Oct 21 '18 14:10