
Loss became 'nan' after 10000~20000 epochs

Open · LJSthu opened this issue on Jul 16, 2018 · 7 comments

LJSthu · Jul 16 '18 12:07

Hello, thanks for the issue. I am not able to run that many epochs on my personal computer, so I can't reproduce it. However, since the loss involves the log of a probability, it can produce NaN when that probability gets very close to zero.

One solution is to use log(P + epsilon), with epsilon a small positive constant. I think I already do that, but maybe you can increase epsilon.
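A minimal sketch of that epsilon trick, assuming PyTorch tensors (the names safe_log and pi_probs are placeholders for illustration, not the actual variables in the repository):

```python
import torch

def safe_log(p, epsilon=1e-5):
    # Offset the probability before taking the log so that values near zero
    # do not produce -inf (and NaN gradients on the backward pass).
    return torch.log(p + epsilon)

# Hypothetical usage inside the reconstruction loss; increase epsilon if NaN
# still appears late in training.
# loss = -safe_log(pi_probs).mean()
```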

Anyway, you may not need such a huge number of epochs; the model already generates very acceptable drawings after 4000 epochs.

alexis-jacq · Jul 17 '18 10:07

Thanks for your reply. I am training the network on the kanji dataset, and the results are still not acceptable even after 20000 epochs. Have you ever run into this problem?

LJSthu · Jul 17 '18 12:07

I did not try that dataset. It's a harder task, since if you miss a segment or get an angle wrong, you miss the character, but I think David Ha successfully trained sketch-rnn on kanji. Another thing you can try is to change the dropout factor (for the cat drawings I used p=0.9, but it should be much smaller, like 0.1). Let me know.
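As an illustration only (the hyperparameter class and field names below are placeholders, not the ones used in this repository), lowering the dropout factor could look like this:

```python
import torch.nn as nn

class HParams:
    # Placeholder hyperparameters; the repository keeps these in its own
    # settings object and the exact names may differ.
    enc_hidden_size = 256
    dropout = 0.1  # lowered from the 0.9 used for the cat drawings

hp = HParams()

# With a single-layer LSTM, PyTorch only applies the `dropout` argument
# between stacked layers, so an explicit Dropout module on the LSTM outputs
# is one way to make the smaller factor actually take effect.
encoder_rnn = nn.LSTM(input_size=5, hidden_size=hp.enc_hidden_size, bidirectional=True)
output_dropout = nn.Dropout(p=hp.dropout)
```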

PS: can you show an example of a non-acceptable sample and the corresponding target?

alexis-jacq · Jul 17 '18 17:07

I ran your code and the author's code on the kanji dataset, but I found that the KL loss in your code was increasing while it was decreasing with the author's code.

LJSthu · Aug 15 '18 13:08

"but I found that the KL loss in your code was increasing"

Interesting, did you observe this after a high number of epochs, or right from the beginning? (In which case it could be a simple sign mistake, but it would be strange that I could sample nice cats while the KL loss increases.)

alexis-jacq · Aug 15 '18 20:08

It was increasing from the beginning. I think you did a good job and the code is quite clear. I think the way you calculated the KL loss was right and consistent with the paper, but it is still weird. Maybe you could sample nice cats because you were effectively only optimizing the reconstruction loss? By the way, I think there is something wrong with self.eta_step = 1-(1-hp.eta_min)*hp.R. I think it should be self.eta_step = 1-(1-hp.eta_min)*hp.R**step, i.e. hp.R raised to the power of the current training step.
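A minimal sketch of the corrected annealing schedule (the numeric values are placeholders in the spirit of the sketch-rnn paper; only the update rule matters):

```python
class HParams:
    # Placeholder values: eta_min is the starting KL weight,
    # R a decay constant just below 1.
    eta_min = 0.01
    R = 0.99995

hp = HParams()

def eta_step(step):
    # The KL weight grows from eta_min toward 1 as training progresses;
    # with a constant expression (no dependence on step) it would never change.
    return 1 - (1 - hp.eta_min) * hp.R ** step

# Example: the KL weight after 0, 10000 and 100000 training steps.
for s in (0, 10000, 100000):
    print(s, round(eta_step(s), 4))
```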

LJSthu · Aug 16 '18 01:08

@LJSthu I agree, I think there's a minor error in the code. Since both hp.eta_min and hp.R are constants, eta_step is itself a constant and never gets updated during training.

billstark · Oct 21 '18 14:10