Pytorch-Sketch-RNN
Loss became 'nan' after 10000~20000 epochs
Hello, thanks for the issue. I can't run that many epochs on my personal computer, so I can't reproduce it. However, since the loss involves the log of a probability, NaN errors can appear once that probability gets very close to zero (log(0) is -inf).
One solution is to use log(P + epsilon) with epsilon a small positive constant. I believe I already do that, but you could try increasing epsilon.
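A minimal sketch of that guard, assuming the reconstruction loss takes the log of a mixture probability `p` (the names and the epsilon value are illustrative, not the repo's actual code):

```python
import torch

EPS = 1e-5  # illustrative value; increase it if NaNs persist

def safe_log(p: torch.Tensor, eps: float = EPS) -> torch.Tensor:
    # Adding eps before the log keeps the argument strictly positive,
    # so log never returns -inf even when p underflows to 0.
    return torch.log(p + eps)
```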
Anyway, you may not need that many epochs; the model already generates very acceptable drawings after 4000 epochs.
Thanks for your reply. I am training the network on the kanji dataset, and I found the results are not acceptable even after 20000 epochs. Have you encountered this problem?
I did not try that data. It's a harder task: if you miss a segment or get an angle wrong, you miss the character. But I believe David Ha successfully trained sketch-rnn on kanji. Another thing you can try is changing the dropout factor (for the cat drawings I used p=0.9, but it should be much smaller, like 0.1; see the note below). Let me know.
P.S.: Can you show an example of a non-acceptable sample and the corresponding target?
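One likely reason the value should be small (this is PyTorch's documented semantics, not code from this repo): PyTorch's `p` is a drop probability, while TensorFlow's `keep_prob`, which the original sketch-rnn code uses, is a keep probability.

```python
import torch.nn as nn

# PyTorch: p is the probability of *zeroing* a unit.
# TensorFlow's keep_prob is the probability of *keeping* it,
# so keep_prob = 0.9 roughly corresponds to p = 0.1 here.
dropout = nn.Dropout(p=0.1)
```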
I ran your code and the author's code on the kanji dataset, but I found that the KL loss in your code was increasing, while in the author's output it was decreasing.
> but I found that the KL loss in your code was increasing
Interesting, did you observe this after a high number of epochs, or right from the beginning? (In that case it could be a simple sign mistake, but it would be weird that I could sample nice cats while increasing the KL loss.)
It was increasing from the beginning. I think you did a good job and the code is quite clear. I think the way you calculated the KL loss is right and matches the paper, but it is weird. Maybe you could sample nice cats because you were effectively optimizing only the reconstruction loss? By the way, I think there's something wrong with `self.eta_step = 1-(1-hp.eta_min)*hp.R`. I think it should be `self.eta_step = 1-(1-hp.eta_min)*hp.R**step`.
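For reference, a sketch of the KL term as written in the paper (eq. 10 of Ha & Eck, 2017), not the repo's exact code; `sigma_hat` is the predicted log-variance:

```python
import torch

def kl_loss(mu: torch.Tensor, sigma_hat: torch.Tensor) -> torch.Tensor:
    # Eq. (10): L_KL = -1/(2*N_z) * sum(1 + sigma_hat - mu^2 - exp(sigma_hat)).
    # The result is >= 0 and should decrease as q(z|x) approaches N(0, I);
    # if it climbs from the very first epochs, a flipped sign where it is
    # combined with the reconstruction loss is a plausible culprit.
    n_z = mu.size(-1)
    kl = -0.5 / n_z * torch.sum(1 + sigma_hat - mu ** 2 - torch.exp(sigma_hat), dim=-1)
    return kl.mean()
```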
@LJSthu I agree, I think there's a minor error in the code. Since both hp.eta_min and hp.R are constants, eta_step is never actually updated.
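A minimal sketch of the corrected annealing (eq. 11 in the paper: eta_step = 1 - (1 - eta_min) * R**step); the loop and constants are illustrative, with eta_min and R set to the defaults I believe the reference implementation uses:

```python
eta_min, R = 0.01, 0.99995  # illustrative defaults

# eta anneals from eta_min toward 1 as step grows. The recursive form
# below keeps the original one-liner shape and matches the closed form,
# provided eta starts at eta_min and is updated once per training step.
eta = eta_min
for step in range(1, 10001):
    eta = 1 - (1 - eta) * R                                    # recursive update
    assert abs(eta - (1 - (1 - eta_min) * R ** step)) < 1e-9   # closed form agrees
```

In the repo's notation, that recursive form would read `self.eta_step = 1 - (1 - self.eta_step) * hp.R`, so the weight actually changes every step instead of staying constant.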