
Loss sometimes goes to nan even with gradient clipping

Open carpedm20 opened this issue 9 years ago • 6 comments

Haven't figured out why yet; any advice is welcome!

carpedm20 avatar Dec 31 '15 22:12 carpedm20

Not sure if it's related, but try softmax(xxxx + eps).
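
A minimal sketch of one common reading of that fix, i.e. adding a small eps wherever the softmax output later hits a log (the names, values, and shapes below are just placeholders, assuming current TensorFlow):

    import tensorflow as tf

    eps = 1e-8                                # placeholder value
    logits = tf.constant([[2.0, -1.0, 0.5]])  # stand-in for the controller output
    targets = tf.constant([[1.0, 0.0, 0.0]])  # stand-in for the task targets

    probs = tf.nn.softmax(logits)
    # guard the log so a zero probability can't become -inf and poison the gradients
    loss = -tf.reduce_sum(targets * tf.math.log(probs + eps))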


jli05 avatar Dec 31 '15 22:12 jli05

@jli05 Thanks! I'll try it. So far I've only been able to train the NTM without nan loss up to max_length=10. For lengths above 10, I think we'd need more than 100,000 epochs, which differs from the referenced code.

carpedm20 avatar Dec 31 '15 22:12 carpedm20

@carpedm20 In my NTM implementation (and in a couple of others I've seen out there), nans were usually caused by one of the following:

  • Initializing the memory to zero. The memory appears in the denominator of the cosine distance, so an all-zero memory makes it nan. Check whether that's happening in your case; add a small constant to the denominator and initialize the memory to a small constant rather than all zeros (see the sketch below).
  • A negative sharpening value. That creates a complex number and also makes the cost function go nan.

I think there was a third case but I don't remember right now. Good luck debugging! :D
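
For what it's worth, a minimal sketch of both fixes, assuming current TensorFlow (EPS, the names, and the shapes are just placeholders):

    import tensorflow as tf

    EPS = 1e-6  # small constant, placeholder value

    def cosine_similarity(key, memory):
        # key: [batch, width], memory: [batch, rows, width]
        dot = tf.einsum('bw,brw->br', key, memory)
        key_norm = tf.sqrt(tf.reduce_sum(tf.square(key), axis=-1, keepdims=True))
        mem_norm = tf.sqrt(tf.reduce_sum(tf.square(memory), axis=-1))
        # EPS in the denominator keeps an all-zero memory row from producing nan
        return dot / (key_norm * mem_norm + EPS)

    # initialize the memory to a small constant instead of all zeros
    memory = tf.fill([1, 128, 20], 1e-6)   # [batch, rows, width]; sizes are arbitrary
    key = tf.ones([1, 20])
    sims = cosine_similarity(key, memory)  # finite, no nan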

EderSantana avatar Apr 29 '16 01:04 EderSantana

@EderSantana Could you explain what a negative sharpening value means? Thanks

lixiangnlp avatar Jan 05 '17 01:01 lixiangnlp

The sharpening value is used as pow(input, sharpening), so it can't be negative. Use a nonlinearity like softplus to avoid getting negative values: sharpening = tf.nn.softplus(sharpening).

EderSantana avatar Mar 14 '17 21:03 EderSantana

Having a negative sharpening value wouldn't make a real number become imaginary, since a^(-b) = 1/(a^b). But in the paper Graves explicitly states that the sharpening value is >= 1, so softplus(gamma) + 1 would work fine.
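
For example, a sketch of that constraint plus the sharpening step from the paper, assuming current TensorFlow (raw_gamma and w are placeholder values):

    import tensorflow as tf

    raw_gamma = tf.constant([0.3])           # stand-in for the raw controller output
    gamma = 1.0 + tf.nn.softplus(raw_gamma)  # constrains gamma >= 1 as the paper requires

    # sharpening: w_i <- w_i^gamma / sum_j w_j^gamma
    w = tf.constant([[0.1, 0.7, 0.2]])       # stand-in attention weights
    w_pow = tf.pow(w, tf.expand_dims(gamma, -1))
    w_sharp = w_pow / tf.reduce_sum(w_pow, axis=-1, keepdims=True)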

therealjtgill avatar Mar 21 '17 23:03 therealjtgill