ba-dls-deepspeech
Loss becomes nan after a while

Why is this happening, and how can I solve it?
Try using the Keras optimizer rather than Lasagne's.
What would be the exact changes in the code?
Same here. I have edited model.py as follows (rough sketch after the list):
- Comment out import lasagne
- Uncomment from keras.optimizers import SGD
- Comment out grads = lasagne.updates.total_norm_constraint...
- Comment out updates = lasagne.updates.nesterov_momentum...
- Uncomment optimizer = SGD(nesterov=True, lr=learning_rate,...
- Uncomment updates = optimizer.get_updates(...
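Roughly, the edited section of model.py ends up looking something like this. This is a minimal sketch, not the exact diff: the momentum and clipnorm values, and variable names such as cost and trainable_vars, are assumptions for illustration, using the Keras 1.x API.

```python
# model.py (sketch): switch from the Lasagne update rules to the Keras SGD
# optimizer. `cost`, `trainable_vars`, and `learning_rate` are assumed to
# already be defined as in the original training setup.

# import lasagne                                           # commented out
from keras.optimizers import SGD                           # uncommented

# grads = lasagne.updates.total_norm_constraint(...)       # commented out
# updates = lasagne.updates.nesterov_momentum(...)         # commented out

optimizer = SGD(nesterov=True, lr=learning_rate,           # uncommented
                momentum=0.9, clipnorm=100)
# Keras 1.x signature: get_updates(params, constraints, loss)
updates = optimizer.get_updates(trainable_vars, {}, cost)  # uncommented
```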
I'm seeing this issue too. I switched to using the Keras optimizer instead of Lasagne's, making the same changes that @aglotero cited above.
For the first 8990 (out of 12188) iterations the loss was behaving properly. Then, starting at around iteration 9000, I started seeing NaNs:
...
2016-12-07 04:38:33,080 INFO (__main__) Epoch: 0, Iteration: 8960, Loss: 148.405151367
2016-12-07 04:40:38,369 INFO (__main__) Epoch: 0, Iteration: 8970, Loss: 356.538299561
2016-12-07 04:42:43,709 INFO (__main__) Epoch: 0, Iteration: 8980, Loss: 382.034057617
2016-12-07 04:44:49,189 INFO (__main__) Epoch: 0, Iteration: 8990, Loss: 310.213592529
2016-12-07 04:58:47,111 INFO (__main__) Epoch: 0, Iteration: 9000, Loss: nan
Interestingly, the loss spiked at iteration 8960. Here is the plot for the first 9000 iterations.
Some notes: I am using dropout on the RNN layers, which is reflected in the plot, and I increased the amount of training data by raising the max duration to 15.0 seconds. My mini-batch size is 24.
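(For anyone trying to reproduce this: a minimal sketch of what adding dropout to the recurrent layers might look like with the Keras 1.x GRU layer. The 0.2 rates are placeholders, not the values I used.)

```python
from keras.layers import GRU

# Keras 1.x GRU exposes dropout on the input projections (dropout_W)
# and on the recurrent connections (dropout_U); the rates below are
# illustrative assumptions.
recurrent = GRU(1000, return_sequences=True, activation='relu',
                dropout_W=0.2, dropout_U=0.2)
```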
Using the SGD optimizer with the clipnorm=1 option may be a solution.
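For example, a minimal sketch of that change (the learning rate here is the 2e-4 default mentioned elsewhere in this thread; the momentum value is an assumption):

```python
from keras.optimizers import SGD

# Same optimizer as before, but clipping the global gradient norm to 1
# instead of the default 100 used elsewhere in this thread.
optimizer = SGD(lr=2e-4, momentum=0.9, nesterov=True, clipnorm=1)
```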
I was getting a NaN cost at the 400th iteration; now I'm at the 3690th iteration and still running.
I saw a similar issue at https://github.com/fchollet/keras/issues/1244
FWIW I fixed this by dropping the learning rate and removing the dropout layers I added. I left the clipnorm value at 100.
Hi!
I changed the clipnorm to 1 as @aglotero suggested, but with more GRU layers (1 convolutional layer, 7 GRU layers with 1000 nodes each, and 1 fully connected layer).
I found that the loss converges but gets stuck at about 300, and the visualized test results are really bad!
Does that mean the structure is not deep enough, or should I train for more epochs?
Thanks!
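For context, here is a rough sketch of the architecture described above (one 1D convolution, seven GRU layers with 1000 nodes each, and one time-distributed fully connected output layer) in the Keras 1.x style this project uses. This is not necessarily the repo's actual model code; the input/output dimensions, filter length, and stride are assumptions for illustration.

```python
from keras.layers import Input, GRU, TimeDistributed, Dense
from keras.layers.convolutional import Convolution1D
from keras.models import Model

input_dim = 161   # spectrogram features per frame (assumption)
output_dim = 29   # characters plus the CTC blank (assumption)

acoustic_input = Input(shape=(None, input_dim), name='acoustic_input')

# 1D convolution over time; filter length and stride are placeholders
x = Convolution1D(1000, 11, border_mode='valid', subsample_length=2,
                  activation='relu', name='conv1d')(acoustic_input)

# 7 stacked GRU layers with 1000 nodes each
for i in range(7):
    x = GRU(1000, return_sequences=True, activation='relu',
            name='gru_%d' % (i + 1))(x)

# Per-timestep fully connected layer producing character scores
network_output = TimeDistributed(Dense(output_dim),
                                 name='network_output')(x)

model = Model(input=acoustic_input, output=network_output)
```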
@a00achild1 I don't think you want clipnorm set to 1. Were you getting NaNs before with the clipnorm set to a higher value (~100)?
@dylanbfox thanks for the quick response! When the clipnorm is at the default value (100), I get NaNs after some iterations. Then I came across this issue and tried training the model with clipnorm 1.
Why do you think setting clipnorm to 1 is not a good idea? Does such a small clipnorm value hurt performance?
What is your learning rate? Try dropping that and keeping the clipnorm higher.
@dylanbfox my learning rate is 2e-4, the default value. In my experience that is already quite small, but maybe I am wrong. I will try a smaller value while keeping the clipnorm higher. Thanks!
I set the learning rate to 2e-4 and the clipnorm back to 100, trained on LibriSpeech-clean-100, and my model structure is 1 convolutional layer, 7 GRU layers with 1000 nodes each, and 1 fully connected layer, following Baidu's paper.
While the training loss kept dropping, the validation loss started to diverge. The prediction on a test file is better than before, but it still can't produce correct words. Has anyone trained a good model for speech recognition, or does anyone have suggestions? Any suggestion would be really appreciated!
P.S. Could the problem still be the clipnorm? I've been searching for a while, but there doesn't seem to be a principled way to choose the clip value.
I have a similar problem to the ones described above, but my model produces NaN values after the first iteration. I tried changing the optimizer (Keras and Lasagne), the clipnorm (1 and 100), and the learning rate (2e-4 and 0.01), but the cost is still NaN. Can anyone advise on this problem? I would really appreciate a solution. I am using Keras 1.0.7 and Theano rel-0.8.2; if you think these versions are not appropriate, please let me know.
Ex. Keras, learning_rate=2e-4, clipnorm=1
2017-01-09 01:27:52,611 INFO (__main__) Epoch: 0, Iteration: 0, Loss: 241.261184692
2017-01-09 01:28:00,360 INFO (__main__) Epoch: 0, Iteration: 1, Loss: nan
2017-01-09 01:28:07,864 INFO (__main__) Epoch: 0, Iteration: 2, Loss: nan
2017-01-09 01:28:15,374 INFO (__main__) Epoch: 0, Iteration: 3, Loss: nan
2017-01-09 01:28:23,191 INFO (__main__) Epoch: 0, Iteration: 4, Loss: nan
2017-01-09 01:28:31,301 INFO (__main__) Epoch: 0, Iteration: 5, Loss: nan
2017-01-09 01:28:39,587 INFO (__main__) Epoch: 0, Iteration: 6, Loss: nan
2017-01-09 01:28:48,127 INFO (__main__) Epoch: 0, Iteration: 7, Loss: nan
2017-01-09 01:28:56,824 INFO (__main__) Epoch: 0, Iteration: 8, Loss: nan
2017-01-09 01:29:05,442 INFO (__main__) Epoch: 0, Iteration: 9, Loss: nan
2017-01-09 01:29:14,783 INFO (__main__) Epoch: 0, Iteration: 10, Loss: nan
2017-01-09 01:29:23,937 INFO (__main__) Epoch: 0, Iteration: 11, Loss: nan
I just found a solution! You need Keras 1.1.0 or a later version of the Keras package.
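(A minimal sketch of a runtime check for this, in case it helps anyone else hitting the same thing; the assertion message is just illustrative.)

```python
# Verify the installed Keras is at least 1.1.0, since the 1.0.x releases
# reportedly produced NaN losses in this setup.
from distutils.version import LooseVersion
import keras

assert LooseVersion(keras.__version__) >= LooseVersion('1.1.0'), \
    'Please upgrade Keras to 1.1.0 or newer'
```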
@a00achild1 Hey, did you find out why your loss curve turned out like that? I'm currently at that stage.