see: progress stops at 99.99% of this epoch

[Screenshot, 2018-03-18 6:23:40 pm] It has stayed like this for 10 minutes. Is there any problem with it?

Mar 18 '18 11:03 qnkhuat

No, there is no problem. The program is performing a validation of your trained model on the entire validation dataset. This will take a while. Everything is alright.

Mar 19 '18 09:03 Bartzi

I'm also seeing a NaN in your train log. Did you adjust the learning rate to a lower value?

Mar 19 '18 09:03 Bartzi

I didn't change anything. By the way, I encountered the same problem on another remote machine: all of the losses are NaN. [Screenshot, 2018-03-19 9:14:13 pm]

Mar 19 '18 14:03 qnkhuat

In this case you definitely need to adjust the learning rate to 1e-4 or 1e-5.

Mar 19 '18 15:03 Bartzi

It didn't help. I still get NaN.

Mar 20 '18 07:03 qnkhuat

It could be that a division by zero occurs somewhere. If adjusting the learning rate does not help, you could check for that by running Chainer in debug mode.

Mar 20 '18 09:03 Bartzi
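
Before digging into debug mode, it can help to find the first iteration at which the loss stops being finite, since everything after that point is usually NaN as well. The following is a minimal sketch in plain Python; the function name `first_non_finite` and the sample loss values are made up for illustration and are not part of the training script discussed here.

```python
import math


def first_non_finite(losses):
    """Return the index of the first NaN or infinite loss, or None if all are finite."""
    for step, loss in enumerate(losses):
        if not math.isfinite(loss):
            return step
    return None


# Hypothetical per-iteration loss values, for illustration only.
logged_losses = [2.31, 1.87, float("inf"), float("nan"), float("nan")]
bad_step = first_non_finite(logged_losses)

if bad_step is not None:
    print(f"loss became non-finite at iteration {bad_step}")
```

In Chainer itself, debug mode is controlled by the `chainer.config.debug` flag. How and where to set it in this particular training script is not shown in the thread.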

It yielded: Exception in main training loop: Each label t need to satisfy 0 <= t < x.shape[1] or t == -1; Concretely: [Screenshot, 2018-03-21 12:42:19 am]

Funnily enough, when I used debug mode on another machine (which doesn't have the NaN loss), it yielded the same error.

Mar 20 '18 17:03 qnkhuat
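
The condition in that error message can be checked on the labels directly, before they ever reach the loss function. Here is a small sketch, assuming the labels are already integer class indices and that `x.shape[1]` corresponds to the number of classes the network outputs; `invalid_labels`, `num_classes`, and the sample values are hypothetical names and data.

```python
def invalid_labels(labels, num_classes, ignore_label=-1):
    """Return (position, label) pairs that violate
    0 <= t < num_classes or t == ignore_label."""
    return [
        (i, t)
        for i, t in enumerate(labels)
        if not (0 <= t < num_classes or t == ignore_label)
    ]


# Hypothetical example: a network with 36 output classes and one label
# (36) that is one past the largest valid index (35).
sample = [1, 16, 12, 36, -1]
problems = invalid_labels(sample, num_classes=36)

for position, label in problems:
    print(f"label {label} at position {position} is out of range")
```

Any pair reported here points to a label that the network has no output for, which usually means the dataset has more classes than the network was configured with.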

Seems the shapes produced by the network are not as they should be. Are you using your own data?

Mar 21 '18 13:03 Bartzi

Yes, I've created my own data. I trained it on another machine and it doesn't get the NaN there, but it has been stuck at 99.96% for a day :D

Mar 21 '18 16:03 qnkhuat

Then you should check the number of classes your dataset has. Did you adjust the network to fit your number of classes?

How large is your validation set?

Mar 21 '18 16:03 Bartzi

I need to detect 1 text with 17 chars.

Example of my gt: 17 1 $PATH 1GCHTCFE4C8101563

My validation set is 120 MB (3700 images). Is it too big?

Mar 21 '18 16:03 qnkhuat

How many different characters do you want to recognize?

3700 images is not too much for validation. Actually, it should work, and I'm not sure why it doesn't. You can, however, just comment out the epoch evaluator in the training script, and then this should not be a problem anymore.

Mar 21 '18 17:03 Bartzi
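
One way to answer the question about the number of different characters is to count them from the ground-truth strings. The sketch below uses the single example string posted above; `build_char_map` is a hypothetical helper, and the real network may need extra classes (for example a blank or padding label) on top of this count, which is an assumption not confirmed in this thread.

```python
def build_char_map(texts):
    """Map every distinct character in the label strings to a class index.

    Characters are sorted so the mapping is stable across runs.
    """
    charset = sorted(set("".join(texts)))
    return {ch: idx for idx, ch in enumerate(charset)}


# The example label from the ground-truth line quoted in this thread.
gt_texts = ["1GCHTCFE4C8101563"]

char_map = build_char_map(gt_texts)
num_classes = len(char_map)

# Encode the first label as class indices using the map.
encoded = [char_map[ch] for ch in gt_texts[0]]
print(num_classes, encoded)
```

Run over the full dataset rather than one string, the resulting `num_classes` gives a lower bound on the number of outputs the network needs.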

Yeah, but I still get NaN :(

Mar 23 '18 16:03 qnkhuat
