Progress stops at 99.99% of this epoch

No, there is no problem. The program is performing a validation of your trained model on the entire validation dataset. This will take a while. Everything is alright.
I'm also seeing a nan in your train log. Did you adjust the learning rate to a lower value?
I didn't change anything. Btw, I encounter the same problem on another remote machine. All of the losses are nan.
In this case you definitely need to adjust the learning rate to 1e-4 or 1e-5.
It didn't help. I still receive nan.
It could be that a division by zero occurs somewhere... If adjusting the learning rate does not help, you could check for that and use Chainer in debug mode.
It yielded: Exception in main training loop: Each label t need to satisfy 0 <= t < x.shape[1] or t == -1
Concretely:
Funnily enough, when I used debug mode on another machine (which doesn't have the nan loss), it yields the same error.
Seems the shapes produced by the network are not as they should be. Are you using your own data?
Yes. I've created my own data. I trained it on another machine and it doesn't get the nan. But it was stuck at 99.96% for a day :D
Then you should check the number of classes your dataset has. Did you adjust the network to fit your number of classes?
How large is your validation set?
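The constraint from the error message (each label t must satisfy 0 <= t < x.shape[1] or t == -1) can be checked on the labels before training. The following is only a minimal sketch in plain Python. The helper name `find_invalid_labels` and the example values are hypothetical, and `num_classes` stands in for `x.shape[1]`, the number of classes the network outputs.

```python
def find_invalid_labels(labels, num_classes, ignore_label=-1):
    """Return (index, label) pairs that violate
    0 <= t < num_classes or t == ignore_label."""
    return [
        (i, t)
        for i, t in enumerate(labels)
        if t != ignore_label and not (0 <= t < num_classes)
    ]


# Hypothetical example: a network with 36 output classes,
# i.e. x.shape[1] == 36 in the error message.
labels = [0, 5, 35, -1, 36, 40]
bad = find_invalid_labels(labels, num_classes=36)
print(bad)  # [(4, 36), (5, 40)]
```

If this returns anything for your encoded labels, the dataset uses more classes than the network's output layer provides.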
I need to detect 1 text with 17 chars.
17 1 $PATH 1GCHTCFE4C8101563
Example of my gt
My validation set is 120 MB (3700 images). Is it too big?
How many different characters do you want to recognize?
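One way to answer this is to collect the distinct characters directly from the ground-truth file. The sketch below assumes, based only on the single example line in this thread, that the fields are whitespace-separated and that the last field is the text. The helper name `collect_charset` is hypothetical and not part of the project.

```python
def collect_charset(gt_lines):
    """Collect the sorted set of distinct characters used in the text field.

    Assumption: fields are whitespace-separated and the last field
    holds the text to recognize.
    """
    chars = set()
    for line in gt_lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip empty or malformed lines
        chars.update(fields[-1])
    return sorted(chars)


# The example ground-truth line from this thread.
gt_lines = ["17 1 $PATH 1GCHTCFE4C8101563"]
charset = collect_charset(gt_lines)
print(len(charset), "".join(charset))  # 13 0134568CEFGHT
```

Run over the full ground-truth file, this gives the number of different characters, which the network's number of classes has to cover.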
3700 images is not too much for validation. Actually it should work... I'm not sure why it doesn't. You can, however, just comment out the epoch evaluator in the training script and then this should not be a problem anymore.
Yeah, but it still gets nan :(.