Progress stops at 99.99% of this epoch

Open qnkhuat opened this issue 6 years ago • 13 comments

[Image: screen shot 2018-03-18 at 6 23 40 pm] It has stayed like this for 10 minutes. Is there a problem with it?

qnkhuat avatar Mar 18 '18 11:03 qnkhuat

No, there is no problem. The program is performing a validation of your trained model on the entire validation dataset. This will take a while. Everything is alright.

Bartzi avatar Mar 19 '18 09:03 Bartzi

I'm also seeing a NaN in your train log. Did you adjust the learning rate to a lower value?

Bartzi avatar Mar 19 '18 09:03 Bartzi

I didn't change anything. By the way, I encounter the same problem on another remote machine: all of the losses are NaN. [Image: screen shot 2018-03-19 at 9 14 13 pm]

qnkhuat avatar Mar 19 '18 14:03 qnkhuat

In this case you definitely need to adjust the learning rate to 1e-4 or 1e-5.

Bartzi avatar Mar 19 '18 15:03 Bartzi

It didn't help. I still get NaN.

qnkhuat avatar Mar 20 '18 07:03 qnkhuat

It could be that a division by zero occurs somewhere... If adjusting the learning rate does not help, you could check for that by running Chainer in debug mode.
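Chainer's debug mode can be switched on with `chainer.set_debug(True)` or inside a `with chainer.using_config('debug', True):` block. Independent of the framework, the idea of locating the first bad value can be sketched as below. This is only a minimal illustration in plain Python; the helper name `first_non_finite` and the sample loss values are made up for the example and are not part of the training script.

```python
import math


def first_non_finite(losses):
    """Return (index, value) of the first NaN or inf entry, or None if all are finite.

    Hypothetical helper for illustration; not part of Chainer.
    """
    for index, value in enumerate(losses):
        if not math.isfinite(value):
            return index, value
    return None


# Made-up loss history: the third entry is NaN because inf - inf is undefined.
history = [2.31, 1.87, float("inf") - float("inf"), 1.20]
result = first_non_finite(history)
print(result)
```

Knowing the first iteration at which a value stops being finite narrows down which batch or operation to inspect.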

Bartzi avatar Mar 20 '18 09:03 Bartzi

It yielded: Exception in main training loop: Each label t need to satisfy 0 <= t < x.shape[1] or t == -1; Concretely: [Image: screen shot 2018-03-21 at 12 42 19 am]

Funnily enough, when I used debug mode on another machine (which doesn't have the NaN loss), it yielded the same error.

qnkhuat avatar Mar 20 '18 17:03 qnkhuat

It seems the shapes produced by the network are not as they should be. Are you using your own data?

Bartzi avatar Mar 21 '18 13:03 Bartzi

Yes. I've created my own data. I trained it on another machine and it doesn't get the NaN there. But it was stuck at 99.96% for a day :D

qnkhuat avatar Mar 21 '18 16:03 qnkhuat

Then you should check the number of classes your dataset has. Did you adjust the network to fit your number of classes?

How large is your validation set?
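The error message quoted earlier in the thread states the condition each label has to satisfy: `0 <= t < x.shape[1]` or `t == -1`. A rough way to check a dataset against that condition, without running the network, might look like the sketch below. The function name `find_bad_labels`, the value 36 for the number of classes, and the sample labels are assumptions made up for the illustration.

```python
def find_bad_labels(labels, num_classes, ignore_label=-1):
    """Return (position, label) pairs violating 0 <= t < num_classes.

    Labels equal to ignore_label are accepted, mirroring the `t == -1` case
    in the error message. Hypothetical helper for illustration only.
    """
    return [
        (position, label)
        for position, label in enumerate(labels)
        if label != ignore_label and not (0 <= label < num_classes)
    ]


# Made-up example: with 36 output classes, valid labels are 0..35 (or -1).
labels = [0, 5, 36, -1, 12]
bad = find_bad_labels(labels, num_classes=36)
print(bad)  # [(2, 36)]
```

If this reports any entries, either the label encoding or the size of the network's output layer does not match the dataset.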

Bartzi avatar Mar 21 '18 16:03 Bartzi

I need to detect one text line with 17 characters.

Example of my gt: 17 1 $PATH 1GCHTCFE4C8101563

My validation set is 120 MB (3,700 images). Is it too big?

qnkhuat avatar Mar 21 '18 16:03 qnkhuat

How many different characters do you want to recognize?

3,700 images is not too much for validation. Actually, it should work... I'm not sure why it doesn't. You can, however, just comment out the epoch evaluator in the training script, and then this should not be a problem anymore.
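To answer the question of how many different characters occur, one could count the distinct characters in the ground-truth words. The sketch below assumes the words have already been extracted from the gt file (the exact file layout is not shown in this thread) and uses only the single example word posted above; `build_char_map` is a hypothetical helper name.

```python
def build_char_map(words):
    """Map each distinct character to a class index, in sorted order.

    Hypothetical helper for illustration; the real label encoding used by
    the training script may differ.
    """
    chars = sorted(set("".join(words)))
    return {char: index for index, char in enumerate(chars)}


# The only ground-truth word shown in the thread.
words = ["1GCHTCFE4C8101563"]
char_map = build_char_map(words)
print(len(char_map))  # 13
```

Run over the full dataset, the resulting count gives a lower bound on the number of classes the network's output needs; whether an extra class (for example for blank or padding) is required depends on the training setup.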

Bartzi avatar Mar 21 '18 17:03 Bartzi

Yeah. But I still get NaN :(.

qnkhuat avatar Mar 23 '18 16:03 qnkhuat