
Testing: bad results even on the training sample after convergence

Open Alexjap opened this issue 8 years ago • 15 comments

Right now, following the instructions in the README:

Training seems to converge (perplexity around 1 on the toy example), but when we test on the same data (the toy example itself) the results are quite bad. Is anyone else experiencing this behavior? I tried looking into the bucketing part of the code; I'm not sure why the bucketing in evaluation differs from training, but that doesn't seem to be the cause anyway (I tried with the same bucketing and still got bad results). The Keras and TensorFlow versions are the recommended ones (Keras 1.1.1 and TF 0.11.0).

Alexjap avatar Feb 15 '17 04:02 Alexjap

I solved it with the following modification in model.py, line 352, though I have no explanation for why it should be like that... I searched a lot...

```python
# if not forward_only:
if True:
    input_feed[K.learning_phase()] = 1
else:
    input_feed[K.learning_phase()] = 0
```

ddaue avatar Feb 22 '17 18:02 ddaue

Yeah, it would work on the training data, but I don't think it can be considered a fix. Always setting the learning phase to 1 means we are in training mode, so any layer that behaves differently between train and test will be set to train even when we are testing.

Alexjap avatar Feb 23 '17 06:02 Alexjap

Yes. If you set that flag to 1 during the test phase, it basically means that when you receive a test batch, you do the same thing as in training: subtract some mean computed over the test set. While that is not inconsistent between training and testing, doing it is kind of unfair, since presumably we should only use a test point's own information to classify it, without looking at statistics over a batch of test examples. Sorry, I'm busy with a deadline; I will look into the code later.

da03 avatar Feb 24 '17 03:02 da03

This is because of the difference in BatchNormalization behavior between training and testing.

seed93 avatar Feb 27 '17 11:02 seed93
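To make the train/test discrepancy concrete, here is a minimal NumPy sketch of what a BatchNormalization layer does in the two modes (an illustration, not the repo's code): in training mode it normalizes with the current batch's statistics, while in inference mode it uses the running statistics accumulated during training. Forcing `learning_phase = 1` at test time therefore normalizes each test batch by its own mean and variance.

```python
import numpy as np

def batchnorm(x, gamma, beta, running_mean, running_var,
              training, eps=1e-5):
    """Minimal 1-D batch normalization (illustration only).

    Training mode: normalize with the *current batch* statistics.
    Inference mode: normalize with the *running* statistics.
    """
    if training:
        mean = x.mean(axis=0)
        var = x.var(axis=0)
    else:
        mean = running_mean
        var = running_var
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

x = np.array([[1.0], [3.0]])  # a tiny "test batch"
gamma, beta = 1.0, 0.0
running_mean, running_var = np.array([0.0]), np.array([1.0])

train_out = batchnorm(x, gamma, beta, running_mean, running_var, True)
test_out = batchnorm(x, gamma, beta, running_mean, running_var, False)
# The two modes disagree whenever batch stats differ from running stats,
# which is why a model trained and tested with mismatched flags falls apart.
```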

I trained the model to step perplexity = 1.006652, error = 0.0082, then tried to test on the SVT and IIIT5K datasets. For both datasets I got 100% incorrect results, which is totally unexpected. So I used the provided pre-trained model, but still got the same results.

I use Keras 1.1.1 and TF 0.12.1. I used edit distance as well and tried other datasets too. Any help? This is an important project for me, please help.

NourozR avatar Mar 03 '17 10:03 NourozR

Remove `tf.gfile.Exists(ckpt.model_checkpoint_path)` from model.py.

shrazo avatar Mar 04 '17 02:03 shrazo
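For context on why removing that check can help: TensorFlow 0.12 made V2 ("sharded") checkpoints the default, where `ckpt.model_checkpoint_path` is a filename *prefix* (e.g. `translate.ckpt-200`) rather than an actual file on disk, so a plain existence check on it fails and the trained weights are silently never restored. A stdlib-only sketch of the mismatch (the filenames here are illustrative):

```python
import glob
import os
import tempfile

# Simulate a TF V2 ("sharded") checkpoint: the saver writes
# <prefix>.index and <prefix>.data-* files, but no file named
# exactly <prefix>, so an existence check on the bare prefix fails.
ckpt_dir = tempfile.mkdtemp()
prefix = os.path.join(ckpt_dir, "translate.ckpt-200")
for suffix in (".index", ".data-00000-of-00001"):
    open(prefix + suffix, "w").close()

naive_check = os.path.exists(prefix)          # False: no file named exactly `prefix`
robust_check = bool(glob.glob(prefix + "*"))  # True: the shard files do exist
```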

I have the same issue as Alexjap. Has anyone found the root cause?

Keras version: 1.1.1, TensorFlow version: 0.12.1, Windows 10.

I created 3 pictures for 'a', 'b', 'c' (each 31x31) and trained on them, then tested on the same 3 pictures. The results were bad, too. If I modify the code as below, the test results are OK:

```python
# if not forward_only:
if True:
    input_feed[K.learning_phase()] = 1
else:
    input_feed[K.learning_phase()] = 0
```

Train result:

```
2017-03-08 16:35:58,463 root INFO step_time: 1.881249, step_loss: 0.001654, step perplexity: 1.001656
2017-03-08 16:35:58,469 root INFO current_step: 198
2017-03-08 16:36:00,323 root INFO step_time: 1.854232, step_loss: 0.001635, step perplexity: 1.001637
2017-03-08 16:36:00,329 root INFO current_step: 199
2017-03-08 16:36:02,229 root INFO step_time: 1.900263, step_loss: 0.001617, step perplexity: 1.001618
2017-03-08 16:36:02,679 root INFO global step 200 step-time 1.91 loss 0.156341 perplexity 1.17
2017-03-08 16:36:02,679 root INFO Saving model, current_step: 200
```

Test result:

```
2017-03-08 16:37:50,221 root INFO Reading model parameters from ./results/model\translate.ckpt-200
2017-03-08 16:38:00,177 root INFO model is established and start to launch model
2017-03-08 16:38:00,178 root INFO start to test
2017-03-08 16:38:00,178 root INFO Compare word based on edit distance.
2017-03-08 16:38:00,844 root INFO step_time: 0.598397, loss: 1.272859, step perplexity: 3.571049
2017-03-08 16:38:00,847 root INFO 0.000000 out of 1 correct
2017-03-08 16:38:01,183 root INFO step_time: 0.335222, loss: 2.004660, step perplexity: 7.423572
2017-03-08 16:38:01,185 root INFO 0.000000 out of 2 correct
2017-03-08 16:38:01,494 root INFO step_time: 0.308204, loss: 1.537001, step perplexity: 4.650624
2017-03-08 16:38:01,496 root INFO 0.000000 out of 3 correct
```

raoweijin avatar Mar 09 '17 06:03 raoweijin

I think what seed93 said might make sense: it may be related to the batch normalization behavior, but I haven't had time to test without it to see if things change. In the CNN part of the model (the Keras code), we should try removing the BatchNormalization layers and training again. Concretely, the test would be to comment out all the `model.add(layers.BatchNormalization(axis=1))` calls in cnn.py, retrain, and check whether testing on the training data is then consistent. Without batch normalization we should expect slower convergence during training, but it would be enough to check whether it is actually BN that breaks the model at test time. I currently don't have access to a proper machine to try this and am a bit busy.

Alexjap avatar Mar 09 '17 07:03 Alexjap

@Alexjap I used the pull-request code and found this bug. Change

```python
cnn_model = CNN(self.img_data, True)  # (not self.forward_only))
```

to

```python
cnn_model = CNN(self.img_data, not self.forward_only)
```

seed93 avatar Mar 09 '17 07:03 seed93

@seed93 I quickly checked the code you mentioned. If we change the code like that, we set the CNN model to testing mode (frozen weights) when we are training, and vice versa; it looks a bit strange to me.

Alexjap avatar Mar 09 '17 10:03 Alexjap
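For readers following the `forward_only` discussion, the intent of seed93's change can be sketched like this (a hypothetical simplification, not the repo's actual CNN class): `forward_only` is True only at inference time, so negating it yields the training-mode flag the CNN should receive, rather than a hard-coded True.

```python
class CNN:
    """Stand-in for the repo's Keras CNN wrapper (hypothetical)."""
    def __init__(self, img_data, is_training):
        self.img_data = img_data
        # In the real model, this flag would control whether
        # BatchNormalization uses batch statistics (training)
        # or running statistics (inference).
        self.is_training = is_training

def build_cnn(img_data, forward_only):
    # seed93's fix: pass `not forward_only` instead of hard-coding True.
    return CNN(img_data, is_training=not forward_only)

train_cnn = build_cnn("train_images", forward_only=False)  # training run
test_cnn = build_cnn("test_images", forward_only=True)     # inference run
```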

I looked into the code and, while debugging, found a 'false' argument in model.py (lines 204-211). The problem was actually that the system was unable to load the trained model. After editing the code a little, the model I trained is now loaded and working, but accuracy is still low (12-15% on both the SVT and IIIT5K test sets). The remaining problem is with these variables: `batchnormalization_3_running_mean:0 NOT trainable` and `batchnormalization_3_running_std:0 NOT trainable`. This happens because the newer TF and Keras versions can't restore the mean and standard deviation from these two variables, and the same goes for the pre-trained model. And since the models are binary files, there is no room to change them.

Also, in the test phase, the system gives accurate results for the first input of a mini-batch but not for the rest of the data. This was strange to me.

NourozR avatar Mar 14 '17 18:03 NourozR

@raoweijin, I faced the same problem and somehow solved it with this: remove `tf.gfile.Exists(ckpt.model_checkpoint_path)` from model.py. @shraju024 is right.

NourozR avatar Mar 14 '17 18:03 NourozR

Solved this problem with https://github.com/SivanKe/Attention-OCR/pull/1 as suggested by seed93

jvpoulos avatar Apr 07 '17 15:04 jvpoulos

@NourozR I now face the same problem. Can removing `tf.gfile.Exists(ckpt.model_checkpoint_path)` solve it? That method just loads the model.

zj463261929 avatar May 04 '17 05:05 zj463261929

Hi Guys,

Please help me. While training the code with the test data, I only get "Generating first batch"; it never goes on to show the step time and step loss :(. I gave all the parameters mentioned in the training steps.

```
Epoch ........ 0
2018-05-20 08:21:01,333 root INFO Generating first batch)
Epoch ........ 1
2018-05-20 08:21:04,836 root INFO Generating first batch)
Epoch ........ 2
2018-05-20 08:21:08,310 root INFO Generating first batch)
Epoch ........ 3
2018-05-20 08:21:11,780 root INFO Generating first batch)
Epoch ........ 4
```

balajiwix avatar May 20 '18 08:05 balajiwix