
Why is the recognition accuracy different from the paper?

Open zobeirraisi opened this issue 4 years ago • 12 comments

I applied the pre-trained model to the ICDAR15 dataset, but the results are different from the ones reported in the paper.

zobeirraisi avatar Mar 21 '20 21:03 zobeirraisi

Hi @zobeirraisi

I am also interested in this work. It'd be greatly appreciated if you can post the results on datasets that you have tried.

Jyouhou avatar Mar 22 '20 00:03 Jyouhou

> Hi @zobeirraisi
>
> I am also interested in this work. It'd be greatly appreciated if you can post the results on datasets that you have tried.

Hi @Jyouhou These are my results for the ICDAR15 dataset: Link

zobeirraisi avatar Mar 22 '20 00:03 zobeirraisi

Thanks @zobeirraisi. So the actual accuracy is ~71%. We can wait for a response from the authors.

Jyouhou avatar Mar 22 '20 00:03 Jyouhou

There is label noise in the IC15 test set, and I have relabeled it.

fengxinjie avatar Mar 22 '20 00:03 fengxinjie

> Hi @zobeirraisi I am also interested in this work. It'd be greatly appreciated if you can post the results on datasets that you have tried.

> Hi @Jyouhou These are my results for the ICDAR15 dataset: Link

I checked my prediction results, and I don't know why our results differ. For example, word_26_00.png##Kappa##Kappa## word_27_00.png##CAUTION##CAUTION## word_50_00.png##l:HOU##:HOU## ... are all correct in my predictions.

fengxinjie avatar Mar 22 '20 01:03 fengxinjie

> Hi @zobeirraisi I am also interested in this work. It'd be greatly appreciated if you can post the results on datasets that you have tried.

> Hi @Jyouhou These are my results for the ICDAR15 dataset: Link

I think you should crop the test images using coords.txt first, then predict.
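For anyone hitting the same mismatch, here is a minimal sketch of that preprocessing step. The coords.txt line format (filename followed by a 4-point polygon) and the helper name are my own assumptions, not something confirmed for this repo:

```python
# Sketch: crop word images out of the IC15 test pages before running predict.py.
# Assumes each line of coords.txt looks like
#   word_26_00.png x1,y1,x2,y2,x3,y3,x4,y4
# which is a guess; the actual format used by this repo may differ.
import os
from PIL import Image

def crop_words(coords_file, image_dir, out_dir):
    os.makedirs(out_dir, exist_ok=True)
    with open(coords_file, encoding="utf-8") as f:
        for line in f:
            name, coords = line.strip().split(maxsplit=1)
            values = [int(v) for v in coords.replace(",", " ").split()]
            xs, ys = values[0::2], values[1::2]
            box = (min(xs), min(ys), max(xs), max(ys))  # axis-aligned bounding box
            word = Image.open(os.path.join(image_dir, name)).crop(box)
            word.save(os.path.join(out_dir, name))
```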

fengxinjie avatar Mar 22 '20 02:03 fengxinjie

@Jyouhou @zobeirraisi Hi, can you tell us more about your pretrained model?

li10141110 avatar Mar 27 '20 10:03 li10141110

According to my guess, the performance of this implementation should be 85% on IIIT-5K.

delveintodetail avatar Mar 31 '20 09:03 delveintodetail

@delveintodetail Have you trained it? The developer did not reply clearly on the matter of training, e.g. whether he crops the ICDAR words first, or what the preprocessing is.

It is not because of the data preprocessing; the evaluation in this code is wrong.

delveintodetail avatar Apr 01 '20 01:04 delveintodetail

@delveintodetail Is there something wrong in the predict.py file?

li10141110 avatar Apr 01 '20 07:04 li10141110

I have been training this model on the ICDAR 2015 Word Recognition dataset (IC15), with no relabeling of the mislabeled data, using the code provided.

In order to recognize all the characters in the dataset, the vocab used was: `vocab = "<=,.+:;-!?$%#&*' ()@éÉ/\[]0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ>"+'"'+"´"+"΅"`
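For reference, this is roughly how I turn that vocab string into the character/index mapping fed to the model. Reserving index 0 for padding and using '<' / '>' as start/end markers is my own convention here, not necessarily what this repo does:

```python
# Sketch of a char-to-index mapping built from the vocab string above.
# Index 0 is reserved for padding; '<' and '>' act as start / end markers
# (an assumption about this repo's convention, not a confirmed detail).
vocab = "<=,.+:;-!?$%#&*' ()@éÉ/\\[]0123456789abcdefghijklmnopqrstuvwxyz" \
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ>" + '"' + "´" + "΅"

char2idx = {c: i + 1 for i, c in enumerate(vocab)}  # 0 = padding
idx2char = {i: c for c, i in char2idx.items()}

def encode(label):
    # Wrap the word with the start ('<') and end ('>') symbols from the vocab.
    return [char2idx["<"]] + [char2idx[c] for c in label] + [char2idx[">"]]
```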

If one keeps training and relies only on the loss on the test dataset to select models, the model will overfit; I have obtained several models with 100% accuracy on the test dataset. This means that even the mislabeled data is reproduced exactly as the human labeled it, errors included. (Note: the model is only trained on the training dataset, never on the test dataset! However, the models that scored best at inference on the test dataset were the ones saved as training progressed.)

Typically, such models may have relatively poor performance on the training data itself:

- On test data: # wrong: 0, # total: 2077 (0.0% wrong)
- On training data: # wrong: 1959, # total: 4468 (43.85% wrong)

Starting from scratch, training and saving only the models that improve inference performance on both the test data and the training data, one can get results like this after 1533 epochs with batch_size = 64:

- On test data: # wrong: 11, # total: 2077 (0.5% wrong)
- On training data: # wrong: 620, # total: 4468 (13.9% wrong)
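The selection rule I used is essentially the following. `train_one_epoch` and `evaluate` are placeholders for the repo's own training step and wrong-word-rate computation, so treat this as a sketch rather than the actual training loop of this repository:

```python
import torch

def train_with_dual_selection(model, train_one_epoch, evaluate,
                              train_loader, test_loader, num_epochs):
    """Keep a checkpoint only if it improves the wrong-word rate on BOTH sets.

    `train_one_epoch` and `evaluate` are placeholders for the repo's own
    training step and evaluation code (this is a sketch, not the repo's loop).
    """
    best_train_err = best_test_err = float("inf")
    for epoch in range(num_epochs):
        train_one_epoch(model, train_loader)
        train_err = evaluate(model, train_loader)  # fraction of wrong words
        test_err = evaluate(model, test_loader)
        if train_err < best_train_err and test_err < best_test_err:
            best_train_err, best_test_err = train_err, test_err
            torch.save(model.state_dict(), f"checkpoint_epoch{epoch}.pth")
```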

Inspection shows that some of these models give the same answer as the human on some of the mislabeled data, at least on the test dataset.

As training progresses and new models are saved, the inference performance particularly improves on the training dataset while more slowly improving on the testing dataset.

Thus this model seems like overkill on the ICDAR 2015 dataset, and the mislabeling makes comparison difficult.


Update: the model continued training and these are the results:

- Loss on test data during training: 0.006546
- Loss on training data during training: 0.027809
- Inference on test data: # wrong: 0, # total: 2077 (0.0% wrong)
- Inference on training data: # wrong: 129, # total: 4468 (2.887% wrong)

Other training and tests with synthetic images suggest that it does not generalize so well.

gussmith avatar Apr 21 '20 19:04 gussmith

The results above were obtained with the code provided as is. Since then, I realized from my results and from reading other issues that there is apparently an error in the code: it essentially keeps training the network while the validation pass is run. The problem is inherited from the original code in the Annotated Transformer that the authors refer to; see issue #7 ("testloss would lead to model update on eval mode").
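For reference, the usual fix in an Annotated-Transformer-style loop is to make sure the validation pass can never step the optimizer. A minimal sketch, assuming the code uses the `run_epoch` / `SimpleLossCompute` helpers and the `model.generator` attribute from the Annotated Transformer (this repo's code may differ in detail):

```python
import torch

# Validation pass that cannot update the model:
# pass opt=None to SimpleLossCompute so it only computes the loss; with an
# optimizer attached it would also call backward() and step(), which is the
# model-update-during-eval bug discussed in issue #7.
model.eval()
with torch.no_grad():
    val_loss = run_epoch(valid_iter, model,
                         SimpleLossCompute(model.generator, criterion, opt=None))
model.train()
```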

gussmith avatar Apr 23 '20 04:04 gussmith