tesstrain Tesseract prints characters differ from lstmeval

My system info:

OS: Ubuntu Desktop 18.04 LTS (4.15.0-55-generic)

Hi.

I am a beginner and am trying to train Tesseract on some Korean character images for Korean recognition.

To understand how to train with Tesseract 4.0 LSTM, I trained on my data from scratch by following the lines of the Makefile in this Tesstrain step by step, and most of the steps seemed to work fine until creating the traineddata.

These steps are what I did until now. I manually followed the steps instead of running make:

1. I made box files and unicharset by following these lines.
2. I made lstmf files by following these lines.
3. I made two split file lists for training and evaluation by following these lines.
4. Before combining the language model, I downloaded radical-stroke.txt by following this line, and 3 langdata files (kor.punc, kor.numbers, and kor.wordlist) from this link.

   I didn't download the kor.config file because it causes an error saying that chi_tra.traineddata is needed.
5. I combined the language model by following these lines.
6. Then I started LSTM training by following these lines.
7. I tested them. The results are:

lim@ubuntu:~/tools/tesstrain$ usr/bin/lstmeval --traineddata data/kor/kor.traineddata --model data/kor/checkpoints/kor_checkpoint --eval_listfile data/kor/list.eval
data/kor/checkpoints/kor_checkpoint is not a recognition model, trying training checkpoint...
Loaded 1/1 lines (1-1) of document data/ground-truth/kor.malgun.exp249.lstmf
Loaded 1/1 lines (1-1) of document data/ground-truth/kor.malgun.exp228.lstmf
Truth:먹
OCR  :이
Truth:독
OCR  :이
Loaded 1/1 lines (1-1) of document data/ground-truth/kor.malgun.exp197.lstmf
Loaded 1/1 lines (1-1) of document data/ground-truth/kor.malgun.exp41.lstmf
Truth:파
OCR  :이
Truth:신
OCR  :열
... (skip)
At iteration 0, stage 0, Eval Char error rate=133.33333, Word error rate=96.875

There seems to be no problem with the results.

I know the WER is abnormally high, but I thought it didn't matter because I just wanted to check whether the characters recognized by usr/bin/lstmeval are the same as the characters recognized by usr/bin/tesseract for the same image.

I made the traineddata output file.

lim@ubuntu:~/tools/tesstrain$ usr/bin/lstmtraining --stop_training \
  --continue_from data/kor/checkpoints/kor_checkpoint \
  --traineddata data/kor/kor.traineddata \
  --model_output usr/share/tessdata/kor.traineddata

Then I ran tesseract on kor.malgun.exp197.tif. That TIF file was recognized as '이' when I followed step 7 (testing with lstmeval), so I expected the same result.

lim@ubuntu:~/tools/tesstrain$ usr/bin/tesseract data/ground-truth/kor.malgun.exp197.tif stdout -l kor --psm 6 > result

But the actual result was a total mess. As I feared, the recognized characters differed from each other.

Why the characters recognized by lstmeval and tesseract are different? Is it normal?

Thank you...

Oct 10 '19 13:10 ghwn

the characters recognized by lstmeval and tesseract are different

I can confirm this with a test for Devanagari:

Loaded 1/1 lines (1-1) of document data/deva-lstmf/162.deva1.Sanskrit_2003,.exp0.lstmf
Loaded 1/1 lines (1-1) of document data/deva-lstmf/2214.deva1.Aksharyogini.exp0.lstmf
Truth:गूहितुं चित्रांश कुक्कुटनाडीयन्त्र असम्भवस्तु सतोऽनुपपत्तेः ॐ ॥२.३.९॥
OCR  :गृहितुं चित्रांश कुकुटनाडीयन्त्र असम्भवस्तु सतोऽनुपपत्तेः ॐ ॥२.३.९॥
Truth:। ददाशदस्मै दारिद्र्याद्ध्रियम् यकृत्कोपः महाराज धिष्ण्येमे सर्वा
OCR  :। ददाशदस्मै दारिद्र्याद्ध्रियम् यकृत्कोपः महाराज धिष्ण्येमे सर्वा
At iteration 0, stage 0, Eval Char error rate=2.8985507, Word error rate=14.285714

ubuntu@tesseract-ocr:~/tesstrain$ tesseract data/deva-boxtiff/162.deva1.Sanskrit_2003,.exp0.tif - -l devaLayer2.131 --tessdata-dir ./
Page 1
गूहितुं चित्रांश कुकुटनाडीयन्त्र असम्भवस्तु सतोऽनुपपत्तेः ॐ ॥२.३.९॥
ubuntu@tesseract-ocr:~/tesstrain$ tesseract data/deva-boxtiff/2214.deva1.Aksharyogini.exp0.tif - -l devaLayer2.131 --tessdata-dir ./
Page 1
। ददाशदस्मै दारिद्र्याद्ध्रियम् यकृत्कोपः महाराज धिष्ण्येमे सर्वा

lstmeval OCR:

OCR  :गृहितुं चित्रांश कुकुटनाडीयन्त्र असम्भवस्तु सतोऽनुपपत्तेः ॐ ॥२.३.९॥

tesseract OCR:

गूहितुं चित्रांश कुकुटनाडीयन्त्र असम्भवस्तु सतोऽनुपपत्तेः ॐ ॥२.३.९॥

Word which is different: गृहितुं vs गूहितुं
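A comparison like the one above can be automated instead of being read off by eye. The following is only a rough sketch: it assumes the lstmeval log uses the "OCR  :" prefix shown in the excerpts, and that the tesseract output for the same lines has been saved, one line per image and in the same order, to a text file. The function names and file arguments are made up for illustration.

```shell
# Print only the recognized text from an lstmeval log read on stdin.
# Assumes the "OCR  :" line prefix seen in the log excerpts above.
lstmeval_ocr_lines() {
  sed -n 's/^OCR *://p'
}

# Show where the lstmeval output and a saved tesseract output disagree.
# $1: lstmeval log file, $2: text file with tesseract output
# (both are placeholder arguments for this sketch).
show_differences() {
  # diff exits with 1 when the inputs differ; that is not an error here.
  lstmeval_ocr_lines < "$1" | diff - "$2" || true
}
```

With the two outputs saved as, say, lstmeval.log and tesseract.txt, `show_differences lstmeval.log tesseract.txt` prints only the lines that differ.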

Oct 10 '19 15:10 Shreeshrii

Thank you!

Oct 10 '19 23:10 ghwn

@stweil Should this issue be kept open? Should lstmeval and tesseract give same output?

Oct 11 '19 03:10 Shreeshrii

Yes, I think this needs more examination.

Oct 11 '19 05:10 stweil

@stweil I bet that this is a PSM issue again. @JihwanLim @Shreeshrii Could you rerun your tests on the line level setting PSM to 13?

Oct 22 '19 12:10 wrznr

@wrznr I tried it with --psm 13 but it still gives the same result. Only '^L', which is FORM FEED, is shown in most cases.

Oct 25 '19 02:10 ghwn

@JihwanLim Could you provide us with a snippet from your test data? I'd like to reproduce your results...

Oct 25 '19 06:10 wrznr

@wrznr Sorry for making you wait. I attached my Makefile. You can see new targets such as prepare, images, labels, fonts, etc. in the file, but you probably don't need to care about them because they are just for generating new TIFF images. Makefile.zip

Oct 28 '19 00:10 ghwn

@JihwanLim Many thanks. I'll have a look into your data set within the week and get back to you here.

Oct 28 '19 07:10 wrznr

@wrznr Thank you!

Oct 28 '19 08:10 ghwn

@JihwanLim What I meant with snippet from your test data was a small set of image-text pairs in order to examine the deviant behavior of lstmeval.

(Even if I download hangul.ttf I am missing the scripts get_data.py and most importantly hangul-image-generator.py to reproduce your training setup.)

Oct 29 '19 07:10 wrznr

Okay I attached my project.

Extract tesstrain.tar.gz
Required packages are here.
Extract fonts.tar.gz and place malgun.ttf into tesstrain/fonts.
Build your Tesseract into tesstrain/usr.
Start from make unicharset. If you want only images and gt.txt files, enter make prepare.

And let me know if something is missing at any time, thank you. tesstrain.tar.gz fonts.tar.gz

Oct 29 '19 09:10 ghwn

Have you solved this problem? I ran into the same problem: with the same model, I get quite different OCR output than when I use the "tesseract" command. Could you please respond to me? Thanks

Feb 28 '20 13:02 MoleImg

No. We have not. Thanks for pinging. I try to find some time for it next week!

Feb 28 '20 15:02 wrznr

No. We have not. Thanks for pinging. I try to find some time for it next week!

Thanks. I'm struggling with this problem for such a long time, but cannot find the reason/solution. Can you help me with this? Thank you so much

Mar 01 '20 03:03 MoleImg

I think the problem arises when you stop lstmtraining and convert to integer traineddata. You may have used the traineddata generated by tesstrain.sh. You should use the "best" traineddata instead:

lstmtraining \
  --stop_training \
  --convert_to_int \
  --continue_from ../tesstutorial/impact_from_full/impact_checkpoint \
  --traineddata tessdata/best/eng.traineddata \
  --model_output ../tesstutorial/impact_from_full/eng_impact_int.traineddata

May 19 '20 12:05 bilal-rachik

Any updates on this? I have the same problem: the outputs of lstmeval and tesseract with --psm 13 and the same traineddata do not match.

Dec 02 '20 03:12 red-canoe

It seems I have the same issue when training for a couple of old Russian glyphs (which makes it plus-char training from Russian). Actually, this issue takes all the fun out of Tesseract, since I suspect that recognition would be dramatically better with this bug fixed. Can this issue be prioritized, please?

By the way, are there any embedded debug support for the tesseract app which can be activated?

Dec 31 '20 00:12 dvrogozh

By the way, are there any embedded debug support for the tesseract app which can be activated?

yes, you can: build with debugging enabled and then enable any of the debug parameters you can see in tesseract --print-parameters (the most important of which is debug_file – must be non-empty to see any debug messages).
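As a rough sketch of how to find those parameters: the helper below filters a parameter listing down to entries whose name contains "debug". It assumes that each line of the `tesseract --print-parameters` output starts with the parameter name; the function name is made up for this example.

```shell
# Keep only the lines of a parameter listing (read on stdin)
# whose first field, assumed to be the parameter name, contains "debug".
debug_params() {
  awk '$1 ~ /debug/'
}

# Example use (needs tesseract on PATH):
#   tesseract --print-parameters | debug_params
# A parameter found this way can then be set with -c NAME=VALUE, e.g.
#   tesseract input.tif stdout --psm 13 -c debug_file=debug.log
# where input.tif and debug.log are placeholder file names.
```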

Jun 07 '21 21:06 bertsky

Why the characters recognized by lstmeval and tesseract are different? Is it normal?

Yes, it's not unlikely, since the latter is much more complex – e.g. because it contains image preprocessing, page segmentation, multi-model/lang and legacy engine code. The basic function is the same though:

lstmeval →
lstmtester →
lstmtrainer →
LSTMTrainer::PrepareForwardBackward →
LSTMRecognizer::RecognizeLine + LabelsFromOutputs

tesseractmain →
TessBaseAPI::ProcessPage →
TessBaseAPI::Recognize →
Tesseract::recog_all_words →
Tesseract::classify_word_and_language →
Tesseract::classify_word_pass1 →
Tesseract::LSTMRecognizeWord →
LSTMRecognizer::RecognizeLine + LabelsFromOutputs

I concur with @wrznr in surmising this is a PSM issue, but the OP already refuted that by trying PSM 13 to no effect.

@bilal-rachik brings in model finalization (esp. float→int conversion) which could play a role, esp. since the differences IIUC are rather small. Can anyone confirm this by trying without --convert_to_int?

It could also be related to thresholding or image normalization...

Jun 07 '21 22:06 bertsky

@bilal-rachik @bertsky Is this really a tesstrain issue?

Sep 08 '21 06:09 wrznr

Is this really a tesstrain issue?

You are right, this should probably be transferred to the tesseract repo.

Sep 08 '21 06:09 bertsky

Is there any update here? I'm having this issue where I'm using eng.traineddata and I'm getting accurate results for some test .png files using tesseract, but nonsense using lstmeval. This is messing up my training. I'm wondering if, as mentioned above, I have some configs set incorrectly.

Nov 18 '21 16:11 jhartungBE

@jhartungBE all we have at this point are suspicions (what to look for). Have you tried …

PSM=13 / --psm 13
with traineddata from tessdata_best / without --convert_to_int
with/out adding language models, or
with --configfile <(echo thresholding_method 2) / -c thresholding_method=2

… yet?
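To try the tesseract-side suggestions from this list systematically, something like the following could help. It is only a sketch: it prints one command line per variant rather than running anything, it covers only the --psm and thresholding_method suggestions, and it assumes file names without spaces. The function name and its two arguments are placeholders.

```shell
# Print tesseract command lines for the variants suggested above.
# $1: line image, $2: model name as passed to -l (both placeholders).
# Nothing is executed; pipe the output to sh to actually run the commands.
print_variants() {
  image=$1
  model=$2
  echo "tesseract $image stdout -l $model"
  echo "tesseract $image stdout -l $model --psm 13"
  echo "tesseract $image stdout -l $model --psm 13 -c thresholding_method=2"
}
```

For example, `print_variants line.tif kor | sh` would run all three variants in turn, so their output can be compared against the OCR line from lstmeval.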

Nov 18 '21 18:11 bertsky

Thanks for the quick response. Yes I have tried the first two options, but not sure what you mean on the latter two. Here's a simple example that explains my issue. I have these two example image/text pairs in test-ground-truth. I can generate the box/lstmf files using "make lists MODEL_NAME=test PSM=7". I can then run "lstmeval --model eng_tessdata_best/eng.traineddata --eval_listfile data/test/all-lstmf" and I get

Loaded 1/1 lines (1-1) of document data/test-ground-truth/Message_20210917_163.lstmf
Loaded 1/1 lines (1-1) of document data/test-ground-truth/Message_20211012_281.lstmf
Warning: LSTMTrainer deserialized an LSTMRecognizer!
Truth:any 19.25 legs?
OCR  :L
Truth:I got nothing better then 1
OCR  :BE te e  c
At iteration 0, stage 0, Eval Char error rate=94.444444, Word error rate=100

When I run the same image through tesseract in my tesseract repo, for example with: tesseract Message_20211012_281.png test.txt --psm 7, I get a perfect match: "I got nothing better then 1".

Message_20211012_281.gt.txt Message_20210917_163.gt.txt

Nov 18 '21 19:11 jhartungBE

@jhartungBE, like I said in my first comment, the Tesseract standalone CLI has much more than just the bare recognition of lstmeval – and that includes a check and compensation for inverse colours, like in your example.

So that's another issue (in fact, it's no issue IMO).

Nov 18 '21 20:11 bertsky

Okay understood. Thank you. However, I'm failing to understand how I can train tesseract if this is the case and lstm training doesn't really apply to my images the same way that the tesseract engine will? Does this just mean I have to modify my images to pass to lstm training so that they are received the same way the LSTMRecognizer will receive them when I'm running tesseract?

Nov 18 '21 21:11 jhartungBE

Yes, that's what it means. Just install ImageMagick and do a convert input.png -negate output.png
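For a whole directory of ground-truth images, the same command can be applied in a loop. This is a minimal sketch, assuming ImageMagick's `convert` is installed and that all images are PNG files with inverted colours; the function name, the directory arguments and the CONVERT override variable are made up for illustration.

```shell
# Write a colour-inverted copy of every PNG in one directory into another,
# leaving the originals untouched.
# $1: source directory, $2: output directory (placeholder arguments).
# CONVERT may be set to use a different command than ImageMagick's convert.
negate_pngs() {
  src_dir=$1
  out_dir=$2
  mkdir -p "$out_dir"
  for f in "$src_dir"/*.png; do
    [ -e "$f" ] || continue   # skip when the glob matches nothing
    "${CONVERT:-convert}" "$f" -negate "$out_dir/$(basename "$f")"
  done
}
```

Writing to a separate directory keeps the original images available for running the tesseract command itself. The matching .gt.txt files are not copied by this sketch.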

Nov 18 '21 21:11 bertsky

Great, thanks. Appreciate your help

Nov 18 '21 21:11 jhartungBE
