
Tesseract prints characters that differ from lstmeval

Open ghwn opened this issue 5 years ago • 29 comments

My system info:

  • OS: Ubuntu Desktop 18.04 LTS (4.15.0-55-generic)

Hi.

I am a beginner and am trying to train Tesseract on Korean character images for Korean recognition.

To understand how to train with Tesseract 4.0 LSTM, I trained on my data from scratch by following the lines of the Makefile in this Tesstrain repository step by step, and most of the steps seemed to work fine until creating the traineddata.

These are the steps I have taken so far. I followed them manually instead of running make:

  1. I made box files and unicharset by following these lines.

  2. I made lstmf files by following these lines.

  3. I made two split file lists for training and evaluation by following these lines.

  4. Before combining the language model, I downloaded radical-stroke.txt by following this line, and 3 langdata files (kor.punc, kor.numbers, and kor.wordlist) from this link.

    I didn't download the kor.config file because it causes an error saying that chi_tra.traineddata is needed.

  5. I combined the language model by following these lines.

  6. Then I started LSTM training by following these lines.

  7. I tested the checkpoint with lstmeval. The results look like this:

lim@ubuntu:~/tools/tesstrain$ usr/bin/lstmeval --traineddata data/kor/kor.traineddata --model data/kor/checkpoints/kor_checkpoint --eval_listfile data/kor/list.eval
data/kor/checkpoints/kor_checkpoint is not a recognition model, trying training checkpoint...
Loaded 1/1 lines (1-1) of document data/ground-truth/kor.malgun.exp249.lstmf
Loaded 1/1 lines (1-1) of document data/ground-truth/kor.malgun.exp228.lstmf
Truth:먹
OCR  :이
Truth:독
OCR  :이
Loaded 1/1 lines (1-1) of document data/ground-truth/kor.malgun.exp197.lstmf
Loaded 1/1 lines (1-1) of document data/ground-truth/kor.malgun.exp41.lstmf
Truth:파
OCR  :이
Truth:신
OCR  :열
... (skip)
At iteration 0, stage 0, Eval Char error rate=133.33333, Word error rate=96.875

There seems to be no problem with the results.

I know the WER is abnormally high, but I thought it didn't matter because I just wanted to check whether usr/bin/lstmeval and usr/bin/tesseract recognize the same characters for the same image.
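
A character error rate above 100% (133.33 in the log above) is possible when the rate is defined as edit distance divided by the length of the truth text, because extra characters in the OCR output count as insertions. The Python sketch below illustrates that common definition. The function names are hypothetical, and lstmeval's exact counting may differ from this.

```python
def edit_distance(truth: str, ocr: str) -> int:
    """Levenshtein distance: minimum number of insertions, deletions, substitutions."""
    previous = list(range(len(ocr) + 1))
    for i, t_char in enumerate(truth, start=1):
        current = [i]
        for j, o_char in enumerate(ocr, start=1):
            cost = 0 if t_char == o_char else 1
            current.append(min(previous[j] + 1,          # drop a truth character
                               current[j - 1] + 1,       # extra OCR character
                               previous[j - 1] + cost))  # match or substitution
        previous = current
    return previous[-1]


def char_error_rate(pairs) -> float:
    """Percent CER over (truth, ocr) pairs: total edits / total truth characters."""
    pairs = list(pairs)
    truth_chars = sum(len(truth) for truth, _ in pairs)
    if truth_chars == 0:
        raise ValueError("truth text is empty")
    edits = sum(edit_distance(truth, ocr) for truth, ocr in pairs)
    return 100.0 * edits / truth_chars
```

With this definition, one wrong character for a one-character truth gives 100%, and any additional OCR character pushes the rate higher. Python counts code points here, so text with combining marks may be counted differently by other tools.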

  8. I made the traineddata output file.
lim@ubuntu:~/tools/tesstrain$ usr/bin/lstmtraining --stop_training \
--continue_from data/kor/checkpoints/kor_checkpoint \
--traineddata data/kor/kor.traineddata \
--model_output usr/share/tessdata/kor.traineddata
  9. Then I ran tesseract on kor.malgun.exp197.tif. lstmeval recognized this TIF file as '이' in step 7 (testing with lstmeval), so I expected the same result.
lim@ubuntu:~/tools/tesstrain$ usr/bin/tesseract data/ground-truth/kor.malgun.exp197.tif stdout -l kor --psm 6 > result

But the actual result was a total mess. As I had feared, the characters recognized by the two tools differed from each other.

Why are the characters recognized by lstmeval and tesseract different? Is this normal?

Thank you...

ghwn avatar Oct 10 '19 13:10 ghwn

the characters recognized by lstmeval and tesseract are different

I can confirm this with a test for Devanagari:

Loaded 1/1 lines (1-1) of document data/deva-lstmf/162.deva1.Sanskrit_2003,.exp0.lstmf
Loaded 1/1 lines (1-1) of document data/deva-lstmf/2214.deva1.Aksharyogini.exp0.lstmf
Truth:गूहितुं चित्रांश कुक्कुटनाडीयन्त्र असम्भवस्तु सतोऽनुपपत्तेः ॐ ॥२.३.९॥
OCR  :गृहितुं चित्रांश कुकुटनाडीयन्त्र असम्भवस्तु सतोऽनुपपत्तेः ॐ ॥२.३.९॥
Truth:। ददाशदस्मै दारिद्र्याद्ध्रियम् यकृत्कोपः महाराज धिष्ण्येमे सर्वा
OCR  :। ददाशदस्मै दारिद्र्याद्ध्रियम् यकृत्कोपः महाराज धिष्ण्येमे सर्वा
At iteration 0, stage 0, Eval Char error rate=2.8985507, Word error rate=14.285714

ubuntu@tesseract-ocr:~/tesstrain$ tesseract data/deva-boxtiff/162.deva1.Sanskrit_2003,.exp0.tif - -l devaLayer2.131 --tessdata-dir ./
Page 1
गूहितुं चित्रांश कुकुटनाडीयन्त्र असम्भवस्तु सतोऽनुपपत्तेः ॐ ॥२.३.९॥
ubuntu@tesseract-ocr:~/tesstrain$ tesseract data/deva-boxtiff/2214.deva1.Aksharyogini.exp0.tif - -l devaLayer2.131 --tessdata-dir ./
Page 1
। ददाशदस्मै दारिद्र्याद्ध्रियम् यकृत्कोपः महाराज धिष्ण्येमे सर्वा

lstmeval OCR:

OCR  :गृहितुं चित्रांश कुकुटनाडीयन्त्र असम्भवस्तु सतोऽनुपपत्तेः ॐ ॥२.३.९॥

tesseract OCR:

गूहितुं चित्रांश कुकुटनाडीयन्त्र असम्भवस्तु सतोऽनुपपत्तेः ॐ ॥२.३.९॥

Word which is different: गृहितुं vs गूहितुं
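
On longer lines, the one differing word is hard to spot by eye. A small sketch using Python's standard difflib module can list the differing words of two recognized lines. The function name is hypothetical, and splitting on whitespace is an assumption about what counts as a word.

```python
import difflib


def differing_words(first: str, second: str):
    """Return (words_from_first, words_from_second) for every run of words that differs."""
    words_a, words_b = first.split(), second.split()
    matcher = difflib.SequenceMatcher(a=words_a, b=words_b, autojunk=False)
    return [(words_a[i1:i2], words_b[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != "equal"]
```

Calling it with the lstmeval line as `first` and the tesseract line as `second` should report only the first word of each line for the example above.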

Shreeshrii avatar Oct 10 '19 15:10 Shreeshrii

Thank you!

ghwn avatar Oct 10 '19 23:10 ghwn

@stweil Should this issue be kept open? Should lstmeval and tesseract give the same output?

Shreeshrii avatar Oct 11 '19 03:10 Shreeshrii

Yes, I think this needs more examination.

stweil avatar Oct 11 '19 05:10 stweil

@stweil I bet that this is a PSM issue again. @JihwanLim @Shreeshrii Could you rerun your tests on the line level setting PSM to 13?

wrznr avatar Oct 22 '19 12:10 wrznr

@wrznr I tried it with --psm 13, but it still gives the same result. In most cases only '^L' (FORM FEED) is shown.
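
As far as I know, the '^L' is the page separator that tesseract appends to its text output (the page_separator parameter, a form feed by default), so output consisting only of '^L' would mean that no text was recognized at all. A minimal Python sketch for cleaning the output before comparing it with the lstmeval lines; both function names are hypothetical.

```python
def strip_page_separator(output: str, separator: str = "\f") -> str:
    """Remove the page separator and surrounding whitespace from tesseract's text output."""
    return output.replace(separator, "").strip()


def recognized_nothing(output: str) -> bool:
    """True when the output holds no text apart from separators and whitespace."""
    return strip_page_separator(output) == ""
```

Passing -c page_separator="" on the tesseract command line should suppress the separator at the source, assuming the installed version supports that parameter.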

ghwn avatar Oct 25 '19 02:10 ghwn

@JihwanLim Could you provide us with a snippet from your test data? I'd like to reproduce your results...

wrznr avatar Oct 25 '19 06:10 wrznr

@wrznr Sorry for making you wait. I attached my Makefile. It contains new targets such as prepare, images, labels, and fonts, but you probably don't need to care about them because they only generate new TIFF images. Makefile.zip

ghwn avatar Oct 28 '19 00:10 ghwn

@JihwanLim Many thanks. I'll have a look into your data set within the week and get back to you here.

wrznr avatar Oct 28 '19 07:10 wrznr

@wrznr Thank you!

ghwn avatar Oct 28 '19 08:10 ghwn

@JihwanLim What I meant by a snippet from your test data was a small set of image-text pairs in order to examine the deviant behavior of lstmeval.

(Even if I download hangul.ttf I am missing the scripts get_data.py and most importantly hangul-image-generator.py to reproduce your training setup.)

wrznr avatar Oct 29 '19 07:10 wrznr

Okay I attached my project.

  1. Extract tesstrain.tar.gz
  2. Required packages are here.
  3. Extract fonts.tar.gz and place malgun.ttf into tesstrain/fonts.
  4. Build your Tesseract into tesstrain/usr.
  5. Start from make unicharset. If you want only images and gt.txt files, enter make prepare.

Let me know at any time if something is missing. Thank you. tesstrain.tar.gz fonts.tar.gz

ghwn avatar Oct 29 '19 09:10 ghwn

Have you solved this problem? I have run into the same thing: I get quite different OCR output compared with what I get from the "tesseract" command, using the same model. Could you please respond? Thanks

MoleImg avatar Feb 28 '20 13:02 MoleImg

No, we have not. Thanks for pinging. I'll try to find some time for it next week!

wrznr avatar Feb 28 '20 15:02 wrznr

No, we have not. Thanks for pinging. I'll try to find some time for it next week!

Thanks. I've been struggling with this problem for a long time but cannot find the cause or a solution. Can you help me with this? Thank you so much

MoleImg avatar Mar 01 '20 03:03 MoleImg

I think the problem arises when you stop lstmtraining and convert to an integer traineddata. You may have used the traineddata generated by tesstrain.sh. You had better use the "best" traineddata instead:

lstmtraining \
--stop_training \
--convert_to_int \
--continue_from ../tesstutorial/impact_from_full/impact_checkpoint \
--traineddata tessdata/best/eng.traineddata \
--model_output ../tesstutorial/impact_from_full/eng_impact_int.traineddata

bilal-rachik avatar May 19 '20 12:05 bilal-rachik

Any updates on this? I have the same problem: lstmeval and tesseract (with --psm 13) do not match, even with the same traineddata.

red-canoe avatar Dec 02 '20 03:12 red-canoe

It seems I have the same issue when training a couple of old Russian glyphs (which makes it plus-char training starting from Russian). This issue really takes the fun out of Tesseract, since I suspect that recognition would be dramatically better with this bug fixed. Can this issue be prioritized, please?

By the way, is there any built-in debug support in the tesseract app that can be activated?

dvrogozh avatar Dec 31 '20 00:12 dvrogozh

By the way, is there any built-in debug support in the tesseract app that can be activated?

yes, you can: build with debugging enabled and then enable any of the debug parameters you can see in tesseract --print-parameters (the most important of which is debug_file – must be non-empty to see any debug messages).

bertsky avatar Jun 07 '21 21:06 bertsky

Why are the characters recognized by lstmeval and tesseract different? Is this normal?

Yes, it's not unlikely, since the latter is much more complex – e.g. because it contains image preprocessing, page segmentation, multi-model/lang and legacy engine code. The basic function is the same though:

  • lstmeval
    lstmtester
    lstmtrainer
    LSTMTrainer::PrepareForwardBackward
    LSTMRecognizer::RecognizeLine + LabelsFromOutputs
  • tesseractmain
    TessBaseAPI::ProcessPage
    TessBaseAPI::Recognize
    Tesseract::recog_all_words
    Tesseract::classify_word_and_language
    Tesseract::classify_word_pass1
    Tesseract::LSTMRecognizeWord
    LSTMRecognizer::RecognizeLine + LabelsFromOutputs

I concur with @wrznr in surmising this is a PSM issue, but the OP already refuted that by trying PSM 13 to no effect.

@bilal-rachik brings in model finalization (esp. float→int conversion) which could play a role, esp. since the differences IIUC are rather small. Can anyone confirm this by trying without --convert_to_int?
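
One way to test the float→int suspicion is to finalize the same checkpoint twice, once with and once without --convert_to_int, and run tesseract with both results. The Python sketch below only builds the two command lines from the flags already shown in this thread. The function name and the output file names are hypothetical.

```python
def finalize_command(checkpoint, start_traineddata, output, convert_to_int=False):
    """Build an `lstmtraining --stop_training` command as an argument list."""
    args = ["lstmtraining", "--stop_training"]
    if convert_to_int:
        args.append("--convert_to_int")
    args += ["--continue_from", checkpoint,
             "--traineddata", start_traineddata,
             "--model_output", output]
    return args


# Paths follow the original post; the two output names are made up for illustration.
float_cmd = finalize_command("data/kor/checkpoints/kor_checkpoint",
                             "data/kor/kor.traineddata",
                             "kor_float.traineddata")
int_cmd = finalize_command("data/kor/checkpoints/kor_checkpoint",
                           "data/kor/kor.traineddata",
                           "kor_int.traineddata",
                           convert_to_int=True)
```

Each list can be passed to subprocess.run(cmd, check=True). If only the integer model deviates from lstmeval, the conversion is the likely culprit.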

It could also be related to thresholding or image normalization...

bertsky avatar Jun 07 '21 22:06 bertsky

@bilal-rachik @bertsky Is this really a tesstrain issue?

wrznr avatar Sep 08 '21 06:09 wrznr

Is this really a tesstrain issue?

You are right, this should probably be transferred to the tesseract repo.

bertsky avatar Sep 08 '21 06:09 bertsky

Is there any update here? I'm having this issue where I'm using eng.traineddata and I'm getting accurate results for some test .png files using tesseract, but nonsense using lstmeval. This is messing up my training. I'm wondering if, as mentioned above, I have some configs set incorrectly.

jhartungBE avatar Nov 18 '21 16:11 jhartungBE

@jhartungBE all we have at this point are suspicions (what to look for). Have you tried …

  • PSM=13 / --psm 13
  • with traineddata from tessdata_best / without --convert_to_int
  • with/out adding language models, or
  • with --configfile <(echo thresholding_method 2) / -c thresholding_method=2

… yet?
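
To work through the tesseract-side suggestions systematically, the command line can be generated for every combination of settings. This is a rough Python sketch: the function name and the default value lists are assumptions, and thresholding_method is only available in Tesseract versions that have this parameter.

```python
from itertools import product


def tesseract_variants(image, lang, psms=(7, 13), thresholding_methods=(0, 2)):
    """Yield one tesseract argument list per combination of PSM and thresholding method."""
    for psm, method in product(psms, thresholding_methods):
        yield ["tesseract", image, "stdout", "-l", lang,
               "--psm", str(psm),
               "-c", f"thresholding_method={method}"]
```

Running each variant on a line image that lstmeval has already scored shows which setting, if any, brings the two outputs together.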

bertsky avatar Nov 18 '21 18:11 bertsky

Thanks for the quick response. Yes, I have tried the first two options, but I'm not sure what you mean by the latter two. Here's a simple example that shows my issue. I have these two example image/text pairs in test-ground-truth. I can generate the box/lstmf files using "make lists MODEL_NAME=test PSM=7". I can then run "lstmeval --model eng_tessdata_best/eng.traineddata --eval_listfile data/test/all-lstmf" and I get:

Loaded 1/1 lines (1-1) of document data/test-ground-truth/Message_20210917_163.lstmf
Loaded 1/1 lines (1-1) of document data/test-ground-truth/Message_20211012_281.lstmf
Warning: LSTMTrainer deserialized an LSTMRecognizer!
Truth:any 19.25 legs?
OCR  :L
Truth:I got nothing better then 1
OCR  :BE te e  c
At iteration 0, stage 0, Eval Char error rate=94.444444, Word error rate=100

When I run tesseract in my repo, for example with "tesseract Message_20211012_281.png test.txt --psm 7", I get a perfect match: "I got nothing better then 1".

Message_20211012_281.gt.txt Message_20211012_281 Message_20210917_163.gt.txt Message_20210917_163

jhartungBE avatar Nov 18 '21 19:11 jhartungBE

@jhartungBE, like I said in my first comment, the Tesseract standalone CLI has much more than just the bare recognition of lstmeval – and that includes a check and compensation for inverse colours, like in your example.

So that's another issue (in fact, it's no issue IMO).

bertsky avatar Nov 18 '21 20:11 bertsky

Okay, understood. Thank you. However, I fail to understand how I can train tesseract if lstm training doesn't treat my images the same way the tesseract engine will. Does this mean I have to modify the images I pass to lstm training so that they arrive the same way the LSTMRecognizer receives them when I'm running tesseract?

jhartungBE avatar Nov 18 '21 21:11 jhartungBE

Yes, that's what it means. Just install ImageMagick and do a convert input.png -negate output.png
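
For a whole folder of ground-truth images, the same command can be generated per file. A minimal Python sketch, assuming ImageMagick's convert is on the PATH (newer ImageMagick versions may call it magick); the function name and directory names are hypothetical.

```python
from pathlib import Path


def negate_commands(images, target_dir):
    """Build one `convert <in> -negate <out>` argument list per input image."""
    target = Path(target_dir)
    return [["convert", str(Path(image)), "-negate", str(target / Path(image).name)]
            for image in images]
```

For example, negate_commands(sorted(Path("data/test-ground-truth").glob("*.png")), "negated") yields commands that can be run with subprocess.run(cmd, check=True) once the target directory exists. Keeping the original file names means the existing .gt.txt files still pair with the negated images after they are copied alongside.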

bertsky avatar Nov 18 '21 21:11 bertsky

Great, thanks. Appreciate your help

jhartungBE avatar Nov 18 '21 21:11 jhartungBE