
My trained model can't recognize some lines well

Open josef821 opened this issue 2 years ago • 6 comments

Environment

  • Tesseract Version: 5.2.0-1
  • Platform: Windows 64-bit and Ubuntu 20.04

Current Behavior:

I trained my own traineddata from scratch with 800K lines. In some images like this one, it can't recognize the lines of a column (psm 6 or 4). It works well when I select each line separately. Ray's Arabic and fas (Farsi) traineddata work well (they recognize the lines correctly).

MyIncorrect Original Image for test : 100

Expected Behavior:

I assumed Tesseract finds the lines by image processing and then recognizes each line as if with psm 7. But I see that the traineddata affects how the lines of a column are detected. What should I do in the training phase to solve this problem?
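For readers who want to try the "select every line separately" workaround mentioned above, here is a minimal sketch of splitting a page into line bands with a horizontal projection profile. It is not how Tesseract's own layout analysis works; it only illustrates the idea of finding lines by image processing before recognizing each crop with psm 7. The function name `find_line_bands` and the tiny 0/1 page are hypothetical, and a real page would first need binarization and probably deskewing.

```python
def find_line_bands(binary_rows, min_ink=1):
    """Return (top, bottom) row ranges, bottom exclusive, that contain ink.

    binary_rows is a list of pixel rows where 1 means ink and 0 means
    background. A row counts as part of a text line when it holds at
    least min_ink ink pixels.
    """
    bands = []
    start = None
    for y, row in enumerate(binary_rows):
        has_ink = sum(row) >= min_ink
        if has_ink and start is None:
            start = y                      # a new band begins here
        elif not has_ink and start is not None:
            bands.append((start, y))       # the band ended on the previous row
            start = None
    if start is not None:                  # band runs to the bottom edge
        bands.append((start, len(binary_rows)))
    return bands


# Hypothetical 5x4 page: two text lines separated by one blank row.
page = [
    [0, 0, 0, 0],
    [1, 1, 0, 1],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
    [1, 0, 0, 1],
]
bands = find_line_bands(page)
print(bands)
```

Each band could then be cropped from the original image and passed to Tesseract with `--psm 7`.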

Suggested Fix:

I tried adjusting some parameters and found the ones below. Setting any one of them improves the output, but how does tessdata_best work without these parameters? textord_min_xheight=0 textord_really_old_xheight=1 textord_old_xheight=1
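As a sketch of how these parameters might be applied from a script, the snippet below builds a `tesseract` command line that passes each one with `-c name=value`. The image name `page.png`, the output base `out`, and the language name `mymodel` are placeholders, not files from this issue, and the helper `build_tesseract_command` is hypothetical.

```python
import shlex


def build_tesseract_command(image, output_base, lang, psm, params):
    """Build an argument list for the tesseract CLI.

    Each entry of params becomes one "-c name=value" pair.
    """
    cmd = ["tesseract", image, output_base, "-l", lang, "--psm", str(psm)]
    for name, value in params.items():
        cmd += ["-c", f"{name}={value}"]
    return cmd


# Parameter values taken from the post above.
params = {
    "textord_min_xheight": 0,
    "textord_really_old_xheight": 1,
    "textord_old_xheight": 1,
}

# "page.png", "out" and "mymodel" are placeholder names.
cmd = build_tesseract_command("page.png", "out", "mymodel", 6, params)
print(shlex.join(cmd))
```

With Tesseract installed, the list can be run with `subprocess.run(cmd, check=True)`.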

I also tried adjusting the x-height in my training data, but that did not solve the problem.

Files:

  • My example ground truth: fas-ground-truth.zip (all numbers are English)
  • My traineddata: MyTrainedData.zip

josef821 avatar Sep 30 '22 09:09 josef821

Use combine_tessdata to extract a traineddata file. Compare your ara/fas config file to the official one.
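A rough sketch of that comparison: after unpacking each file with `combine_tessdata -u <lang>.traineddata <output_prefix>`, the config components (if present) are plain text with one `name value` pair per line, so they can be parsed and diffed. The helpers `parse_config` and `diff_configs` are hypothetical, and the sample config contents below are made up for illustration; they are not the contents of the official ara config.

```python
def parse_config(text):
    """Parse Tesseract config text ("name value" per line) into a dict.

    Blank lines and lines starting with '#' are skipped.
    """
    params = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)        # split on the first whitespace run
        name = parts[0]
        value = parts[1].strip() if len(parts) > 1 else ""
        params[name] = value
    return params


def diff_configs(mine, official):
    """Return {name: (my_value, official_value)} for entries that differ."""
    keys = sorted(set(mine) | set(official))
    return {
        k: (mine.get(k), official.get(k))
        for k in keys
        if mine.get(k) != official.get(k)
    }


# Made-up sample contents, for illustration only.
official_text = "# sample\ntextord_min_xheight 0\ntextord_old_xheight 1\n"
mine_text = "textord_min_xheight 10\n"

diff = diff_configs(parse_config(mine_text), parse_config(official_text))
print(diff)
```

A value of `None` on one side means the parameter is missing from that config file.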

amitdo avatar Sep 30 '22 11:09 amitdo

I did that. fas has no config file and works well. ara has a config file, but it works well even without it. I extracted all the components and then rebuilt the traineddata with only lstm, lstm-recoder and lstm-unicharset, removing the other files (version, wordlist, etc.), and it still works well. So I assume it's all about the training data. Do I need to go back to an earlier version such as 4.00.00 and train with it, as was done for the official models?

josef821 avatar Sep 30 '22 11:09 josef821

If the official model works well without a config file and your custom model does not, I don't know what's causing this issue and how it can be solved.

amitdo avatar Sep 30 '22 11:09 amitdo

Are the official best files done by Ray? Do you know how the training images were produced: with text2image or with a custom line-image generator?

josef821 avatar Sep 30 '22 11:09 josef821

Are the official best files done by Ray?

Yes.

For your other questions, I don't know.

amitdo avatar Sep 30 '22 11:09 amitdo

Would you mind sharing how to train Tesseract on a custom dataset?

ramdhan1989 avatar Oct 27 '22 13:10 ramdhan1989