Fine Tuning results in wrong transcription

Open jaddoughman opened this issue 6 years ago • 16 comments

Environment:

    tesseract 4.0.0
    leptonica-1.76.0
    libjpeg 9c : libpng 1.6.35 : libtiff 4.0.9 : zlib 1.2.11
    Found AVX2
    Found AVX
    Found SSE

Platform:

Darwin Kernel Version 18.2.0 ; RELEASE_X86_64 x86_64

Current Behavior:

Tesseract 4.0 with the best ara.traineddata file recalls about 85% of my data, which is pretty good. I am trying to improve on this by fine tuning ("Fine Tuning for Impact"). Since my training data consists of text line images, I used the GitHub project OCR-D Train to generate the .box and .lstmf files required for training. I then trained Tesseract on a couple of lines for 400 iterations, but the transcription generated by the fine-tuned model looks a lot like "ل.َ1ح*جُ ح( .َو!ة.اع5 ّة'عآة'ا ن'جة.!ع. ”.َئءؤئجآ| ن!.5ل". I exhausted the options I knew of by training with max_iterations 0 and a low target_error_rate, but the results were similar.

The transcription generated by the new model can be found below (Fine Tuned.txt): Fine Tuned.txt

The transcription generated by the original Arabic model can be found below (Arabic Trained Model.txt): Arabic Trained Model.txt

The fine tuned model can be found below (test1.traineddata): test1.traineddata.zip

I also attempted to train from scratch using 4000 text line images, but that was not enough data to make a difference, and training from scratch does not seem sensible when your trained model already recalls more than 80% of my data.

A sample of my training data which includes the .box and .lstmf is attached below: training data.zip

jaddoughman avatar Nov 27 '18 13:11 jaddoughman

Questions:

  1. Have you also tested with traineddata from tessdata_fast repo? What accuracy do you get with it?

  2. Are the errors related to recognition of Arabic numbers?

Shreeshrii avatar Nov 27 '18 14:11 Shreeshrii

The traineddata files provided by Tesseract, both _fast and _best, give very good accuracy (above 80%), but I want to raise it above 90%. The errors I want to fix involve characters, not numbers. I think the issue is with the generated .box files. Can you have a look and check whether anything seems off? training data.zip

jaddoughman avatar Nov 27 '18 14:11 jaddoughman

@jbreiden It seems to me that we are missing some key item regarding Arabic training. There have been multiple reports of users being unsuccessful in improving results with fine tuning.

The training text in langdata_lstm/ara is only 80 lines or so. Can you please check with Ray regarding the correct data to use for fine tuning Arabic, and any pointers he may be able to provide?

Thanks!!!

Shreeshrii avatar Nov 27 '18 14:11 Shreeshrii

A box file generated by ocrd-train has the same coordinates for every letter in each word (example below). Does this create an issue when training?

    م 0 0 894 67 0
    ب 0 0 894 67 0
    ا 0 0 894 67 0
    ر 0 0 894 67 0
      0 0 894 67 0
    ي 0 0 894 67 0
    ت 0 0 894 67 0
    ع 0 0 894 67 0
    ه 0 0 894 67 0
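The six-field layout of the sample (symbol, left, bottom, right, top, page) can be checked with a few lines of Python. This is only a sketch: `Box`, `parse_box_line` and `shares_one_box` are hypothetical helper names, not part of Tesseract or ocrd-train. As far as I understand the LSTM training documentation, boxes only need to cover the text line, so identical coordinates within a line are not by themselves an error; nobody in this thread confirms that.

```python
from typing import List, NamedTuple


class Box(NamedTuple):
    """One entry of a box file: symbol plus left, bottom, right, top, page."""
    symbol: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line: str) -> Box:
    # Split from the right so a symbol that is itself a space survives.
    symbol, *numbers = line.rstrip("\n").rsplit(" ", 5)
    left, bottom, right, top, page = (int(n) for n in numbers)
    return Box(symbol or " ", left, bottom, right, top, page)


def shares_one_box(boxes: List[Box]) -> bool:
    """True when every entry carries the same coordinates (a line-level box)."""
    return len({b[1:] for b in boxes}) <= 1


# Three entries in the style of the sample; the last one is the space symbol.
sample = ["\u0645 0 0 894 67 0", "\u0628 0 0 894 67 0", "  0 0 894 67 0"]
boxes = [parse_box_line(entry) for entry in sample]
print(shares_one_box(boxes))
```

A check like this only tells you whether the file uses line-level boxes; it says nothing about whether the symbols are in the order the trainer expects.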

jaddoughman avatar Nov 27 '18 14:11 jaddoughman

@theraysmith

jaddoughman avatar Nov 27 '18 14:11 jaddoughman

Below are a couple of iterations showing the ALIGNED TRUTH and BEST OCR TEXT. I would like to know what causes the difference between them.

screenshot from 2018-11-28 10-21-54

    Iteration 1738: ALIGNED TRUTH : اوااوووضواضاضااعاعع اللفنادادققدق وممطططالالب ااصحاحااب
    Iteration 1738: BEST OCR TEXT : اامام للل لل

jaddoughman avatar Nov 28 '18 08:11 jaddoughman

Any help would be greatly appreciated. @theraysmith @madhumurali2295 @Shreeshrii

jaddoughman avatar Nov 28 '18 08:11 jaddoughman

See #735

amitdo avatar Nov 28 '18 13:11 amitdo

This helps in understanding the debug procedure, but doesn't address my issue. My issue is with the results of the fine tuning. Any help would be greatly appreciated. @amitdo

jaddoughman avatar Nov 28 '18 13:11 jaddoughman

"I used the GitHub project OCR-D Train to generate the .box and .lstmf files required for training"

Do they handle bidi text?

amitdo avatar Nov 28 '18 14:11 amitdo

The generated .box and .lstmf files, alongside the corresponding .tif images, are attached below. Can't we tell from the generated .box files whether it supports bidi text? @amitdo training data.zip

jaddoughman avatar Nov 28 '18 17:11 jaddoughman

The chars in the box files need to be in visual order from left to right, but the chars in your box files are in logical order from right to left.
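To make the distinction concrete, here is a rough Python sketch of converting a line from logical order to left-to-right visual order. It is a deliberate simplification, not the Unicode Bidirectional Algorithm: it reverses the line and then restores runs of Latin letters and digits, and it ignores combining marks, mirrored characters and explicit embeddings. `logical_to_visual` is a hypothetical name; a complete implementation such as fribidi is preferable for real data.

```python
import unicodedata

# Bidi classes of right-to-left letters (Hebrew "R", Arabic "AL").
RTL_CLASSES = {"R", "AL"}
# Classes whose runs keep left-to-right order inside a right-to-left line.
LTR_CLASSES = {"L", "EN", "AN"}


def logical_to_visual(line: str) -> str:
    """Approximate left-to-right visual order for a mostly right-to-left line."""
    classes = [unicodedata.bidirectional(ch) for ch in line]
    if not any(c in RTL_CLASSES for c in classes):
        return line  # no right-to-left letters, nothing to reorder
    out = []
    run = []
    for ch in reversed(line):
        if unicodedata.bidirectional(ch) in LTR_CLASSES:
            run.append(ch)  # collect a reversed number or Latin run
        else:
            out.extend(reversed(run))  # put the run back in reading order
            run.clear()
            out.append(ch)
    out.extend(reversed(run))
    return "".join(out)


# Three Arabic letters followed by a number, stored in logical order.
logical = "\u0645\u0628\u0627 12"
visual = logical_to_visual(logical)
```

In this example the Arabic letters end up reversed while the number "12" keeps its digit order and moves to the left edge, which is how the line would be laid out on the page.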

@kba

amitdo avatar Nov 28 '18 17:11 amitdo

"The training text in langdata_lstm/ara is only 80 lines or so."

@Shreeshrii, please report this specific issue in: https://github.com/tesseract-ocr/langdata_lstm

amitdo avatar Nov 28 '18 22:11 amitdo

@amitdo @Shreeshrii

I fixed the RTL issue using fribidi. My dataset is now in LTR order, and I generated the necessary box and lstmf files. How many text lines do I need to fine tune the existing _best Arabic model? How many iterations should I run?

Makefile used is attached below: Makefile.zip

jaddoughman avatar Dec 05 '18 18:12 jaddoughman

@jaddoughman - Are you able to get the desired accuracy now? I am facing the same issue, but with English. I looked at your training data, and my images are of similar quality. Any help would be appreciated.

harshaneekhra avatar Jan 23 '20 12:01 harshaneekhra

Facing the same issue. @jaddoughman @harshaneekhra, did you find a solution?

forzagreen avatar Dec 25 '22 16:12 forzagreen

Performance improves when training with the option START_MODEL=ara.

As described in the tesstrain README:

    START_MODEL        Name of the model to continue from. Default: ''
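A minimal sketch of assembling such a tesstrain invocation from Python follows. Only START_MODEL is quoted in this thread; MODEL_NAME, TESSDATA and MAX_ITERATIONS are variable names I recall from the tesstrain README, so treat them as assumptions and check them against your copy. `tesstrain_command` is a hypothetical helper, and the path is a placeholder.

```python
import shlex
from typing import List, Optional


def tesstrain_command(model_name: str, start_model: str, tessdata_dir: str,
                      max_iterations: Optional[int] = None) -> List[str]:
    """Build the argument list for a tesstrain 'make training' run."""
    cmd = [
        "make", "training",
        f"MODEL_NAME={model_name}",    # assumed variable: name of the new model
        f"START_MODEL={start_model}",  # model to continue from, e.g. ara
        f"TESSDATA={tessdata_dir}",    # assumed variable: dir holding ara.traineddata
    ]
    if max_iterations is not None:
        cmd.append(f"MAX_ITERATIONS={max_iterations}")  # assumed variable
    return cmd


# Values taken from the thread where available; the path is a placeholder.
cmd = tesstrain_command("test1", "ara", "/path/to/tessdata_best", 400)
print(shlex.join(cmd))
```

The printed line can be pasted into a shell inside a tesstrain checkout, or the list can be passed to `subprocess.run(cmd, check=True)`.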

forzagreen avatar Jan 05 '23 12:01 forzagreen