Fine Tuning results in wrong transcription
Environment:
tesseract 4.0.0 leptonica-1.76.0 libjpeg 9c : libpng 1.6.35 : libtiff 4.0.9 : zlib 1.2.11 Found AVX2 Found AVX Found SSE
Platform:
Darwin Kernel Version 18.2.0 ; RELEASE_X86_64 x86_64
Current Behavior:
Tesseract 4.0 using the best ara.traineddata file is recalling about 85% of the data, which is pretty good. I'm attempting to fine-tune Tesseract for impact. I used the GitHub project OCR-D Train to generate the .box and .lstmf files required for training, since my training data is composed of text line images. After generating the required .box and .lstmf files, I fine-tuned on a couple of lines for 400 iterations, but the transcription generated by the fine-tuned model looks a lot like "ل.َ1ح*جُ ح( .َو!ة.اع5 ّة'عآة'ا ن'جة.!ع. ”.َئءؤئجآ| ن!.5ل". I exhausted the possibilities by training with max_iterations 0 and a low target_error_rate, but the results were similar.
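For context, the setup described above follows the standard fine-tuning recipe from the Tesseract training docs. A sketch (file names and paths here are placeholders for the poster's actual layout):

```shell
# Extract the LSTM model from the best Arabic traineddata (tessdata_best).
combine_tessdata -e ara.traineddata ara.lstm

# Fine-tune from the extracted model; ara.training_files.txt is a
# placeholder name for the list of .lstmf files, one path per line.
lstmtraining \
  --model_output finetune/ara \
  --continue_from ara.lstm \
  --traineddata ara.traineddata \
  --train_listfile ara.training_files.txt \
  --max_iterations 400
```

This is a command-line recipe, not runnable without a Tesseract training installation; the flags shown are the documented `lstmtraining` options.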
The transcription generated by the new model can be found below (Fine Tuned.txt): Fine Tuned.txt
The transcription generated by the original Arabic model can be found below (Arabic Trained Model.txt): Arabic Trained Model.txt
The fine tuned model can be found below (test1.traineddata): test1.traineddata.zip
I attempted to train from scratch using 4,000 text line images, but they weren't enough to make a difference, and training from scratch didn't seem logical given that your trained model already recalls more than 80% of my data.
A sample of my training data which includes the .box and .lstmf is attached below: training data.zip
Questions:
- Have you also tested with traineddata from the tessdata_fast repo? What accuracy do you get with it?
- Are the errors related to recognition of Arabic numbers?
The traineddata provided by Tesseract (both _fast and _best) gives very good accuracy (above 80%), but I wish to increase it to above 90%. The errors I wish to fix are character-based, not numeric. I think the issue is with the generated .box files; can you have a look and check if anything seems off? training data.zip
@jbreiden It seems to me that we are missing some key item regarding Arabic training. There have been multiple reports of users being unsuccessful in improving results with fine tuning.
The training text in langdata_lstm/ara is only 80 lines or so. Can you please check with Ray regarding the correct data to be used for fine-tuning Arabic and any pointers he may be able to provide.
Thanks!!!
An example box file generated by OCR-D Train has the same coordinates for every letter in each word (example below). Does this create an issue when training?
م 0 0 894 67 0
ب 0 0 894 67 0
ا 0 0 894 67 0
ر 0 0 894 67 0
0 0 894 67 0
ي 0 0 894 67 0
ت 0 0 894 67 0
ع 0 0 894 67 0
ه 0 0 894 67 0
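Identical coordinates per line are actually expected here: for line-based LSTM training, every character entry carries the bounding box of the whole text line. Each entry is six whitespace-separated fields, `<glyph> <x0> <y0> <x1> <y1> <page>`, which a plain POSIX shell can sanity-check (entry taken from the example above):

```shell
# Split one box-file entry on whitespace and count the fields.
# Six fields is the expected shape; shared coordinates are not an error.
entry='م 0 0 894 67 0'
set -- $entry
echo "fields: $#"    # prints: fields: 6
```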
@theraysmith
Below are a couple of iterations of the ALIGNED TRUTH and BEST OCR TEXT. I would like to understand the reason behind the difference.
Iteration 1738: ALIGNED TRUTH : اوااوووضواضاضااعاعع اللفنادادققدق وممطططالالب ااصحاحااب
Iteration 1738: BEST OCR TEXT : اامام للل لل
Any help would be greatly appreciated. @theraysmith @madhumurali2295 @Shreeshrii
See #735
This helps in understanding the debug procedure, but doesn't address my issue, which is with the results of the fine-tuning. Any help would be greatly appreciated. @amitdo
I used the GitHub project OCR-D Train to generate the .box and .lstmf files required for training
Do they handle bidi text?
The generated .box and .lstmf files, alongside the corresponding .tif images, are attached below. Can't we tell from the generated .box files whether bidi text is handled? @amitdo training data.zip
The chars in the box files need to be in visual order from left to right, but the chars in your box files are in logical order from right to left.
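For a line that contains only RTL characters, visual order is simply the logical order reversed; mixed-direction content (digits, Latin, punctuation) needs the full Unicode bidi algorithm, which is what fribidi implements. A crude illustration of the reordering, using an ASCII stand-in (`rev` from util-linux is wide-character aware on most systems, so it also works on Arabic text in a UTF-8 locale):

```shell
# Crude logical -> visual reordering: reverse each line's characters.
# This only holds for purely RTL lines; real data needs fribidi.
printf 'abc\n' | rev    # ASCII stand-in for an Arabic line; prints: cba
```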
@kba
The training text in langdata_lstm/ara is only 80 lines or so.
@Shreeshrii, Please report about this specific issue in: https://github.com/tesseract-ocr/langdata_lstm
@amitdo @Shreeshrii
I fixed the RTL issue using fribidi; my dataset is now in LTR (visual) order. I generated the necessary box files and lstmf files. How many text lines do I need to fine-tune the existing _best Arabic model? How many iterations should I run?
Makefile used is attached below: Makefile.zip
@jaddoughman - Are you able to get the desired accuracy now? I am facing the same issue, but with English. I saw your training data, and quality-wise my images are similar. Any help would be appreciated.
I'm facing the same issues. @jaddoughman @harshaneekhra, did you find a solution?
Performance improves when training with the option START_MODEL=ara.
As described in the tesstrain README:
START_MODEL Name of the model to continue from. Default: ''
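As a sketch, a tesstrain invocation that continues from the existing best Arabic model might look like this (MODEL_NAME and the TESSDATA path are placeholders; adjust to your own layout):

```shell
# Fine-tune from the best Arabic model rather than training from scratch.
# MODEL_NAME and TESSDATA below are placeholders for your setup.
make training \
  MODEL_NAME=ara_finetuned \
  START_MODEL=ara \
  TESSDATA=../tessdata_best \
  MAX_ITERATIONS=400
```

This is a configuration fragment for tesstrain's Makefile, not runnable outside a tesstrain checkout with training data in place.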