tesstrain
Report on RTL training with OCR_GS_Data for Arabic
Similar to @stweil's training for Fraktur, I am collecting information here about fine-tuning RTL training with OCR_GS_Data for Arabic. Some of this has already been reported in other threads.
OCR_GS_Data provides double-checked gold standard data for training and testing OCR engines for RTL languages. It includes the Arabic data used for "Important New Developments in Arabographic Optical Character Recognition (OCR)".
So far I have used only a subset of the datasets referred to in the above report: works #0, #1, #2, #5 and #6, since they are said to have a similar typeface. Together these have approximately 10000 single-line images with their transcriptions. (According to the report there should be approx. 5000 text lines; the images are provided at both high and low (200 dpi) resolution, which doubles the number.)
An initial comparison, tested only on work #0 (Buldan), is shown in a table in this post.
A second training run using all five datasets referenced above, with certain modifications, shows better results when evaluated on the same training data.
| Type | Rate |
|---|---|
| CER | 0.70 |
| WER | 1.71 |
| WER (order independent) | 1.65 |
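
For readers unfamiliar with the metrics in the table, here is a minimal sketch of how a character error rate (CER) and word error rate (WER) can be computed from edit distance. It is not the code of the evaluation tools used here, and reporting the rates as percentages of the reference length is an assumption. The function names are hypothetical, and the order-independent WER is not covered.

```python
def levenshtein(ref, hyp):
    """Edit distance between two sequences (insert, delete, substitute cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref, hyp):
    """Edit distance as a percentage of the reference length (assumed convention)."""
    if not ref:
        raise ValueError("reference must not be empty")
    return 100.0 * levenshtein(ref, hyp) / len(ref)


def cer(ref_text, hyp_text):
    """Character error rate: edit distance over characters."""
    return error_rate(list(ref_text), list(hyp_text))


def wer(ref_text, hyp_text):
    """Word error rate: edit distance over whitespace-separated tokens."""
    return error_rate(ref_text.split(), hyp_text.split())
```

Because the distance is computed on code points, the logical order of the transcription matters: a punctuation mark stored at the wrong end of an RTL line counts as errors even if the rendered line looks similar.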
Issues with groundtruth and images:
- Space at the beginning/end of some transcriptions (could lead to a hallucination effect).
- Some images are not tightly cropped, so using the bbox of the whole image does not match the actual bbox of the text.
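
The first issue can be checked mechanically. The following is a minimal sketch, assuming the transcriptions are single-line UTF-8 files named `*.gt.txt`; the function name `clean_ground_truth` and the dry-run default are hypothetical choices, not part of tesstrain.

```python
from pathlib import Path


def clean_ground_truth(root, pattern="*.gt.txt", dry_run=True):
    """Return files whose transcription has leading/trailing whitespace.

    With dry_run=False the files are rewritten without that whitespace.
    """
    changed = []
    for path in sorted(Path(root).rglob(pattern)):
        # Ignore the final line break; only other edge whitespace is an issue.
        line = path.read_text(encoding="utf-8").rstrip("\r\n")
        cleaned = line.strip()
        if cleaned != line:
            changed.append(path)
            if not dry_run:
                path.write_text(cleaned + "\n", encoding="utf-8")
    return changed
```

Running it once with `dry_run=True` lists the affected files before anything is modified.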
Issues with tesstrain Makefile:
- RTL text is not handled correctly by either `generate_line_box.py` or `generate_wordstr_box.py`. See comment.
PR https://github.com/tesseract-ocr/tesstrain/pull/127 proposes a new script to handle these.
Issues with tesseract and text2image:
- Need to use `--psm 13` for correct recognition.
- Need to use `-c page_separator=''`.
- text2image does not create correct charboxes for certain images (`Bad box coordinates in boxfile string! ح`). The resulting wordstrbox files for these images have more than 2 lines. Discard these images, otherwise training does not converge at all. This can be done as follows:
find /home/ubuntu/OCR_GS_Data/ara/ground-truth -type f -name '*Buldan*.box' -exec bash -c '[[ $(wc -l < "$1") -gt 1 ]] && echo "$1"' _ '{}' \; > err.txt
sed -i -e 's/^/rm /' err.txt
sed -i -e 's/box/*/' err.txt
bash err.txt
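
The same cleanup can be sketched in Python with a dry run, which makes it easier to review what would be deleted first. This is an illustrative equivalent of the shell commands above, not part of tesstrain: the function names are hypothetical, the threshold mirrors the `-gt 1` test, and it assumes that the image, transcription and box file share a base name.

```python
import glob
from pathlib import Path


def count_newlines(path):
    """Same count as `wc -l`: number of newline characters in the file."""
    return Path(path).read_bytes().count(b"\n")


def find_bad_box_files(root, pattern="*Buldan*.box", max_newlines=1):
    """Box files with more newlines than expected for a single text line."""
    return [p for p in sorted(Path(root).rglob(pattern))
            if count_newlines(p) > max_newlines]


def siblings(box_path):
    """All files sharing the box file's base name (image, transcription, box)."""
    prefix = glob.escape(str(Path(box_path).with_suffix("")))
    return sorted(Path(p) for p in glob.glob(prefix + ".*"))


def discard(root, pattern="*Buldan*.box", dry_run=True):
    """List the files to remove; delete them only when dry_run is False."""
    doomed = [f for box in find_bad_box_files(root, pattern)
              for f in siblings(box)]
    if not dry_run:
        for f in doomed:
            f.unlink()
    return doomed
```

Calling `discard(root)` only returns the list; `discard(root, dry_run=False)` actually removes the files.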
Issues with ocr evaluation tools:
For large texts, the accuracy reports are not generated. The error is `accuracy: text stream is too long`. See issue.
An alternative is to use https://github.com/impactcentre/ocrevalUAtion
Sample report : eval-Buldan-araKraken.html.txt
See https://github.com/Shreeshrii/tesstrain-arabic-GS for current training data and reports
@Shreeshrii I wonder if/how the ordering of punctuation chars affects training.
Given a line image like https://github.com/OpenITI/OCR_GS_Data/blob/master/ara/book_IbnFaqihHamadhani.Buldan/7_final_b/a_000716.png, compared with its transcription (https://github.com/OpenITI/OCR_GS_Data/blob/master/ara/book_IbnFaqihHamadhani.Buldan/7_final/a_000004.gt.txt), it seems to me the double colon is not in the proper position: the transcription places it rightmost, but in the image it is at the left end.
I've seen this in many text lines, also in https://github.com/OpenITI/TrainingData/tree/master/JSTORArabic, which is used at https://github.com/tesseract-ocr/tesstrain/issues/213
Punctuation marks are an open issue. Someone with knowledge of Arabic and bidi will have to look at it and suggest a solution.
generate_wordstr_box.py uses bidi but leaves punctuation as is.
Maybe for now it would be convenient to eliminate punctuation from the training data? Our focus is on letters.
The pull request https://github.com/tesseract-ocr/tesstrain/pull/205 tries to sanitize this by stripping any RTL Unicode direction marks, which otherwise make it tricky to step through the text char by char with the arrow keys, especially with punctuation and other non-Arabic characters mixed in. I guess that punctuation, unlike regular Arabic letters, is treated as a char with "normal" LTR reading order, like any non-Arabic digits (Latin, Indic or whatever), and is therefore placed at the right end.
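
To make the idea of stripping direction marks concrete, here is a minimal sketch using only the Python standard library. It is not taken from the pull request; the set of characters and the helper names are assumptions. The second helper prints the Unicode bidirectional class of each character, which shows for example that the colon is classified as a weak "common separator" (`CS`) rather than as a strongly left-to-right character.

```python
import unicodedata

# Directional formatting characters of the Unicode Bidirectional Algorithm
# (UAX #9). Which of these occur in the ground truth is an assumption.
BIDI_CONTROLS = {
    "\u200e",  # LEFT-TO-RIGHT MARK
    "\u200f",  # RIGHT-TO-LEFT MARK
    "\u061c",  # ARABIC LETTER MARK
    "\u202a", "\u202b", "\u202c", "\u202d", "\u202e",  # LRE, RLE, PDF, LRO, RLO
    "\u2066", "\u2067", "\u2068", "\u2069",            # LRI, RLI, FSI, PDI
}


def strip_bidi_controls(text):
    """Remove invisible direction marks, leaving all visible characters."""
    return "".join(ch for ch in text if ch not in BIDI_CONTROLS)


def bidi_classes(text):
    """Pair each character with its Unicode bidirectional class."""
    return [(ch, unicodedata.bidirectional(ch)) for ch in text]
```

For instance, `bidi_classes("\u0628:")` pairs the Arabic letter with class `AL` and the colon with class `CS`.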
@Shreeshrii, what is the final status of your efforts regarding fine-tuning with the complete GS_Data set? Is the link above, https://github.com/Shreeshrii/tesstrain-arabic-GS, no longer available?