
Report on RTL training with OCR_GS_Data for Arabic

Open Shreeshrii opened this issue 5 years ago • 11 comments

Similar to @stweil's training for Fraktur, I am collecting info here on fine-tuning for RTL text with OCR_GS_Data for Arabic. Some of this has already been reported in other threads.

Shreeshrii avatar Dec 01 '19 05:12 Shreeshrii

OCR_GS_Data has Double-checked Gold Standard Data for Training and Testing OCR Engines for RTL languages. It includes Arabic data used for Important New Developments in Arabographic Optical Character Recognition (OCR).

So far I have used only a subset of the datasets referred to in the above report, works #0, #1, #2, #5 and #6, since they are said to have a similar typeface. These have approximately 10000 single-line images with their transcriptions. (As per the report it should be approx. 5000 text lines. The images are provided at both high and low (200 dpi) resolution, which doubles the number.)

Shreeshrii avatar Dec 01 '19 05:12 Shreeshrii

Initial test comparison, with testing only for work#0 (Buldan) is shown in a table in this post.

A second training run using all five datasets referenced above, with certain modifications, shows better results when evaluated on the same training data.

Type Rate
CER 0.70
WER 1.71
WER (order independent) 1.65
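
For readers unfamiliar with these metrics, here is a minimal sketch of how CER and WER are commonly computed from edit distance. It is not necessarily how the evaluation tool used for the table computes them. The function names are hypothetical, the order-independent variant is assumed to be a bag-of-words comparison, and reporting the rates as percentages is an assumption about the table's units.

```python
from collections import Counter


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate in percent, relative to the reference length."""
    return 100.0 * edit_distance(ref, hyp) / len(ref)


def wer(ref, hyp):
    """Word error rate in percent, on whitespace-separated tokens."""
    ref_words = ref.split()
    return 100.0 * edit_distance(ref_words, hyp.split()) / len(ref_words)


def wer_order_independent(ref, hyp):
    """Assumed bag-of-words variant: word order is ignored."""
    ref_counts, hyp_counts = Counter(ref.split()), Counter(hyp.split())
    missing = sum((ref_counts - hyp_counts).values())
    extra = sum((hyp_counts - ref_counts).values())
    return 100.0 * max(missing, extra) / sum(ref_counts.values())
```

Under this definition the order-independent rate can never exceed the plain WER, which is consistent with the table above.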

Shreeshrii avatar Dec 01 '19 06:12 Shreeshrii

Issues with groundtruth and images:

Shreeshrii avatar Dec 01 '19 07:12 Shreeshrii

Issues with tesstrain Makefile:

  • RTL text not handled correctly either by generate_line_box.py or by generate_wordstr_box.py. See comment.

PR https://github.com/tesseract-ocr/tesstrain/pull/127 proposes a new script to handle these.

Shreeshrii avatar Dec 01 '19 07:12 Shreeshrii

Issues with tesseract and text2image:

  • Need to use --psm 13 for correct recognition.
  • Need to use -c page_separator=''
  • text2image does not create correct charboxes for certain images (error: Bad box coordinates in boxfile string! ح). The resulting wordstrbox files for these images have more than 2 lines. Discard these images, otherwise training does not converge at all. This can be done as follows:
find /home/ubuntu/OCR_GS_Data/ara/ground-truth -type f -name '*Buldan*.box' -exec bash -c '[[ $(wc -l < "$1") -gt 1 ]] && echo "$1"' _ '{}' \;  > err.txt
sed -i -e 's/^/rm /' err.txt
sed -i -e 's/box/*/' err.txt
bash err.txt
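
The first two points in the list above could be scripted roughly as follows. This is only a sketch: the function names and the default model name `ara` are assumptions, and a fine-tuned model would need its own name (and possibly a `--tessdata-dir`) instead.

```python
import subprocess


def build_tesseract_cmd(image_path, lang="ara"):
    """Build a tesseract command line for a single text-line image."""
    return [
        "tesseract", image_path, "stdout",
        "-l", lang,                 # assumed model name
        "--psm", "13",              # raw line mode, as recommended above
        "-c", "page_separator=",    # same as -c page_separator='' in a shell
    ]


def recognize_line(image_path, lang="ara"):
    """Run tesseract on one line image and return the recognized text."""
    result = subprocess.run(
        build_tesseract_cmd(image_path, lang),
        capture_output=True, encoding="utf-8", check=True,
    )
    return result.stdout.strip()
```

Keeping the command construction separate from the call makes it easy to check the exact flags without having tesseract installed.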

Shreeshrii avatar Dec 01 '19 07:12 Shreeshrii

Issues with ocr evaluation tools:

For large texts, the accuracy reports are not generated. The error is: accuracy: text stream is too long. See issue.

An alternative is to use https://github.com/impactcentre/ocrevalUAtion

Sample report : eval-Buldan-araKraken.html.txt

Shreeshrii avatar Dec 01 '19 07:12 Shreeshrii

See https://github.com/Shreeshrii/tesstrain-arabic-GS for current training data and reports

Shreeshrii avatar Dec 19 '19 18:12 Shreeshrii

@Shreeshrii I wonder if/how the ordering of punctuation chars affects training.

Given a line image like https://github.com/OpenITI/OCR_GS_Data/blob/master/ara/book_IbnFaqihHamadhani.Buldan/7_final_b/a_000716.png, compared with its transcription (https://github.com/OpenITI/OCR_GS_Data/blob/master/ara/book_IbnFaqihHamadhani.Buldan/7_final/a_000004.gt.txt), it seems to me the double colon is not in the proper position: the transcription places it rightmost, but in the image it is at the left end.

I've seen this in many text lines, also in https://github.com/OpenITI/TrainingData/tree/master/JSTORArabic used at https://github.com/tesseract-ocr/tesstrain/issues/213

M3ssman avatar Dec 16 '20 08:12 M3ssman

Punctuation marks are an open issue. Someone with knowledge of Arabic and bidi will have to look at it and suggest a solution.

generate_wordstr_box.py uses bidi but leaves punctuation as is.
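
As a starting point for whoever looks into this, the standard library can show how Unicode classifies each character for bidi purposes. This is a minimal sketch with a hypothetical helper name. It does not reorder anything and is not part of generate_wordstr_box.py; it only lists the characters in a transcription that are not strongly directional, which are the ones whose visual position depends on context.

```python
import unicodedata

# Bidi classes with a fixed, strong direction:
# L = left-to-right, R = right-to-left, AL = Arabic letter.
STRONG_CLASSES = {"L", "R", "AL"}


def non_strong_chars(text):
    """Return (char, bidi_class) pairs for chars without strong direction."""
    return [
        (ch, unicodedata.bidirectional(ch))
        for ch in text
        if unicodedata.bidirectional(ch) not in STRONG_CLASSES
    ]
```

For example, an ASCII colon is reported as class CS (common separator) and a space as WS, while Arabic letters are class AL and are therefore not listed.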

Shreeshrii avatar Dec 16 '20 08:12 Shreeshrii

Maybe for now it would be convenient to eliminate punctuation from the training data? Our focus is on letters.

PR https://github.com/tesseract-ocr/tesstrain/pull/205 tries to sanitize this by stripping any RTL Unicode direction marks, which otherwise make it tricky to step through the text char-by-char with the arrow keys, especially with punctuation and other non-Arabic characters mixed in. I guess punctuation, unlike the usual Arabic letters, is treated as a char with "normal" LTR reading order, like any non-Arabic digits (Latin, Indic or whatever), and is therefore placed at the right end.
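
To illustrate what stripping direction marks can look like, here is a minimal sketch. It is not the code from the PR, and the set of characters is an assumption: it covers the Unicode explicit directional formatting characters, which may be more or fewer than the PR actually removes.

```python
# Explicit bidi formatting characters (assumed set, see note above).
DIRECTION_MARKS = {
    "\u200e", "\u200f", "\u061c",                      # LRM, RLM, ALM
    "\u202a", "\u202b", "\u202c", "\u202d", "\u202e",  # LRE, RLE, PDF, LRO, RLO
    "\u2066", "\u2067", "\u2068", "\u2069",            # LRI, RLI, FSI, PDI
}


def strip_direction_marks(text):
    """Remove invisible bidi control characters, keeping all other text."""
    return "".join(ch for ch in text if ch not in DIRECTION_MARKS)
```

Letters, digits, punctuation and whitespace are left untouched, so the logical order of the remaining characters does not change.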

M3ssman avatar Dec 16 '20 10:12 M3ssman

@Shreeshrii, what is the final status of your efforts regarding fine-tuning with the complete GS_Data set? The link above, https://github.com/Shreeshrii/tesstrain-arabic-GS, is no longer available?

MihoMahi avatar Mar 07 '22 11:03 MihoMahi