tesseract Arabic Numbers

Environment

Tesseract Version: Current main repository (4.00.00alpha)
Platform: Windows7 32-bit

Current Behavior:

It recognizes Arabic characters but cannot recognize Arabic numbers (ارقام عربى 0123456789). I tried tessdata, tessdata_best, and tessdata_fast.

Expected Behavior:

Suggested Fix:

Oct 30 '17 11:10 ahmed-tea

Did you try Arabic.traineddata?

Oct 30 '17 13:10 amitdo

@amitdo yes

Oct 30 '17 13:10 ahmed-tea

It recognizes the characters (about 80%, including the Latin numbers), but it does not recognize the Arabic numbers inside the red rectangle (the original image has no red rectangle).
my-national-identity-card-1-728

I tried other pictures containing numbers only and got no numbers at all.
arabnum

Oct 31 '17 08:10 ahmed-tea

@theraysmith has not updated the repositories with changes to handle all these issues. Hence, you should not expect them to be fixed.

Oct 31 '17 11:10 Shreeshrii

@Shreeshrii @theraysmith Are there changes that handle all these issues but have not yet been pushed to the repositories, or is there no fix?

Oct 31 '17 11:10 ahmed-tea

I think Ray was planning to do new training to handle all these cases. But there has been no update from him since then. Based on past patterns, I would guess that he will make some updates to the project before year end!

Oct 31 '17 11:10 Shreeshrii

Definition

AEN Arabic Eastern Numbers {٠١٢٣٤٥٦٧٨٩}
AWN Arabic Western Numbers {0123456789}

I generated an experimental data file that recognizes AEN only. The output of Tesseract OCR will be in the form of AWN.

https://github.com/ahmed-tea/tessdata_Arabic_Numbers

@Shreeshrii @theraysmith

Nov 01 '17 16:11 ahmed-tea

Thanks for sharing the traineddata. Please let us know the success rate of OCR when using it.

Do you combine it with the Arabic traineddata, using -l Ara+, to get correct text plus Arabic numbers?

Nov 02 '17 03:11 Shreeshrii

The success rate for the pictures above is 100% (numbers only), but in general it depends on the picture quality.

Combining with ara:
- current tesseract main repository: gives an error (mgr->GetComponent(TESSDATA_INTTEMP, &fp):Error:Assert failed:in file classify\adaptmatch.cpp, line 537)
- tesseract build by UB Mannheim: gives numbers only
- best and fast (ara and Arabic): not applicable, because they are used for LSTM only, so it gives numbers only

@Shreeshrii

Nov 02 '17 15:11 ahmed-tea

@Shreeshrii sorry for the question, but how do I combine the new tessdata_Arabic_Numbers with the current one? I copied ara_number.traineddata into the tessdata dir and then used this command: tesseract -l ara_number+ara image.tif out.txt

but it doesn't work.

Nov 23 '17 06:11 Fahad-Alsaidi

@Fahad-Alsaidi You can't combine it with ara

Nov 26 '17 08:11 ahmed-tea

Is there an ara.traineddata which has been tested and verified to recognize Arabic Eastern numbers? Please share a link if available. @ahmed-tea the following link returns an error; is there any alternative? https://github.com/ahmed-tea/tessdata_Arabic_Numbers

Many thanks

May 25 '18 13:05 raminas81

@raminas81 I think the error occurs because it works only with OEM_TESSERACT_ONLY (the old engine). It can't be combined with ara.traineddata.

Jun 21 '18 11:06 ahmed-tea
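
A minimal sketch of what selecting the old engine might look like from a script, assuming a Tesseract 4 build that still includes the legacy engine and that ara_number.traineddata sits in the tessdata dir. The helper name build_legacy_command is hypothetical; --oem 0 is the command-line counterpart of OEM_TESSERACT_ONLY.

```python
import shlex


def build_legacy_command(image_path, output_base, lang="ara_number"):
    """Return a tesseract argv list that forces the legacy (non-LSTM) engine.

    Hypothetical helper for illustration; it only builds the command.
    """
    # --oem 0 asks for the legacy engine only, which is what a
    # legacy-only traineddata file such as ara_number needs.
    return ["tesseract", image_path, output_base, "-l", lang, "--oem", "0"]


if __name__ == "__main__":
    cmd = build_legacy_command("image.tif", "out")
    # Print the command instead of running it; pass `cmd` to
    # subprocess.run(cmd) once tesseract and the data file are installed.
    print(shlex.join(cmd))
```

Only one language is passed here on purpose, since the thread reports that combining it with ara does not work.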

Hi, I'm trying to recognize Arabic numbers using tesseract 3.04. The results using the traineddata from https://github.com/tesseract-ocr/tessdata/tree/3.04.00 (with the cube files, of course) are very random, and most of the recognized digits are wrong. Is there any other traineddata file for numbers only that works with tesseract 3.04?

One more thing, and I would be very grateful: if I want to include a whitelist for Arabic recognition, how can this be done? When I use English recognition I do it as below

thank you so much

Jul 31 '18 03:07 AbdelsalamHaa
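
One possible shape for an Arabic-digit whitelist, as a sketch only: it builds the ten Arabic-Indic digits (U+0660 to U+0669) and passes them through the tessedit_char_whitelist config variable with -c. Whether the whitelist is honored depends on the Tesseract version and engine in use, so this is an assumption to verify rather than a confirmed fix; the helper name whitelist_args is hypothetical.

```python
# The ten Arabic-Indic digits, built from their code points so the
# source stays readable in editors that reorder right-to-left text.
ARABIC_INDIC_DIGITS = "".join(chr(cp) for cp in range(0x0660, 0x066A))


def whitelist_args(chars=ARABIC_INDIC_DIGITS):
    """Return extra tesseract arguments that restrict output to `chars`.

    Hypothetical helper; tessedit_char_whitelist is an existing Tesseract
    config variable, but support varies by version and engine.
    """
    return ["-c", "tessedit_char_whitelist=" + chars]


if __name__ == "__main__":
    cmd = ["tesseract", "image.tif", "out", "-l", "ara"] + whitelist_args()
    print(cmd)
```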

@ahmed-tea Hi, I have used your Arabic number trained file for tesseract 4 and it's very good. I'm trying to make the same file for tesseract 3.04. I managed to do it, but the results are returned in Arabic digits, unlike your case where the numbers are returned in English digits. I want my results returned in English digits because, when they are returned in Arabic, the order of the numbers is often flipped due to the language running from right to left. I hope you can help with this. Thank you so much in advance.

Aug 10 '18 01:08 AbdelsalamHaa
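
If retraining is not an option, the digits could also be converted after recognition. A minimal sketch, assuming the OCR output is already a Python string; the function name to_western_digits is hypothetical, and it only swaps digit characters, so it does not by itself repair any reordering caused by right-to-left layout.

```python
# Map each Arabic-Indic digit (U+0660..U+0669) to its ASCII counterpart.
_TO_WESTERN = {0x0660 + i: str(i) for i in range(10)}


def to_western_digits(text):
    """Replace Arabic-Indic digits in `text` with 0-9; leave the rest alone."""
    return text.translate(_TO_WESTERN)


if __name__ == "__main__":
    sample = "\u0661\u0662\u0663"  # Arabic-Indic one, two, three
    print(to_western_digits(sample))
```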

@AbdelsalamHaa Use jTessBoxEditor https://github.com/nguyenq/jTessBoxEditor by @nguyenq. After generating the boxes and before training, readjust the char corresponding to each box. The tool will not accept Arabic numbers, so you have to enter the English number. The OCR will then read the Arabic number, but the output will be the English number.

Aug 16 '18 14:08 ahmed-tea

You can use your Arabic input method to enter Arabic digits, or use the built-in conversion tool. In the Character textbox, for example, enter U+0668 and click the adjacent button twice or press the Enter key.

Aug 16 '18 22:08 nguyenq
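
For reference, the U+0668 notation names a single character by its code point. The sketch below shows the same conversion in Python, independent of jTessBoxEditor; char_from_codepoint is a hypothetical helper, not part of that tool.

```python
import unicodedata


def char_from_codepoint(label):
    """Turn a label such as 'U+0668' into the character it names."""
    if not label.upper().startswith("U+"):
        raise ValueError("expected a label of the form U+XXXX")
    return chr(int(label[2:], 16))


if __name__ == "__main__":
    ch = char_from_codepoint("U+0668")
    # Show the character's official Unicode name and its numeric value.
    print(unicodedata.name(ch), unicodedata.digit(ch))
```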

@AbdelsalamHaa
1- Make a jpg image of Arabic numbers
2- In the Trainer tab, select the jpg image as training data
3- Set the language to the name you want for the tesseract data file
4- Select Make Box File Only, then run
5- In the Box Editor, open the jpg image
6- For each box in the image you will find the corresponding character in the char column (it will be the wrong character)
7- Readjust each char to match its box (it will not accept Arabic numbers, so you have to enter English numbers)
8- Save
9- Go to the Trainer tab, select Train with Existing Box, and run

@nguyenq I tried your method. The output of OCR recognition is the Unicode, not the number.

Aug 19 '18 12:08 ahmed-tea
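
If a box file already carries Arabic-Indic digit labels (for example, entered through an Arabic input method), step 7 could in principle be scripted instead of edited by hand. This is a sketch under that assumption, and it also assumes the usual Tesseract 3 box line layout of "symbol left bottom right top page"; relabel_box_lines is a hypothetical helper.

```python
# Arabic-Indic digits (U+0660..U+0669) mapped to ASCII digits.
_DIGIT_MAP = {0x0660 + i: str(i) for i in range(10)}


def relabel_box_lines(lines):
    """Rewrite the symbol column of box-file lines to ASCII digits.

    Lines that do not have exactly six space-separated fields are
    passed through unchanged.
    """
    result = []
    for line in lines:
        parts = line.rstrip("\n").split(" ")
        if len(parts) == 6:
            # Only the first field is the label; coordinates stay as-is.
            parts[0] = parts[0].translate(_DIGIT_MAP)
        result.append(" ".join(parts))
    return result


if __name__ == "__main__":
    boxes = ["\u0663 10 20 30 40 0", "\u0667 35 20 55 40 0"]
    for row in relabel_box_lines(boxes):
        print(row)
```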

@Ahmed-tea Thanks for sharing the training file. I've downloaded it but do not know how to add it to the tesseract training files. Can you share a guide?

Sep 23 '18 23:09 WaelKamel116

@ahmed-tea : is this issue solved?

Sep 29 '18 11:09 zdenop

@ahmed-tea did you succeed in combining Arabic numbers and Arabic words together?

Nov 23 '18 01:11 AndreAhmed

@Shreeshrii hello I have some questions:

What is the best tool to train the engine on a language?
Is there a minimum size for the training dataset or image?

Dec 15 '18 14:12 salemalbadawi

We face a problem when we train the OCR on Indian numbers ( ١٢٣٤٥٦٧٨٩٠ ). We also get a bad result when we try to read an image of a paragraph with a mix of Arabic and Indian numbers. Any suggestions?

@Shreeshrii @zdenop

Dec 16 '18 07:12 salemalbadawi

I have tried different types of fine tuning for adding the numbers but have not had much success. I think that the open source tesseract is missing some key component related to Arabic. We will have to wait till @theraysmith or @jbreiden can investigate and fix this.

Dec 16 '18 12:12 Shreeshrii

Reports with latest versions:

Arabic-Indic numbers incorrectly recognized #2864

Some Arabic-Indic numbers are being reversed #2897

Feb 29 '20 14:02 Shreeshrii

https://github.com/Shreeshrii/tessdata_arabic this link may help, it helped me a lot.

May 21 '20 04:05 BasmaFahmy

Is there any Indic-Arabic numeral (only) dataset for training tesseract? images+ground truth

May 25 '20 21:05 sam-kurdi

Definition
* **AEN** Arabic Eastern Numbers {٠١٢٣٤٥٦٧٨٩}

* **AWN** Arabic Western Numbers {0123456789}

I generated an experimental data file that recognizes AEN only. The output of Tesseract OCR will be in the form of AWN.

https://github.com/ahmed-tea/tessdata_Arabic_Numbers

@Shreeshrii @theraysmith

The link above is broken: the link text is correct, but the URL embedded in the link is invalid.

Aug 03 '20 14:08 MahmoudMabrok

https://github.com/jishakrishnan/pytrsseract-arabic - try this out

Mar 01 '21 12:03 jishakrishnan

Hi @Shreeshrii.

In this example, the date field تار يخ السداد ("payment date") appears like this:

تار يخ السداد : 48./١./١٠؟١٠؟‏

Any suggestions? Thanks

Nov 25 '21 12:11 engahmed1190
