Arabic Numbers

Open ahmed-tea opened this issue 6 years ago • 40 comments

Environment

  • Tesseract Version: Current main repository (4.00.00alpha)
  • Platform: Windows7 32-bit

Current Behavior:

It recognizes Arabic characters but cannot recognize Arabic numbers (ارقام عربى 0123456789). I tried tessdata, tessdata_best, and tessdata_fast.

Expected Behavior:

Suggested Fix:

ahmed-tea avatar Oct 30 '17 11:10 ahmed-tea

Did you try Arabic.traineddata?

amitdo avatar Oct 30 '17 13:10 amitdo

@amitdo yes

ahmed-tea avatar Oct 30 '17 13:10 ahmed-tea

It recognizes the characters (about 80%, including the Latin numbers), but it does not recognize the Arabic numbers inside the red rectangle (the original image has no red rectangle).
my-national-identity-card-1-728

I tried other pictures containing only numbers and got no numbers at all: arabnum page0001

ahmed-tea avatar Oct 31 '17 08:10 ahmed-tea

@theraysmith has not updated the repositories with changes to handle all these issues. Hence, you should not expect them to be fixed.

Shreeshrii avatar Oct 31 '17 11:10 Shreeshrii

@Shreeshrii @theraysmith Are there changes that handle all these issues but have not been pushed to the repositories yet, or is there no fix?

ahmed-tea avatar Oct 31 '17 11:10 ahmed-tea

I think Ray was planning to do new training to handle all these cases, but there has been no update from him since then. Based on past patterns, I would guess that he will make some updates to the project before year end!

Shreeshrii avatar Oct 31 '17 11:10 Shreeshrii

Definition

  • AEN Arabic Eastern Numbers {٠١٢٣٤٥٦٧٨٩}
  • AWN Arabic Western Numbers {0123456789}

I generated an experimental data file that recognizes AEN only. The output of Tesseract OCR will be in the form of AWN.

https://github.com/ahmed-tea/tessdata_Arabic_Numbers
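
The AEN to AWN mapping defined above can also be illustrated outside of Tesseract. This is only a minimal Python sketch, assuming the recognized text is already available as a string; the function name `aen_to_awn` is hypothetical and is not part of the linked data file.

```python
# Eastern Arabic (Arabic-Indic) digits occupy U+0660..U+0669.
AEN = "".join(chr(0x0660 + i) for i in range(10))
AWN = "0123456789"

# Translation table mapping each AEN digit to the AWN digit with the same value.
_AEN_TO_AWN = str.maketrans(AEN, AWN)


def aen_to_awn(text: str) -> str:
    """Replace every Eastern Arabic digit in text with its Western counterpart."""
    return text.translate(_AEN_TO_AWN)
```

Characters other than the ten AEN digits pass through unchanged, so the function can be applied to mixed text.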

@Shreeshrii @theraysmith

ahmed-tea avatar Nov 01 '17 16:11 ahmed-tea

Thanks for sharing the traineddata. Please let us know the success rate of OCR when using it.

Did you combine it with the Arabic traineddata, using -l Ara+, to get correct text plus Arabic numbers?

Shreeshrii avatar Nov 02 '17 03:11 Shreeshrii

The success rate for the pictures above is 100% (numbers only), but in general it depends on the picture quality.

Combining with ara:

  • current tesseract main repository: gives an error (mgr->GetComponent(TESSDATA_INTTEMP, &fp):Error:Assert failed:in file classify\adaptmatch.cpp, line 537)
  • tesseract build by UB Mannheim: gives numbers only
  • best and fast (ara and Arabic): not applicable, because they are used for LSTM only, so it gives numbers only

@Shreeshrii

ahmed-tea avatar Nov 02 '17 15:11 ahmed-tea

@Shreeshrii Sorry for the question, but how do I combine the new tessdata_Arabic_Numbers with the current one? I copied ara_number.traineddata into the tessdata directory and then used this command: tesseract -l ara_number+ara image.tif out.txt

but it doesn't work.

Fahad-Alsaidi avatar Nov 23 '17 06:11 Fahad-Alsaidi

@Fahad-Alsaidi You can't combine it with ara

ahmed-tea avatar Nov 26 '17 08:11 ahmed-tea

Is there an ara.traineddata that has been tested and verified to recognize Arabic Eastern numbers? Please share a link if available. @ahmed-tea the following link returns an error; is there any alternative? https://github.com/ahmed-tea/tessdata_Arabic_Numbers

Many thanks

raminas81 avatar May 25 '18 13:05 raminas81

@raminas81 I think the error occurs because it works only with OEM_TESSERACT_ONLY (the old engine). It can't be combined with ara.traineddata.
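
For readers who want to try the legacy-engine route mentioned above, here is a rough Python sketch that builds a Tesseract 4 command line with `--oem 0` (legacy engine only). It assumes the file is installed as `ara_number.traineddata`, as in the earlier comment; the helper names are hypothetical, and whether the run succeeds depends on the installed build.

```python
import subprocess


def legacy_ocr_command(image_path, output_base, lang="ara_number"):
    """Build a Tesseract command line that forces the legacy engine (--oem 0)."""
    return ["tesseract", image_path, output_base, "-l", lang, "--oem", "0"]


def run_legacy_ocr(image_path, output_base, lang="ara_number"):
    """Run the command; raises CalledProcessError if Tesseract exits with an error."""
    subprocess.run(legacy_ocr_command(image_path, output_base, lang), check=True)
```

Note that the language is passed alone rather than as `ara_number+ara`, since the thread reports that the two cannot be combined.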

ahmed-tea avatar Jun 21 '18 11:06 ahmed-tea

Hi, I'm trying to recognize Arabic numbers using Tesseract 3.04. The results using the traineddata from https://github.com/tesseract-ocr/tessdata/tree/3.04.00 (with the cube files, of course) are very random, and most of the recognized digits are wrong. Is there any other traineddata file for numbers only that works with Tesseract 3.04?

One more thing, and I would be very grateful: if I want to include a whitelist for Arabic recognition, how can this be done? When I use English recognition I do it as in the image below.
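
One possible way to express such a whitelist is through the `tessedit_char_whitelist` config variable, passed with `-c`. The Python sketch below only assembles the arguments; the helper name `whitelist_args` is hypothetical, and whether the whitelist is honoured depends on the Tesseract version and engine in use, so treat it as an assumption to verify rather than a confirmed fix.

```python
# Arabic-Indic digits U+0660..U+0669, built from code points to avoid typing them.
ARABIC_INDIC_DIGITS = "".join(chr(0x0660 + i) for i in range(10))


def whitelist_args(chars):
    """Return the command-line arguments that restrict output to the given characters."""
    return ["-c", "tessedit_char_whitelist=" + chars]


# Example command line for an Arabic digits-only whitelist (not executed here).
command = ["tesseract", "image.tif", "out", "-l", "ara"] + whitelist_args(ARABIC_INDIC_DIGITS)
```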

thank you so much

AbdelsalamHaa avatar Jul 31 '18 03:07 AbdelsalamHaa

@ahmed-tea Hi, I have used your Arabic number trained file for Tesseract 4 and it's very good. I'm trying to build the same file for Tesseract 3.04. I managed to do it, but the results are returned in Arabic as well, unlike your case where the numbers are returned in English. I want my results to be returned in English because the order of the numbers often gets flipped when the results are returned in Arabic, since the language runs from right to left. I hope you can help with this. Thank you so much in advance.

AbdelsalamHaa avatar Aug 10 '18 01:08 AbdelsalamHaa

@AbdelsalamHaa Use jTessBoxEditor https://github.com/nguyenq/jTessBoxEditor by @nguyenq. After generating the boxes and before training, readjust the character corresponding to each box. The tool will not accept Arabic numbers, so you have to enter the English numbers. The OCR will then read the Arabic numbers, but the output will be English numbers.

ahmed-tea avatar Aug 16 '18 14:08 ahmed-tea

You can use your Arabic input method to enter Arabic digits, or use the built-in conversion tool. In the Character textbox, for example, enter U+0668 and click the adjacent button twice or press the Enter key.
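
To see which character a code such as U+0668 stands for, the conversion can be reproduced in a few lines of Python. This is just an illustrative sketch using the standard `unicodedata` module; the helper name `codepoint_to_char` is hypothetical and unrelated to jTessBoxEditor's own implementation.

```python
import unicodedata


def codepoint_to_char(code: str) -> str:
    """Turn a string like 'U+0668' into the character it denotes."""
    hex_part = code[2:] if code.upper().startswith("U+") else code
    return chr(int(hex_part, 16))


eight = codepoint_to_char("U+0668")
print(unicodedata.name(eight))   # official Unicode name of the character
print(unicodedata.digit(eight))  # its numeric digit value
```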

nguyenq avatar Aug 16 '18 22:08 nguyenq

@AbdelsalamHaa

  1. Make a jpg image of Arabic numbers.
  2. In the Trainer tab, select the jpg image as training data.
  3. Set the language to the name you want for the Tesseract data file.
  4. Select Make Box File Only, then run.
  5. In the Box Editor, open the jpg image.
  6. For each box in the image you will find the corresponding character in the Char column (it will be the wrong character).
  7. Readjust each character with respect to its box (it will not accept Arabic numbers, so you have to enter English numbers).
  8. Save.
  9. Go to the Trainer tab, select Train with Existing Box, and run.

@nguyenq I tried your method. The output of OCR recognition is the Unicode code point, not the number.
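
The relabeling in step 7 could in principle also be done on the box file text instead of by hand. The Python sketch below assumes the usual Tesseract 3 box line layout `<symbol> <left> <bottom> <right> <top> <page>` and a hypothetical helper name `westernize_box_lines`; it is not a feature of jTessBoxEditor.

```python
# Map Arabic-Indic digit code points U+0660..U+0669 to "0".."9".
_TO_WESTERN = {0x0660 + i: str(i) for i in range(10)}


def westernize_box_lines(lines):
    """Rewrite only the symbol column of each box line, leaving coordinates untouched."""
    result = []
    for line in lines:
        # The symbol is everything before the first space; the rest is coordinates.
        symbol, sep, rest = line.partition(" ")
        result.append(symbol.translate(_TO_WESTERN) + sep + rest)
    return result
```

Applied to the lines of a box file before the Train with Existing Box step, this would label each Arabic-Indic digit box with the corresponding Western digit.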

ahmed-tea avatar Aug 19 '18 12:08 ahmed-tea

@Ahmed-tea Thanks for sharing the training file. I've downloaded it but do not know how to add it to the Tesseract training files. Can you share a guide?

WaelKamel116 avatar Sep 23 '18 23:09 WaelKamel116

@ahmed-tea : is this issue solved?

zdenop avatar Sep 29 '18 11:09 zdenop

@ahmed-tea Did you succeed in combining Arabic numbers and Arabic words together?

AndreAhmed avatar Nov 23 '18 01:11 AndreAhmed

@Shreeshrii hello I have some questions:

  1. What is the best tool for training the engine on a language?
  2. Is there a minimum size for the training dataset or images?

salemalbadawi avatar Dec 15 '18 14:12 salemalbadawi

We face a problem when we train the OCR on Indian numbers ( ١٢٣٤٥٦٧٨٩٠ ). We also get bad results when we try to read an image containing a paragraph with a mix of Arabic and Indian numbers. Any suggestions?

@Shreeshrii @zdenop

salemalbadawi avatar Dec 16 '18 07:12 salemalbadawi

I have tried different types of fine tuning for adding the numbers but have not had much success. I think that the open source tesseract is missing some key component related to Arabic. We will have to wait till @theraysmith or @jbreiden can investigate and fix this.

Shreeshrii avatar Dec 16 '18 12:12 Shreeshrii

Reports with latest versions:

Arabic-Indic numbers incorrectly recognized #2864

Some Arabic-Indic numbers are being reversed #2897

Shreeshrii avatar Feb 29 '20 14:02 Shreeshrii

https://github.com/Shreeshrii/tessdata_arabic this link may help, it helped me a lot.

BasmaFahmy avatar May 21 '20 04:05 BasmaFahmy

Is there any Indic-Arabic numeral (only) dataset for training tesseract? images+ground truth

sam-kurdi avatar May 25 '20 21:05 sam-kurdi

Definition

* **AEN** Arabic Eastern Numbers {٠١٢٣٤٥٦٧٨٩}

* **AWN** Arabic Western Numbers {0123456789}

I generated an experimental data file that recognizes AEN only. The output of Tesseract OCR will be in the form of AWN.

https://github.com/ahmed-tea/tessdata_Arabic_Numbers

@Shreeshrii @theraysmith

The link is wrong: the link text is correct, but the URL embedded in the link is invalid.

MahmoudMabrok avatar Aug 03 '20 14:08 MahmoudMabrok

https://github.com/jishakrishnan/pytrsseract-arabic - try this out

jishakrishnan avatar Mar 01 '21 12:03 jishakrishnan

Hi @Shreeshrii .

e6d835cf894cfa4a I have this example, where the date تار يخ السداد appears like this:

تار يخ السداد : 48./١./١٠؟١٠؟‏

Any suggestions? Thanks

engahmed1190 avatar Nov 25 '21 12:11 engahmed1190