tesseract
Arabic Numbers
Environment
- Tesseract Version: Current main repository (4.00.00alpha)
- Platform: Windows7 32-bit
Current Behavior:
It recognizes Arabic characters but cannot recognize Arabic numbers (ارقام عربى 0123456789). I tried tessdata, tessdata_best, and tessdata_fast.
Expected Behavior:
Suggested Fix:
Did you try Arabic.traineddata?
@amitdo yes
It recognizes the characters (about 80%, including the Latin numbers), but it does not recognize the Arabic numbers inside the red rectangle (the original image has no red rectangle).
I tried other pictures containing only numbers and got no numbers at all.
@theraysmith has not updated the repositories with changes to handle all these issues. Hence, you should not expect them to be fixed.
@Shreeshrii @theraysmith Are there changes that handle all these issues but have not been pushed to the repositories yet, or is there no fix?
I think Ray was planning to do new training to handle all these cases. But there has been no update from him since then. Based on past patterns, I would guess that he will make some updates to the project before year end!
Definition
- AEN Arabic Eastern Numbers {٠١٢٣٤٥٦٧٨٩}
- AWN Arabic Western Numbers {0123456789}
I generated an experimental data file that recognizes AEN only. The output of Tesseract OCR will be in the form of AWN.
https://github.com/ahmed-tea/tessdata_Arabic_Numbers
@Shreeshrii @theraysmith
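For readers who get Eastern digits back from OCR and want Western digits in the output, the same AEN to AWN mapping can also be applied as a post-processing step. This is only a minimal sketch and not how the traineddata above works (that file maps the digits during training); the function name `to_western_digits` is hypothetical.

```python
# Post-processing sketch: map Arabic Eastern digits (U+0660..U+0669)
# to Western digits (0..9) in text returned by any OCR engine.
# The digits are built from code points to avoid bidi copy/paste issues.
EASTERN = "".join(chr(0x0660 + i) for i in range(10))
WESTERN = "0123456789"
AEN_TO_AWN = str.maketrans(EASTERN, WESTERN)


def to_western_digits(text: str) -> str:
    """Return text with every Eastern Arabic digit replaced by its Western form."""
    return text.translate(AEN_TO_AWN)


if __name__ == "__main__":
    sample = "\u0661\u0662\u0663"  # Eastern digits one, two, three
    print(to_western_digits(sample))
```

Characters other than the ten Eastern digits pass through unchanged, so the function can be run over mixed Arabic text.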
Thanks for sharing the traineddata. Please let us know the OCR success rate when using it.
Do you combine it with the Arabic traineddata (using -l Ara+) to get correct text plus Arabic numbers?
The success rate for the pictures above is 100% (numbers only), but in general it depends on the picture quality.
Combining with ara:
- current tesseract main repository: gives an error (mgr->GetComponent(TESSDATA_INTTEMP, &fp):Error:Assert failed:in file classify\adaptmatch.cpp, line 537)
- tesseract build by UB Mannheim: gives numbers only
- best and fast (ara and Arabic): not applicable, because they are used for LSTM only, so it gives numbers only
@Shreeshrii
@Shreeshrii sorry for the question, but how do I combine the new tessdata_Arabic_Numbers with the current one?
I copied ara_number.traineddata into the tessdata dir, then I used this command:
tesseract -l ara_number+ara image.tif out.txt
but it doesn't work.
@Fahad-Alsaidi You can't combine it with ara
Is there an ara.traineddata that has been tested and verified to recognize Arabic Eastern numbers? Please share a link if available. @ahmed-tea the following link returns an error, is there any alternative? https://github.com/ahmed-tea/tessdata_Arabic_Numbers
Many thanks
@raminas81 I think the error occurs because it works only with OEM_TESSERACT_ONLY (the old engine). It can't be combined with ara.traineddata.
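A minimal sketch of what running the numbers-only file on the old engine might look like from Python. It assumes a Tesseract 4.x command line (where `--oem 0` selects the legacy engine), that the file is installed as `ara_number.traineddata` as in the comment above, and the helper name `legacy_digits_command` is hypothetical. The language is used on its own, not combined with ara.

```python
import subprocess


def legacy_digits_command(image, out_base, lang="ara_number"):
    """Build a tesseract command that uses the legacy engine only.

    --oem 0 selects the legacy (non-LSTM) engine in Tesseract 4.x.
    The output base is given without an extension because tesseract
    appends ".txt" itself.
    """
    return ["tesseract", image, out_base, "-l", lang, "--oem", "0"]


def run_ocr(image, out_base):
    """Run the command; requires tesseract on PATH and the traineddata installed."""
    subprocess.run(legacy_digits_command(image, out_base), check=True)


if __name__ == "__main__":
    # Only print the command here; call run_ocr() on a machine with tesseract.
    print(" ".join(legacy_digits_command("image.tif", "out")))
```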
Hi, I'm trying to recognize Arabic numbers using tesseract 3.04. The results using the traineddata from https://github.com/tesseract-ocr/tessdata/tree/3.04.00 (with the cube files, of course) are very random, and most of the recognized digits are wrong. Is there any other traineddata file for numbers only that works with tesseract 3.04?
One more thing, and I would be very grateful: if I want to include a whitelist for Arabic recognition, how can this be done?
When I use English recognition, I do it as below
thank you so much
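On the whitelist question, a hedged sketch: `tessedit_char_whitelist` is a standard Tesseract config variable and can be passed on the command line with `-c`. Whether it is honored depends on the engine (it applies to the legacy engine; behaviour with the cube or LSTM models may differ), so treat this as something to try rather than a confirmed fix. The helper name `whitelist_command` is hypothetical.

```python
# Eastern Arabic digits built from code points U+0660..U+0669.
EASTERN_DIGITS = "".join(chr(0x0660 + i) for i in range(10))


def whitelist_command(image, out_base, lang="ara", allowed=EASTERN_DIGITS):
    """Build a tesseract command restricting output to the allowed characters."""
    return [
        "tesseract", image, out_base,
        "-l", lang,
        "-c", "tessedit_char_whitelist=" + allowed,
    ]


if __name__ == "__main__":
    # Prints the argument list; pass it to subprocess.run() to execute.
    print(whitelist_command("image.tif", "out"))
```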
@ahmed-tea Hi, I have used your Arabic numbers trained file for tesseract 4 and it's very good. I'm trying to make the same kind of file for tesseract 3.04. I managed to do it, but the results are returned in Arabic as well, unlike your case, where the numbers are returned in English. I want my results returned in English because, when they are returned in Arabic, the order of the numbers often gets flipped, since the language runs from right to left. I hope you can help with this. Thank you so much in advance.
@AbdelsalamHaa Use jTessBoxEditor (https://github.com/nguyenq/jTessBoxEditor) by @nguyenq. After generating the boxes and before training, readjust the char corresponding to each box. The tool will not accept Arabic numbers, so you have to enter the English number. The OCR will then read the Arabic number, but the output will be the English number.
You can use your Arabic input method to enter Arabic digits, or use the built-in conversion tool. In the Character textbox, enter e.g. U+0668 and click the adjacent button twice or press the Enter key.
@AbdelsalamHaa
1- Make an Arabic Numbers jpg image
2- In the Trainer tab, select the jpg image as training data
3- Set the language to the name you want for the tesseract data file
4- Select Make Box File Only, then run
5- In the Box Editor, open the jpg image
6- For each box in the image you will find the corresponding character in the char column (it will be the wrong character)
7- Readjust each char to match its box (it will not accept Arabic numbers, so you have to enter English numbers)
8- Save
9- Go to the Trainer tab, select Train with Existing Box, and run
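Step 7 can also be done outside the GUI by editing the box file as text. The sketch below assumes the usual Tesseract box file layout (`<symbol> <left> <bottom> <right> <top> <page>`, one box per line) and that the symbols are already Eastern Arabic digits; if the generated boxes contain wrong characters, as step 6 warns, they still have to be corrected by hand. The function names are hypothetical.

```python
# Map each Eastern Arabic digit (U+0660..U+0669) to its Western digit.
EASTERN_TO_WESTERN = {chr(0x0660 + i): str(i) for i in range(10)}


def remap_box_line(line):
    """Relabel one box line if its symbol is an Eastern Arabic digit."""
    parts = line.rstrip("\n").split(" ")
    # A box line has a symbol followed by at least four coordinates.
    if len(parts) >= 5 and parts[0] in EASTERN_TO_WESTERN:
        parts[0] = EASTERN_TO_WESTERN[parts[0]]
    return " ".join(parts)


def remap_box_text(text):
    """Relabel every line of a box file given as a single string."""
    return "\n".join(remap_box_line(line) for line in text.splitlines())


if __name__ == "__main__":
    example = "\u0663 10 20 30 40 0\n\u0668 35 20 55 40 0"
    print(remap_box_text(example))
```

Only the label changes; the coordinates stay as generated, so training sees the Arabic glyph image paired with a Western digit label.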
@nguyenq I tried your method. The output of OCR recognition is the Unicode, not the number.
@Ahmed-tea Thanks for sharing the training file. I've downloaded it but do not know how to add it to the tesseract training files. Can you share a guide?
@ahmed-tea : is this issue solved?
@ahmed-tea did you succeed in combining Arabic numbers and Arabic words together?
@Shreeshrii hello, I have some questions:
- what is the best tool to train the engine on a language?
- is there a minimum size for the training dataset or image?
We face a problem when we train the OCR on Indian numbers ( ١٢٣٤٥٦٧٨٩٠ ). We also get bad results when we try to read an image of a paragraph with a mix of Arabic and Indian numbers. Any suggestions?
@Shreeshrii @zdenop
I have tried different types of fine tuning for adding the numbers but have not had much success. I think that the open source tesseract is missing some key component related to Arabic. We will have to wait till @theraysmith or @jbreiden can investigate and fix this.
Reports with latest versions:
Arabic-Indic numbers incorrectly recognized #2864
Some Arabic-Indic numbers are being reversed #2897
https://github.com/Shreeshrii/tessdata_arabic this link may help, it helped me a lot.
Is there any Indic-Arabic numeral (only) dataset for training tesseract? images+ground truth
Regarding the tessdata_Arabic_Numbers link (https://github.com/ahmed-tea/tessdata_Arabic_Numbers) posted earlier: the link text is correct, but the URL embedded in the link is invalid.
https://github.com/jishakrishnan/pytrsseract-arabic - try this out
Hi @Shreeshrii .
I have this example with a date. The text
تار يخ السداد
appears like this in the output:
تار يخ السداد : 48./١./١٠؟١٠؟
Any suggestion , Thanks