langdata

Add Indic numerals and missing punctuation to Arabic

Open mustafa0x opened this issue 5 years ago • 4 comments

Previously: #71 and https://github.com/tesseract-ocr/tessdata_best/issues/11 (also contains a pertinent discussion on how well the different traineddata deal with these characters).

• Indic numerals: (٠ ١ ٢ ٣ ٤ ٥ ٦ ٧ ٨ ٩)
• Punctuation: (؛, ،, ﴿﴾)
• Also, a ligature very commonly found in Arabic texts: ﷺ

If I can do this myself, please point me in the right direction.

CC @Shreeshrii

mustafa0x, Jul 12 '18 07:07

Please see https://github.com/tesseract-ocr/tesseract/issues/2263#issuecomment-466675793 and test if the traineddata files linked there add all the required characters.
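One way to run that check, as a minimal sketch: OCR a test page that contains all of the requested characters, then see which of them never show up in the output. The helper names `missing_chars` and `describe` are hypothetical, and the text passed in is assumed to come from whatever traineddata file is being tested. A character that is absent from the output is only a hint that the model cannot produce it, not proof.

```python
import unicodedata

# Characters requested in this issue, written as escapes so that
# right-to-left rendering cannot reorder them in an editor.
REQUIRED = (
    "\u0660\u0661\u0662\u0663\u0664\u0665\u0666\u0667\u0668\u0669"  # Arabic-Indic digits 0-9
    "\u061B\u060C"  # Arabic semicolon, Arabic comma
    "\uFD3E\uFD3F"  # ornate parentheses
    "\uFDFA"        # the ligature requested in the issue
)


def missing_chars(text, required=REQUIRED):
    """Return the required characters that never occur in `text`, in order."""
    present = set(text)
    return [ch for ch in required if ch not in present]


def describe(chars):
    """Format characters as 'U+XXXX NAME' so they are unambiguous in a report."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNNAMED')}" for ch in chars]


if __name__ == "__main__":
    # Hypothetical OCR output; replace with the real text from your test image.
    ocr_output = "\u0661\u0662\u0663 \u060C abc"
    for line in describe(missing_chars(ocr_output)):
        print("not seen:", line)
```

Running this against the output of each candidate traineddata file gives a quick side-by-side view of which characters each one fails to emit on the same test page.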

Shreeshrii, Feb 23 '19 19:02

Is this fixed? I've tried the latest version and it didn't detect any Indic numerals.

wewark, Feb 11 '20 17:02

@wewark You have to use the Arabic.traineddata file. It recognizes Arabic and English letters, as well as both Arabic-Indic and Arabic numerals.
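A minimal sketch of what that looks like from Python, assuming the `pytesseract` wrapper and Pillow are installed and that Arabic.traineddata has been copied into the tessdata directory Tesseract reads from. The function names and the image path are hypothetical. The imports are done lazily so the digit helper can be used without Tesseract present.

```python
def ocr_with_script_model(image_path, lang="Arabic"):
    """OCR an image using the model named by `lang` (Arabic.traineddata by default)."""
    # Imported here so the rest of the module works without these packages.
    import pytesseract
    from PIL import Image

    return pytesseract.image_to_string(Image.open(image_path), lang=lang)


def arabic_indic_digits(text):
    """Return every Arabic-Indic digit (U+0660..U+0669) found in `text`, in order."""
    return [ch for ch in text if "\u0660" <= ch <= "\u0669"]


# Example usage (the file name is an assumption):
#   text = ocr_with_script_model("page.png")
#   print(len(arabic_indic_digits(text)), "Arabic-Indic digits recognized")
```

If `arabic_indic_digits` returns an empty list for a page that clearly contains such digits, the model in use is probably not the one that covers them.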

ShroukMansour, Feb 20 '20 03:02

@ShroukMansour I use ara.traineddata, and the text is not recognized accurately; the numbers are not accurate either. Is there a solution for this?

AhmedElsayedTaha, Feb 03 '21 14:02