tesseract
Arabic text (written right to left) is stored left to right in searchable PDF output
I tested the latest release, 3.05, on Windows to OCR an Arabic document into a searchable PDF. When I select text in the output PDF, it appears to be stored in the opposite direction (left to right), although the letters should be stored right to left.
For example, the original Arabic text مرحبا is stored in the PDF text layer as ابحرم
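To make the reported symptom concrete, here is a minimal Python sketch. It only illustrates the difference between logical (reading) order and reversed order for the example word above; it is not Tesseract code, and the function name to_visual_order is hypothetical.

```python
# Illustration only: mimics the "stored backwards" symptom from the report.
# The helper name below is hypothetical and not part of Tesseract.

def to_visual_order(logical: str) -> str:
    """Reverse the code points of a string that is in reading order."""
    return logical[::-1]

# The example word from the report, written as code points so that editor
# bidi rendering cannot disguise the actual storage order.
original = "\u0645\u0631\u062d\u0628\u0627"  # meem, reh, hah, beh, alef
reported = "\u0627\u0628\u062d\u0631\u0645"  # alef, beh, hah, reh, meem

stored = to_visual_order(original)

# The reversed string matches what was pasted out of the PDF.
print(stored == reported)

# A plain substring search for the original word fails on the reversed text.
print(original in stored)

# Reversing again recovers the searchable form.
print(original in to_visual_order(stored))
```

This only models a simple per-word reversal. Real pasted output can also differ in word and line order, depending on the viewer.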
Please attach your sample file and the command you used for the OCR job.
This is the command:
tesseract c:\temp\test_ara.jpg -l ara -psm 3 c:\temp\test_ara pdf
The files are attached (source JPG and output PDF).
Please check: the original word أنحاء appears inside the PDF as ءاحنا
The command and samples are now attached in the previous comment.
Which program are you using to view the PDF?
It does not look reversed with the Chrome PDF viewer, just not very accurate...
@amitdo Is there any way to get better accuracy for Arabic until the switch to the new engine? With Tesseract I currently get about 100% accuracy in English, but only about 30-40% for Arabic. For comparison, I checked Google Drive OCR for Arabic and it seems to give 100% results for the same image.
Can we work on the language data to get better results?
I am using Adobe Reader. But please note that the words are not reversed when viewing the PDF, because it shows the original image with the text layer on top. I mean that when you copy the text layer and paste it into any text editor, it is reversed. So it is not possible to search for text inside the PDF, because it is stored reversed inside the text layer!
This is a serious issue with the PDF output feature for Arabic and other languages that are written from right to left.
@roozgar
It seems that Ray is planning to release a new version of Tesseract soon, which will include a new OCR engine based on LSTM.
With LSTM, OCR for printed Arabic (not handwriting) can reach 95% character accuracy.
"Offline Printed Urdu Nastaleeq Script Recognition with Bidirectional LSTM Networks" http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.447.4577&rep=rep1&type=pdf
I checked Google Drive OCR for Arabic and it seems to give 100% results for the same image.
Neither you nor I know what programs they are using to do OCR there...
@tbadran
But please note that the words are not reversed when viewing the PDF, because it shows the original image with the text layer on top. I mean that when you copy the text layer and paste it into any text editor, it is reversed. So it is not possible to search for text inside the PDF, because it is stored reversed inside the text layer!
Yes, I know...
Here is a copy of the invisible text layer (copied & pasted):
مداها ينم همهما اللغة العريية لغة جهد مه مسنره هي انحاء العالم
Using Chromium (Google browser) PDF viewer under Linux.
Your original jpg image:
I try hard to make sure Arabic and other right-to-left languages work correctly in Tesseract PDF. As the problem is isolated further I'm happy to look, but I'm not aware of any reason things would have broken.
A quick check shows Chrome gives good results (as per amitdo) and Acroread gives bad results (as per tbadran). This is surprising, I thought we were good with Acroread. I wonder if this is a regression and if so when it occurred.
Regarding recognition accuracy, that's a better topic for the forum. But in short: Don't compare against Google Drive. Don't expect major accuracy improvements unless/until Ray is successful with his ideas. And most importantly, don't trust any predictions about 'soon'. That last one is true for all software everywhere.
@roozgar
You can try training Tesseract using the regular engine. Use the wiki and see #169. I really don't know how good the result will be for Arabic.
Like jbreiden said, the timeline could change...
Please note that I am testing with the Windows binaries downloaded from http://domasofan.spdns.eu/tesseract/ and that I am using Windows 10 with Acrobat Pro 11 to view the output PDF file.
I have tested several different sample files, not only the sample uploaded above, and every time I get the same issue in the output PDF on Windows 10 + Acrobat Pro 11.
On OS X, I'm seeing the opposite of earlier reports:
- Acrobat Reader DC 15.10.20056.167417 appears correct when cutting & pasting
- Google Chrome Version 48.0.2564.116 (64-bit) appears backwards
Adobe Acrobat:
امهمه مني اهادم ةييرعلا ةغللا . هم دهج ةغل ملاعلا ءاحنا يه هرنسم
Google Chrome
مداها ينم همهما اللغة العريية لغة جهد مه مسنره هي انحاء العالم
Tom,
Look at the original jpg. Lines 2 and 4 in Google Chrome look quite similar to lines 2 and 3 in the original jpg. First word in line 3 in the original jpg became first word in line 3 in Google Chrome. Clearly, that's the 'good' output...
Again, in Google Chromium. If I mark the first two lines in the PDF + first word in line 3, copy the (invisible) text, paste it to a text file, mark the second to last word in line 3 in the PDF, copy the (invisible) text, paste it to the text file, I get:
مداها ينم همهما اللغة العريية لغة مسنره هي انحاء العالم
I find it a little easier to test with Hebrew because the letters do not connect. Tesseract version 3.03 behaves the same, so this is not a regression. Will need to think about this, because it is not obvious what exactly is going wrong. Lots of PDF files do a crazy 'write it backwards' strategy but that should not be required. Tesseract writes in reading order.
There are two things I can think of doing. One is to give up and write Arabic backwards (which I really hate!). The other is to put an entry in the PDF metadata, Catalog/ViewerPreferences/Direction. Will continue thinking about this, slowly.
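As a rough illustration of the second option, the sketch below formats a PDF document catalog object that carries a ViewerPreferences dictionary with a Direction entry. The PDF specification defines L2R and R2L as the values for this key. Everything else here is an assumption for illustration: the function name catalog_object and the object numbers are hypothetical, this is not how Tesseract's PDF renderer is written, and whether a given viewer takes this entry into account when copying text is untested.

```python
# Sketch only: builds the text of a PDF catalog object with a reading
# direction hint. Function name and object numbers are hypothetical.

def catalog_object(obj_num: int, pages_ref: int, direction: str = "R2L") -> str:
    """Return a catalog object with /ViewerPreferences /Direction set."""
    if direction not in ("L2R", "R2L"):
        raise ValueError("direction must be 'L2R' or 'R2L'")
    return (
        f"{obj_num} 0 obj\n"
        "<<\n"
        "  /Type /Catalog\n"
        f"  /Pages {pages_ref} 0 R\n"
        f"  /ViewerPreferences << /Direction /{direction} >>\n"
        ">>\n"
        "endobj\n"
    )

# Example: catalog as object 1, pointing at a page tree in object 2.
print(catalog_object(1, 2))
```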
@jbreiden I didn't understand you. In one comment you talk about Hebrew and in another you refer only to Arabic. Is Hebrew displayed correctly with Adobe Reader?
Please make sure that any change you make does not cause a regression with the Chrome PDF viewer or OS X Preview. Thanks for your work!
@amitdo Hebrew has the exact same problem as Arabic.
Maybe explicitly using Unicode bidi control characters can help?
That's another possibility, thanks for the suggestion.
@jbreiden, any progress? Which approach did you choose? Personally, I care about our Hebrew support.
I am taking a look at this today. With current code, copy-paste works from Chrome, fails from Adobe Reader. Destination is gEdit. All tests are on Linux. I see no difference in Adobe Reader if I insert U+2067 RIGHT-TO-LEFT ISOLATE (RLI) at the beginning of each word, and U+2069 POP DIRECTIONAL ISOLATE (PDI) at the end of each word. It's possible that my copy of Adobe Reader is too old to understand these control characters. Or that I am using them wrong. Too early to tell.
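For anyone who wants to reproduce the experiment outside of Tesseract, the wrapping described above can be sketched in Python as follows. The helper names isolate_rtl_words and strip_isolates are hypothetical, and splitting on whitespace is an assumption about what counts as a word; the actual test was done inside Tesseract's PDF output code, which is not shown here.

```python
# Sketch of the experiment: wrap each word in RLI ... PDI.
# Helper names are hypothetical; word splitting on whitespace is an assumption.

RLI = "\u2067"  # RIGHT-TO-LEFT ISOLATE
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE

def isolate_rtl_words(line: str) -> str:
    """Put RLI before and PDI after every whitespace-separated word."""
    return " ".join(f"{RLI}{word}{PDI}" for word in line.split())

def strip_isolates(text: str) -> str:
    """Remove the two control characters again, e.g. before comparing text."""
    return text.replace(RLI, "").replace(PDI, "")

sample = "\u05e9\u05dc\u05d5\u05dd \u05e2\u05d5\u05dc\u05dd"  # two Hebrew words
wrapped = isolate_rtl_words(sample)

# Each word gains exactly two invisible code points.
print(len(wrapped) - len(sample))
# Stripping the controls gives back the original text.
print(strip_isolates(wrapped) == sample)
```

The controls are invisible, so comparing copied text against an expected string is more reliable after calling strip_isolates on both sides.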