
Arabic text (written right to left) is stored left to right in searchable PDF output

Open tbadran opened this issue 9 years ago • 102 comments

I tested the latest release, 3.05, on Windows to OCR an Arabic document to a searchable PDF. When I select text in the output PDF, it appears to be stored in the opposite direction (left to right), while the letters should be stored right to left.

For example, the original Arabic text مرحبا is stored in the PDF text layer as ابحرم
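
To make the report concrete, here is a minimal Python sketch of what "stored reversed" means for the word above. The helper name `is_reversed` is hypothetical and only for illustration; it is not part of Tesseract. The strings are written as code point escapes so that the storage order is unambiguous regardless of how an editor displays them.

```python
# Logical (reading) order: the first code point is the first letter read.
logical = "\u0645\u0631\u062d\u0628\u0627"  # the word from the report, meem first
# What the report says comes out of the PDF text layer when copied.
stored = "\u0627\u0628\u062d\u0631\u0645"   # same letters, alef first


def is_reversed(expected: str, extracted: str) -> bool:
    """Return True if `extracted` is `expected` with its code points reversed."""
    return extracted != expected and extracted == expected[::-1]


# The copied text is the exact code point reversal of the original word.
print(is_reversed(logical, stored))

# A plain substring search for the original word cannot match the reversed
# text layer, which is why searching inside the PDF fails.
print(logical in stored)
```

A check like this can be used on any word copied from the text layer to tell a pure ordering problem apart from an ordinary recognition error.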

tbadran avatar Feb 25 '16 12:02 tbadran

Please attach your sample file and the command you used for the OCR job.

roozgar avatar Feb 25 '16 12:02 roozgar

This is the command:

tesseract c:\temp\test_ara.jpg -l ara -psm 3 c:\temp\test_ara pdf

Files are attached (source JPG and output PDF)

test_ara test_ara.pdf

Please check: the original word is أنحاء, but the output inside the PDF is ءاحنا

tbadran avatar Feb 25 '16 12:02 tbadran

The command and samples are now attached in the previous comment.

tbadran avatar Feb 25 '16 13:02 tbadran

Which program are you using to view the PDF?

amitdo avatar Feb 26 '16 18:02 amitdo

It does not look reversed with Chrome PDF viewer, just not very accurate...

amitdo avatar Feb 26 '16 18:02 amitdo

@amitdo is there any way to get better accuracy for Arabic until the switch to the new engine? With Tesseract I currently get about 100% accuracy in English, but for Arabic the result is about 30-40%. By comparison, I checked Google Drive OCR for Arabic and it gives 100% results for the same image.

Can we work on the language data to get better results?

roozgar avatar Feb 26 '16 18:02 roozgar

I am using Adobe Reader. But please note that the words are not reversed when viewing the PDF, because it shows the original image with a text layer. I mean that when you copy the text layer and paste it into any text editor, it is reversed. So you can't search for text inside the PDF, because it is stored reversed in the text layer!

tbadran avatar Feb 26 '16 19:02 tbadran

This is a serious issue with the PDF output feature for Arabic and similar languages that are written from right to left.

tbadran avatar Feb 26 '16 19:02 tbadran

@roozgar

It seems that Ray is planning to release a new version of Tesseract soon, which will include a new OCR engine based on LSTM.

With LSTM, OCR for printed Arabic (not real handwriting) can reach 95% character accuracy.

"Offline Printed Urdu Nastaleeq Script Recognition with Bidirectional LSTM Networks" http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.447.4577&rep=rep1&type=pdf

amitdo avatar Feb 26 '16 20:02 amitdo

I checked google drive ocr for Arabic and i see it have 100 results for same image..

Neither you nor I know what programs they are using to do OCR there...

amitdo avatar Feb 26 '16 21:02 amitdo

@tbadran

But please note that words are not reversed while viewing the PDF because it contains the original image with text layer. I mean when you copy text layer then paste it to any text editor it will be reversed, so now can't search for the text inside the PDF because it is stored reversed inside the text layer!

Yes, I know...

Here is a copy of the invisible text layer (copied & pasted):

مداها ينم همهما اللغة العريية لغة جهد مه مسنره هي انحاء العالم

Using Chromium (Google browser) PDF viewer under Linux.

Your original jpg image: test_ara

amitdo avatar Feb 26 '16 22:02 amitdo

I try hard to make sure Arabic and other right-to-left languages work correctly in Tesseract PDF. As the problem is isolated further I'm happy to look, but I'm not aware of any reason things would have broken.

jbreiden avatar Feb 27 '16 02:02 jbreiden

A quick check shows Chrome gives good results (as per amitdo) and Acroread gives bad results (as per tbadran). This is surprising, I thought we were good with Acroread. I wonder if this is a regression and if so when it occurred.

jbreiden avatar Feb 27 '16 02:02 jbreiden

Regarding recognition accuracy, that's a better topic for the forum. But in short: Don't compare against Google Drive. Don't expect major accuracy improvements unless/until Ray is successful with his ideas. And most importantly, don't trust any predictions about 'soon'. That last one is true for all software everywhere.

jbreiden avatar Feb 27 '16 06:02 jbreiden

@roozgar

You can try training Tesseract using the regular engine. Use the wiki and see #169. I really don't know how good the result will be for Arabic.

Like jbreiden said, the timeline could change...

amitdo avatar Feb 27 '16 09:02 amitdo

Please note that I am testing with the Windows binaries downloaded from http://domasofan.spdns.eu/tesseract/ and I am using Windows 10 with Acrobat Pro 11 to view the output PDF file.

tbadran avatar Feb 27 '16 16:02 tbadran

I have tested multiple different sample files, not only the sample uploaded above, and every time I get the same issue in the output PDF on Windows 10 + Acrobat Pro 11.

tbadran avatar Feb 27 '16 16:02 tbadran

On OS X, I'm seeing the opposite of earlier reports:

  • Acrobat Reader DC 15.10.20056.167417 appears correct when cutting & pasting
  • Google Chrome Version 48.0.2564.116 (64-bit) appears backwards

tfmorris avatar Feb 29 '16 19:02 tfmorris

Adobe Acrobat:

امهمه مني اهادم ةييرعلا ةغللا . هم دهج ةغل ملاعلا ءاحنا يه هرنسم

Google Chrome

مداها ينم همهما اللغة العريية لغة جهد مه مسنره هي انحاء العالم

tfmorris avatar Feb 29 '16 19:02 tfmorris

Tom,

Look at the original jpg. Lines 2 and 4 in Google Chrome look quite similar to lines 2 and 3 in the original jpg. First word in line 3 in the original jpg became first word in line 3 in Google Chrome. Clearly, that's the 'good' output...

amitdo avatar Feb 29 '16 22:02 amitdo

Again, in Google Chromium. If I mark the first two lines in the PDF + first word in line 3, copy the (invisible) text, paste it to a text file, mark the second to last word in line 3 in the PDF, copy the (invisible) text, paste it to the text file, I get:

مداها ينم همهما اللغة العريية لغة مسنره هي انحاء العالم

amitdo avatar Feb 29 '16 22:02 amitdo

I find it a little easier to test with Hebrew because the letters do not connect. Tesseract version 3.03 behaves the same, so this is not a regression. Will need to think about this, because it is not obvious what exactly is going wrong. Lots of PDF files do a crazy 'write it backwards' strategy but that should not be required. Tesseract writes in reading order.
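
As a side note on why Hebrew is a fair stand-in for Arabic here: both scripts consist of strongly right-to-left characters, so a viewer has to treat them the same way when text is stored in reading (logical) order. A small sketch using Python's standard `unicodedata` module shows the bidi classes. The sample words and the helper name `bidi_classes` are my own choices for illustration, not taken from the attached files.

```python
import unicodedata


def bidi_classes(text: str) -> list:
    """Return the Unicode bidirectional class of each character in `text`."""
    return [unicodedata.bidirectional(ch) for ch in text]


hebrew = "\u05e9\u05dc\u05d5\u05dd"        # a four-letter Hebrew word
arabic = "\u0645\u0631\u062d\u0628\u0627"  # the Arabic word from the first comment
latin = "abc"

# Hebrew letters are class "R", Arabic letters are class "AL"; both are
# strong right-to-left classes. Latin letters are class "L".
print(bidi_classes(hebrew))
print(bidi_classes(arabic))
print(bidi_classes(latin))
```

Since the letters in both scripts are strong right-to-left types, a fix that works for Hebrew copy and paste should in principle carry over to Arabic, though that would still need to be checked in each viewer.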

jbreiden avatar Mar 01 '16 00:03 jbreiden

There are two things I can think of doing. One is to give up and write Arabic backwards (which I really hate!). The other is to put an entry in the PDF metadata, Catalog/ViewerPreferences/Direction. Will continue thinking about this, slowly.
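
For readers unfamiliar with the second option, this is roughly what a catalog entry with a `Direction` viewer preference looks like. The PDF specification defines `/Direction` in the `ViewerPreferences` dictionary with the values `/L2R` and `/R2L`. The Python function below is only a sketch that builds such an object as text; `catalog_object` and the object numbers are hypothetical, this is not Tesseract's PDF renderer code, and nothing in this thread confirms that the setting changes copy and paste behavior in Adobe Reader.

```python
def catalog_object(obj_num: int, pages_ref: int, rtl: bool) -> str:
    """Build the text of a PDF catalog object with a Direction preference.

    obj_num and pages_ref are illustrative object numbers only.
    """
    direction = "/R2L" if rtl else "/L2R"
    return (
        f"{obj_num} 0 obj\n"
        "<<\n"
        "  /Type /Catalog\n"
        f"  /Pages {pages_ref} 0 R\n"
        f"  /ViewerPreferences << /Direction {direction} >>\n"
        ">>\n"
        "endobj\n"
    )


print(catalog_object(1, 2, rtl=True))
```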

jbreiden avatar Mar 09 '16 01:03 jbreiden

@jbreiden I didn't understand you. In one comment you talk about Hebrew and in another you refer only to Arabic. Is Hebrew displayed correctly in Adobe Reader?

amitdo avatar Mar 09 '16 09:03 amitdo

Please make sure that any change you make does not cause a regression with Chrome PDF viewer and OS X Preview. Thanks for your work!

amitdo avatar Mar 09 '16 10:03 amitdo

@amitdo Hebrew has the exact same problem as Arabic.

jbreiden avatar Mar 09 '16 22:03 jbreiden

Maybe explicitly using Unicode bidi control characters can help?

amitdo avatar Mar 10 '16 11:03 amitdo

That's another possibility, thanks for the suggestion.

jbreiden avatar Mar 18 '16 18:03 jbreiden

@jbreiden, any progress? Which approach did you choose? Personally, I care about our Hebrew support.

amitdo avatar Jun 02 '16 08:06 amitdo

I am taking a look at this today. With current code, copy-paste works from Chrome, fails from Adobe Reader. Destination is gEdit. All tests are on Linux. I see no difference in Adobe Reader if I insert U+2067 RIGHT-TO-LEFT ISOLATE (RLI) at the beginning of each word, and U+2069 POP DIRECTIONAL ISOLATE (PDI) at the end of each word. It's possible that my copy of Adobe Reader is too old to understand these control characters. Or that I am using them wrong. Too early to tell.
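
The wrapping described above can be sketched in a few lines of Python. This is only an illustration of the experiment (RLI before each word, PDI after it), not the actual change made in Tesseract's C++ PDF renderer; the function name `isolate_words` and the sample string are hypothetical.

```python
import unicodedata

RLI = "\u2067"  # RIGHT-TO-LEFT ISOLATE
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE


def isolate_words(line: str) -> str:
    """Wrap every whitespace-separated word in RLI ... PDI."""
    return " ".join(f"{RLI}{word}{PDI}" for word in line.split())


# Two short Hebrew words, written as escapes so the storage order is explicit.
sample = "\u05d0\u05d1 \u05d2\u05d3"
wrapped = isolate_words(sample)

# Print the name of every code point to make the invisible controls visible.
for ch in wrapped:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))
```

Printing the character names is useful when debugging, because the control characters are invisible in most editors and viewers.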


jbreiden avatar Jul 06 '16 21:07 jbreiden