pdfparser

Not working properly with Arabic language

Open ahmedsaoud31 opened this issue 5 years ago • 9 comments

Tested file: https://drive.google.com/file/d/1B8lfMfSyN3FeGObppf5Ug-Efe87ceEGV/view?usp=sharing

Licence: GNU FDL V1.2

ahmedsaoud31 avatar Jul 03 '20 11:07 ahmedsaoud31

Hi @ahmedsaoud31, what error do you see?

EDIT: Added PDF from the Google Drive. Learn PHP Programing.pdf

k00ni avatar Jul 03 '20 11:07 k00ni

Hi @ahmedsaoud31, what error do you see?

Paragraphs reversed.

ahmedsaoud31 avatar Jul 03 '20 11:07 ahmedsaoud31

Is "Paragraphs reversed." the error you see? Or do you see an exception, or does the code stop somewhere? Can you paste more information, please?

k00ni avatar Jul 08 '20 07:07 k00ni

There is no error message and the code does not stop, but the Arabic text does not appear in its natural form; it appears reversed.

ahmedsaoud31 avatar Jul 09 '20 12:07 ahmedsaoud31

Hi, I have not looked at the code at all, but this is familiar. Arabic reads right to left; Western languages read left to right. I presume that when the characters are re-encoded into text from the original PDF, they are output in the correct order but left to right. I sometimes get this when copy-pasting from an Arabic PDF. To fix this, the output would have to be in reverse sequence, last character first. I'm about to try to parse a mixed English/Arabic text and will probably have to write a helper to do this if I can't find a better solution.
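A minimal sketch of what such a helper could look like, written in Python rather than PHP for brevity. It assumes the extracted line is in visual (left-to-right display) order and that the paragraph direction is right to left. The names `is_rtl` and `visual_to_logical` are hypothetical and not part of pdfparser. The helper reverses the whole line, then flips each left-to-right run (Latin letters, digits) back so it stays readable.

```python
import unicodedata

# Strong right-to-left bidi classes: "R" (e.g. Hebrew) and "AL" (Arabic letters).
RTL_CLASSES = {"R", "AL"}
# Classes treated as left-to-right runs here: Latin letters and European digits.
LTR_CLASSES = {"L", "EN"}


def is_rtl(ch):
    """Return True if the character has a strong right-to-left bidi class."""
    return unicodedata.bidirectional(ch) in RTL_CLASSES


def visual_to_logical(line):
    """Hypothetical helper: turn a visually ordered RTL line into reading order."""
    if not any(is_rtl(ch) for ch in line):
        return line  # no RTL characters, leave the line untouched

    out, run = [], []
    for ch in line[::-1]:  # last character first
        if unicodedata.bidirectional(ch) in LTR_CLASSES:
            run.append(ch)  # collect a flipped LTR run
        else:
            out.extend(reversed(run))  # restore the LTR run's own order
            run = []
            out.append(ch)
    out.extend(reversed(run))
    return "".join(out)
```

This is far simpler than the Unicode Bidirectional Algorithm. For example, a multi-word English phrase inside an Arabic line would come out with its words in swapped order, because the space between the words ends each run.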

Leamsi9 avatar Oct 03 '20 16:10 Leamsi9

I would like to develop support for RTL languages. Is that possible? Can you tell me where I should start?

typeoo avatar Aug 19 '23 18:08 typeoo

@typeoo Thank you for your interest.

How well do you know the PDF specification? You could start by checking the files in https://github.com/smalot/pdfparser/tree/master/src/Smalot/PdfParser/RawData, or go the other way around and check out https://github.com/smalot/pdfparser/blob/master/src/Smalot/PdfParser/Page.php#L179 first.

But I am not sure which parts are involved here. Before creating a pull request, check out https://github.com/smalot/pdfparser/pull/633/files#diff-b2496e80299b8c3150b1944450bd81c622e04e13d15c411d291db0927d75fd6bR16-R27. It is not merged yet, but it will be the basis for future pull requests.

@GreyWyvern You are currently working on many parts of the library. Do you have an idea? Maybe there is room to collaborate?

k00ni avatar Aug 21 '23 13:08 k00ni

@GreyWyvern You are currently working on many parts of the library. Do you have an idea? Maybe there is room to collaborate?

Yep, I've looked at this file and as far as I know it doesn't use the usual /ReversedChars command method that is the only thing PdfParser understands. I believe it is specifying the characters in forward (LtR) fashion, but positioning the characters, one by one, in RtL fashion.

Handling this would be a rather large change, as you would have to give Font::decodeText() (and probably other functions in PDFObject.php) the ability to insert characters into a string before characters it has already decoded, just based on the n offset value. Certainly possible, but probably very tricky.
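To illustrate the "insert before already decoded characters" idea, here is a rough Python sketch. It is not pdfparser code and makes strong assumptions: each glyph arrives as a hypothetical `(x, char)` pair in content-stream order, all glyphs sit on one line, and the horizontal position alone decides the order. `place_glyphs` is an invented name.

```python
import bisect


def place_glyphs(glyphs):
    """Order characters by horizontal position instead of stream order.

    glyphs: iterable of (x, char) pairs in the order they were decoded.
    Returns the characters joined in left-to-right visual order.
    """
    xs = []     # x positions placed so far, kept sorted
    chars = []  # characters, parallel to xs
    for x, ch in glyphs:
        # Find where this glyph belongs among those already decoded;
        # a smaller x means it is inserted *before* earlier characters.
        i = bisect.bisect_right(xs, x)
        xs.insert(i, x)
        chars.insert(i, ch)
    return "".join(chars)
```

For instance, `place_glyphs([(30, "a"), (20, "b"), (10, "c")])` yields `"cba"`, because each later glyph is positioned to the left of the previous one. A real implementation would also have to handle multiple lines, font matrices, and glyph widths.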

Edit: Apparently there is also a Font matrix which in this case might include a negative scaling value for the Arabic text which would make it appear backwards (the proper direction).

GreyWyvern avatar Aug 21 '23 14:08 GreyWyvern

Unfortunately, the same problem persists. This code does not work with Arabic or any other language that reads from right to left. I don't know how to solve the problem.

yastoss avatar May 30 '24 19:05 yastoss