pdfparser
Not working properly with Arabic language
Tested file: https://drive.google.com/file/d/1B8lfMfSyN3FeGObppf5Ug-Efe87ceEGV/view?usp=sharing
Licence: GNU FDL V1.2
Hi @ahmedsaoud31, what error do you see?
EDIT: Added PDF from the Google Drive. Learn PHP Programing.pdf
Paragraphs reversed.
Is "Paragraphs reversed." the error you see? Or do you see an exception, or does the code stop somewhere? Can you paste more information, please?
There is no error message and the code does not stop, but the Arabic text does not appear in its natural form; it comes out reversed.
Hi, I have not looked at the code at all, but this is familiar. Arabic reads right to left, Western languages left to right. I presume that when the characters from the original PDF are re-encoded into text, they are output in the correct order but left to right. I sometimes get this when copy-pasting from an Arabic PDF. To fix this, the output would have to be in reverse sequence, last character first. I'm about to try to parse a mixed English/Arabic text and will probably have to write a helper to do this if I can't find a better solution.
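A minimal sketch of such a helper, in Python rather than PHP and not part of PdfParser. It assumes the extracted line holds the characters in left-to-right visual order, so the Arabic runs are backwards. The names `visual_to_logical`, `is_rtl` and `is_ltr` are hypothetical. It reverses the whole line, then flips each Latin/digit span back so the English text stays readable. This is a rough approximation, not an implementation of the Unicode Bidirectional Algorithm.

```python
import unicodedata


def is_rtl(ch):
    # "R" = Hebrew-style right-to-left, "AL" = Arabic letter
    return unicodedata.bidirectional(ch) in {"R", "AL"}


def is_ltr(ch):
    # Strong left-to-right letters plus European and Arabic-Indic digits
    return unicodedata.bidirectional(ch) in {"L", "EN", "AN"}


def visual_to_logical(line):
    """Turn a visually ordered line into reading (logical) order."""
    if not any(is_rtl(ch) for ch in line):
        return line  # pure left-to-right text needs no change

    chars = list(line[::-1])  # reversing fixes the right-to-left runs
    i, n = 0, len(chars)
    while i < n:
        if is_ltr(chars[i]):
            # Extend the span up to the last LTR character that comes
            # before the next RTL character (spaces in between are kept).
            last = i
            k = i + 1
            while k < n and not is_rtl(chars[k]):
                if is_ltr(chars[k]):
                    last = k
                k += 1
            # Flip the span back so Latin text and numbers read normally.
            chars[i:last + 1] = chars[i:last + 1][::-1]
            i = last + 1
        else:
            i += 1
    return "".join(chars)
```

Usage would be one call per extracted line, for example `visual_to_logical(line)` on each line of the page text. Punctuation and brackets next to a direction change are not handled specially, so real documents may still need manual checks.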
I would like to develop it for RTL languages. Is that possible? Can you tell me where I should start?
@typeoo Thank you for your interest.
How well do you know the PDF specification? You could start by checking the files in https://github.com/smalot/pdfparser/tree/master/src/Smalot/PdfParser/RawData, or go the other way around and check out https://github.com/smalot/pdfparser/blob/master/src/Smalot/PdfParser/Page.php#L179 first.
But I am not sure which parts are involved here. Before creating a pull request, check out https://github.com/smalot/pdfparser/pull/633/files#diff-b2496e80299b8c3150b1944450bd81c622e04e13d15c411d291db0927d75fd6bR16-R27. It is not merged yet, but will be the basis for future pull requests.
@GreyWyvern You are currently working on many parts of the library. Do you have an idea? Maybe there is room to collaborate?
Yep, I've looked at this file and as far as I know it doesn't use the usual /ReversedChars command method that is the only thing PdfParser understands. I believe it is specifying the characters in forward (LtR) fashion, but positioning the characters, one by one, in RtL fashion.
Handling this would be a rather large change, as you would have to give Font::decodeText() (and probably other functions in PDFObject.php) the ability to insert characters into a string before characters it has already decoded, just based on the n offset value. Certainly possible, but probably very tricky.
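To illustrate only the "insert before already-decoded characters" idea, here is a small Python sketch. It is not PdfParser code, and it does not reflect how Font::decodeText() works internally. The function `place_glyphs` and the `(x_offset, char)` input format are assumptions made for the example. Each glyph is inserted into the output at the index implied by its horizontal offset, so glyphs that arrive later in the stream but sit further left end up earlier in the string.

```python
import bisect


def place_glyphs(glyphs):
    """Order characters by horizontal position instead of stream order.

    glyphs: iterable of (x_offset, char) pairs in content-stream order
    (a hypothetical format for this sketch).
    Returns the characters in left-to-right visual order.
    """
    offsets = []  # kept sorted ascending
    chars = []    # characters, parallel to offsets
    for x, ch in glyphs:
        # Find where this offset belongs among those already seen;
        # bisect_right keeps stream order for equal offsets.
        i = bisect.bisect_right(offsets, x)
        offsets.insert(i, x)
        chars.insert(i, ch)
    return "".join(chars)
```

The result is visual left-to-right order. For a purely right-to-left line, reading order would be the reverse of that, and mixed-direction lines would need further handling.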
Edit: Apparently there is also a Font matrix which in this case might include a negative scaling value for the Arabic text which would make it appear backwards (the proper direction).
Unfortunately, the same problem persists. This code does not work with Arabic or any other language that reads from right to left. I don't know how to solve the problem.