pdfparser Incorrect output for some non-UTF-8 characters

PHP Version: ^8.0
PDFParser Version: ^2.3

Description:

Hello. We are working with documents that contain Polish characters. Unfortunately, in some documents, or in some parts of the same document, these characters are parsed incorrectly.

PDF input

To obtain the pdf, you can contact me by email: [email protected]

Expected output & actual output

expected output: 'ORANGE 100% MOBILITY SPÓŁKA Z OGRANICZONĄ ODPOWIEDZIALNOŚCIĄ'
actual output: 'b"ORANGE 100yb\x00\x03\x000\x002\x00%\x00,\x00/\x00,\x007\x00<\x00\x03\x006\x003\x00Ï\x00à\x00.\x00$\x00\x03\x00=\x00\x03\x002\x00*\x005\x00$\x001\x00,\x00&\x00=\x002\x001\x01\x04\x00\x03\x002\x00'\x003\x002\x00:\x00,\x00(\x00'\x00=\x00,\x00$\x00/\x001\x002\x01\x1D\x00&\x00,\x01\x04"'

Code

Unfortunately, I am not able to share the code, but I follow the documentation exactly: I take the text content either from all pages or from one specific page. The 'getDataTm' method produces the same output.
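
For reference, the documented usage described above looks roughly like the sketch below. It is not the reporter's actual code: the file path 'document.pdf' is a hypothetical placeholder, and the UTF-8 check at the end is an added assumption (it needs the mbstring extension) meant only to make the affected strings easier to spot.

```php
<?php
// Minimal reproduction sketch, assuming smalot/pdfparser is installed via Composer.
require 'vendor/autoload.php';

use Smalot\PdfParser\Parser;

$parser = new Parser();
$pdf = $parser->parseFile('document.pdf'); // hypothetical path

// Text content of all pages.
$allText = $pdf->getText();

// Text content of one specific page (here: the first one).
$pages = $pdf->getPages();
$firstPageText = $pages[0]->getText();

// Each getDataTm() entry holds the text matrix first and the text second.
$dataTm = $pages[0]->getDataTm();

// Assumption: dump every string that is not valid UTF-8 as hex for the report.
foreach ($dataTm as [$matrix, $text]) {
    if (!mb_check_encoding($text, 'UTF-8')) {
        echo bin2hex($text), PHP_EOL;
    }
}
```

Running this against an affected file should show whether the broken strings come back from 'getText', 'getDataTm', or both.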

Mar 18 '23 08:03 iamkhusainov7

This should be fixed by #627. Please try with the latest version and let us know if it's working for you @iamkhusainov7.

Aug 21 '23 20:08 GreyWyvern

Hey @GreyWyvern. Sure thing, I will do it this week; I am on vacation right now and unable to verify it.

Aug 29 '23 19:08 iamkhusainov7