Incorrect output for some non-UTF-8 characters
- PHP Version: ^8.0
- PDFParser Version: ^2.3
Description:
Hello. We are working with documents that contain Polish characters. Unfortunately, in some documents, or in some parts of the same document, these characters are parsed incorrectly.
PDF input
To obtain the pdf, you can contact me by email: [email protected]
Expected output & actual output
Expected output: 'ORANGE 100% MOBILITY SPÓŁKA Z OGRANICZONĄ ODPOWIEDZIALNOŚCIĄ'
Actual output: 'b"ORANGE 100yb\x00\x03\x000\x002\x00%\x00,\x00/\x00,\x007\x00<\x00\x03\x006\x003\x00Ï\x00à\x00.\x00$\x00\x03\x00=\x00\x03\x002\x00*\x005\x00$\x001\x00,\x00&\x00=\x002\x001\x01\x04\x00\x03\x002\x00'\x003\x002\x00:\x00,\x00(\x00'\x00=\x00,\x00$\x00/\x001\x002\x01\x1D\x00&\x00,\x01\x04"'
Code
Unfortunately, I am not able to share the actual code, but it follows the documentation exactly: I take the text content either from all pages or from one specific page. The 'getDataTm' method produces the same output.
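For reference, a minimal sketch of the documented usage this report refers to. It is not the original code: the file name 'document.pdf', the autoload path, and the choice of the first page are assumptions made only for illustration.

```php
<?php
// Assumption: smalot/pdfparser is installed via Composer.
require 'vendor/autoload.php';

use Smalot\PdfParser\Parser;

$parser = new Parser();

// Hypothetical input file; the real PDF is only available on request.
$pdf = $parser->parseFile('document.pdf');

// Text content of all pages.
$allText = $pdf->getText();

// Text content of one specific page (here the first one, if it exists).
$pages = $pdf->getPages();
if (isset($pages[0])) {
    $pageText = $pages[0]->getText();

    // Each entry pairs a text matrix with the text drawn at that position.
    $dataTm = $pages[0]->getDataTm();
}

// Dump the raw result so unexpected bytes are visible.
var_dump($allText);
```

With an affected document, all three calls would be expected to show the same garbled bytes as in the actual output above.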
This should be fixed by #627. Please try with the latest version and let us know if it's working for you @iamkhusainov7.
Hey @GreyWyvern. Sure thing, I will do it this week. I am on vacation at the moment and unable to verify it right away.