Incorrect output for some non-UTF-8 characters

Open iamkhusainov7 opened this issue 2 years ago • 2 comments

  • PHP Version: ^8.0
  • PDFParser Version: ^2.3

Description:

Hello. We are working with documents that contain Polish characters. Unfortunately, these characters are parsed incorrectly in some documents, or in some parts of the same document.

PDF input

To obtain the pdf, you can contact me by email: [email protected]

Expected output & actual output

expected output: 'ORANGE 100% MOBILITY SPÓŁKA Z OGRANICZONĄ ODPOWIEDZIALNOŚCIĄ'

actual output: 'b"ORANGE 100yb\x00\x03\x000\x002\x00%\x00,\x00/\x00,\x007\x00<\x00\x03\x006\x003\x00Ï\x00à\x00.\x00$\x00\x03\x00=\x00\x03\x002\x00*\x005\x00$\x001\x00,\x00&\x00=\x002\x001\x01\x04\x00\x03\x002\x00'\x003\x002\x00:\x00,\x00(\x00'\x00=\x00,\x00$\x00/\x001\x002\x01\x1D\x00&\x00,\x01\x04"'
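
For anyone triaging this, the garbled part of the actual output above can be inspected as a sequence of 2-byte big-endian codes. The sketch below is in Python purely for quick inspection and is not part of pdfparser. It takes the nine byte pairs that follow `ORANGE 100yb` and that appear to correspond to ` MOBILITY` in the expected output. The helper names `split_codes` and `guess_ascii` are hypothetical, and the offset of 29 is an assumption observed on this sample only, not something confirmed in this thread.

```python
import struct

# Bytes copied from the "actual output" above: the nine 2-byte pairs that
# follow "ORANGE 100yb" and appear to correspond to " MOBILITY".
sample = b"\x00\x03\x000\x002\x00%\x00,\x00/\x00,\x007\x00<"


def split_codes(raw):
    """Interpret raw bytes as a list of 2-byte big-endian integers."""
    if len(raw) % 2:
        raise ValueError("expected an even number of bytes")
    return [code for (code,) in struct.iter_unpack(">H", raw)]


# ASSUMPTION: this offset was found empirically on this one sample.
# It is not documented behaviour of pdfparser or of the PDF in question.
ASSUMED_OFFSET = 29


def guess_ascii(codes, offset=ASSUMED_OFFSET):
    """Shift each code by the assumed offset and read it as a character."""
    return "".join(chr(code + offset) for code in codes)


codes = split_codes(sample)
print(codes)               # [3, 48, 50, 37, 44, 47, 44, 55, 60]
print(guess_ascii(codes))  # " MOBILITY" (with a leading space)
```

If the assumption holds, this suggests (but does not prove) that the text was returned as raw font codes rather than decoded characters. The same fixed offset is not expected to recover the Polish letters, which would need the font's own character mapping.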

Code

Unfortunately, I am not able to share the code, but I follow the documentation exactly: I take the text content either from all pages or from one specific page. The 'getDataTm' method produces the same output.

iamkhusainov7 avatar Mar 18 '23 08:03 iamkhusainov7

This should be fixed by #627. Please try with the latest version and let us know if it's working for you @iamkhusainov7.

GreyWyvern avatar Aug 21 '23 20:08 GreyWyvern

Hey @GreyWyvern. Sure thing, I will do it this week; I am on vacation and unable to verify it right now.

iamkhusainov7 avatar Aug 29 '23 19:08 iamkhusainov7