
convert_from_path returns non-ascii characters in some pages of a pdf


Problem: When I split a PDF into multiple pages, I found that the data on some pages is corrupted. Although I can see the corresponding page content clearly in Chrome's PDF viewer, the output produced by convert_from_path for the same page looks corrupted, as shown below.

Due to sensitive content, I can't share the complete PDF.

Screenshots: This is page 21 of the PDF as it should appear: content_page_21. This is the same page 21 as output by ''convert_from_path'': garbage_page_21

Desktop:

  • OS: Ubuntu 18.04.4 LTS
  • pdf2image: 1.10.0
  • Python: 3.8.0

Additional context

  • I used Chrome to view the PDF, and page 21 appears fine there. However, when I used Ubuntu's default "Document Viewer", it shows page 21 (along with some other pages) as corrupted.
  • When I reopened the PDF in Ubuntu's "Document Viewer" multiple times, I noticed that the non-ASCII symbols change position; each time I open the file, a different part of the PDF is corrupted.
  • I compared the results from several sources:
    • the text extracted from page 21 using tesseract (not clear),
    • the Document Viewer rendering of page 21 (not clear), and
    • Chrome's view of the PDF's page 21 (which is clear).

I am starting to think that PDF is no longer a Portable Document Format.

venkat-amballa avatar Mar 31 '20 11:03 venkat-amballa

Unfortunately, this is probably a by-product of pdfium's (Chromium's PDF engine) very "soft" handling of the PDF specification. I am afraid that unless you can render the document cleanly with pdftoppm -r 200 your_pdf.pdf out, I cannot help you.
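For reference, the diagnostic above can be sketched in Python. This is a hypothetical helper, not part of pdf2image itself; it assumes poppler-utils (for the pdftoppm binary) and the pdf2image package are installed. Comparing pdftoppm's direct output against convert_from_path's output helps determine whether the corruption comes from poppler or from pdf2image's wrapping of it.

```python
# Hypothetical diagnostic sketch: render the same PDF with pdftoppm
# directly and via pdf2image, so the two outputs can be compared.
import shutil
import subprocess


def poppler_available():
    """Return True if the poppler pdftoppm binary is on PATH."""
    return shutil.which("pdftoppm") is not None


def render_with_pdftoppm(pdf_path, out_prefix, dpi=200):
    """Render every page to PPM files, mirroring `pdftoppm -r 200 file out`.

    Raises CalledProcessError if pdftoppm itself cannot handle the file,
    which is the failure mode the maintainer is asking about.
    """
    subprocess.run(
        ["pdftoppm", "-r", str(dpi), pdf_path, out_prefix],
        check=True,
    )


def render_with_pdf2image(pdf_path, dpi=200):
    """Render pages via pdf2image; returns a list of PIL images."""
    # Deferred import: only needed if you take this code path.
    from pdf2image import convert_from_path

    return convert_from_path(pdf_path, dpi=dpi)
```

If pdftoppm produces the same garbled page 21 on its own, the problem lies in poppler's handling of that PDF (or in the PDF itself), not in pdf2image.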

Belval avatar Mar 31 '20 17:03 Belval