convert_from_path returns non-ASCII characters in some pages of a PDF
Problem: When I tried to split a PDF into individual pages, I found that the data on some pages is corrupted. I can see the corresponding page content clearly in Chrome's PDF viewer, but the output that convert_from_path gives for the same page looks corrupted, as shown below.
Because of some sensitive content, I can't share the complete PDF.
Screenshots
This is page 21 of the PDF:
This is the individual page 21, which is the output of convert_from_path:

Desktop:
- OS: Ubuntu 18.04.4 LTS
- pdf2image - 1.10.0
- Python 3.8.0
Additional context
- I used Chrome to view the PDF, and there page 21 appears clean, whereas Ubuntu's default "Document Viewer" shows page 21 as corrupted, along with some other pages.
- When I reopened the PDF in Ubuntu's "Document Viewer" multiple times, I noticed that those non-ASCII symbols change their position in the PDF, i.e. each time I open it, a different part of the PDF is corrupted.
- Comparing the results from various sources:
- the text output of page 21 from tesseract (not clear),
- the Document Viewer rendering of page 21 (not clear), and
- Chrome's view of the PDF's page 21 (which is clear),
I am starting to think that PDF is no longer a Portable Document Format.
Unfortunately, this is probably a by-product of the very lenient ("soft") way pdfium (the Chromium PDF engine) handles the PDF specification. I am afraid that unless you can parse the document with pdftoppm -r 200 your_pdf.pdf out, I cannot help you.
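Since pdf2image delegates rendering to poppler's command-line tools, running pdftoppm directly is a quick way to check whether the corruption comes from poppler rather than from pdf2image. Below is a minimal sketch of that check in Python. The helper names build_pdftoppm_cmd and render_with_pdftoppm are hypothetical (they are not part of pdf2image), and the -png and -f/-l flags are additions to the command above, assumed here only to get viewable PNG files for a single page.

```python
import shutil
import subprocess
from pathlib import Path


def build_pdftoppm_cmd(pdf_path, out_prefix, dpi=200, page=None):
    """Build the pdftoppm argument list (hypothetical helper, not a pdf2image API)."""
    # -r sets the resolution in DPI; -png is an assumption added for convenience.
    cmd = ["pdftoppm", "-r", str(dpi), "-png"]
    if page is not None:
        # -f / -l select the first and last page to render (1-based).
        cmd += ["-f", str(page), "-l", str(page)]
    cmd += [str(pdf_path), str(out_prefix)]
    return cmd


def render_with_pdftoppm(pdf_path, out_dir, dpi=200, page=None):
    """Run pdftoppm directly and return the PNG files it wrote."""
    if shutil.which("pdftoppm") is None:
        raise RuntimeError("pdftoppm not found; install poppler-utils first")
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    cmd = build_pdftoppm_cmd(pdf_path, out_dir / "out", dpi=dpi, page=page)
    # check=True raises CalledProcessError if pdftoppm exits with an error.
    subprocess.run(cmd, check=True)
    return sorted(out_dir.glob("out*.png"))
```

Calling render_with_pdftoppm("your_pdf.pdf", "pdftoppm_out", page=21) would render only page 21. If the resulting image shows the same stray symbols, the problem most likely sits in poppler's handling of the file, which would also match what the Ubuntu Document Viewer displays.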