pdfminer.six Invisible character handling

Bug report

How can invisible text in a PDF be handled?

The attached PDF contains invisible/hidden text, but it still shows up in the extracted layout output.

My output: [Screenshot 2022-06-30 at 9 37 24 PM]

Original: [Screenshot 2022-06-30 at 9 37 17 PM]

backup.pdf

Jun 30 '22 16:06 damo1808

@damo1808 I had this issue with PDF files as well and couldn't find a workaround within pdfminer. You can try running the file through Ghostscript to effectively remove the invisible text from the PDF. Here's the command; see the Ghostscript documentation to learn more.

gswin64 -sDEVICE=pdfwrite -dBATCH -dNOPAUSE -dQUIET -dSAFER -sOutputFile=out.pdf backup.pdf

You'll need to install Ghostscript before running the command: https://www.ghostscript.com/
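
If you want to run the same Ghostscript step from a Python script before handing the file to pdfminer, something like the following could work. It is only a sketch: the helper names `build_gs_command` and `rewrite_pdf` are made up for illustration, and the executable name is an assumption (`gswin64` as in the command above; on Linux/macOS it is usually `gs`).

```python
import shutil
import subprocess
from typing import List


def build_gs_command(src: str, dst: str, executable: str = "gswin64") -> List[str]:
    """Build the same Ghostscript command line as quoted above."""
    return [
        executable,
        "-sDEVICE=pdfwrite",
        "-dBATCH",
        "-dNOPAUSE",
        "-dQUIET",
        "-dSAFER",
        f"-sOutputFile={dst}",
        src,
    ]


def rewrite_pdf(src: str, dst: str, executable: str = "gswin64") -> None:
    """Run Ghostscript on src and write the rewritten PDF to dst."""
    # Fail early with a clear message if Ghostscript is not on PATH.
    if shutil.which(executable) is None:
        raise FileNotFoundError(f"Ghostscript executable not found: {executable}")
    # check=True raises CalledProcessError if Ghostscript exits with an error.
    subprocess.run(build_gs_command(src, dst, executable), check=True)
```

Calling `rewrite_pdf("backup.pdf", "out.pdf")` would then produce the `out.pdf` that you pass to pdfminer.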

Jul 18 '22 11:07 krishnasism

@damo1808 can you share the code and version of pdfminer.six you are using?

There were some fixes for this in https://github.com/pdfminer/pdfminer.six/pull/689.

Aug 08 '22 20:08 pietermarsman

I realize now that this issue is about actual text that is invisible due to the font color (or something else). The linked PR #689 is about whitespace text that is invisible due to its nature. So they are probably unrelated.

The screenshot shared by @damo1808 also suggests that this has to do with the HTML output. I can replicate this issue now.

$ python tools/pdf2txt.py ~/Downloads/backup.pdf --output_type html | grep "90 6"
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:332px; top:299px; width:35px; height:14px;"><span style="font-family: SW-Folder-Regular; font-size:14px">90 6

The output should also include the color of the text to fix this issue.
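
Until the HTML output carries color, one possible workaround is to read the fill color from the layout objects directly. The sketch below assumes a pdfminer.six version in which `LTChar` exposes a `graphicstate` attribute with an `ncolor` (non-stroking color) field. The helper names, the tolerance, and the "near white means hidden" rule are illustrative assumptions; text can also be hidden in other ways, and white text is only invisible on a white background.

```python
from typing import Any, Iterator, Tuple


def looks_white(color: Any, tol: float = 0.02) -> bool:
    """Guess whether a PDF color value is (near) white.

    Assumption: 1 or 3 components are gray/RGB (white is all ~1),
    4 components are CMYK (white is all ~0). Anything else returns False.
    """
    if isinstance(color, (int, float)):
        comps = [float(color)]
    elif isinstance(color, (tuple, list)) and all(
        isinstance(c, (int, float)) for c in color
    ):
        comps = [float(c) for c in color]
    else:
        return False
    if len(comps) in (1, 3):
        return all(c >= 1 - tol for c in comps)
    if len(comps) == 4:
        return all(c <= tol for c in comps)
    return False


def iter_chars_with_color(pdf_path: str) -> Iterator[Tuple[str, Any]]:
    """Yield (character, non-stroking color) for every LTChar in the PDF."""
    # Imported here so looks_white() can be used without pdfminer installed.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTChar, LTTextContainer, LTTextLine

    for page in extract_pages(pdf_path):
        for box in page:
            if not isinstance(box, LTTextContainer):
                continue
            for line in box:
                if not isinstance(line, LTTextLine):
                    continue
                for obj in line:
                    if isinstance(obj, LTChar):
                        # Older versions may lack graphicstate; fall back to None.
                        state = getattr(obj, "graphicstate", None)
                        yield obj.get_text(), getattr(state, "ncolor", None)


def visible_chars(pdf_path: str) -> str:
    """Join the characters whose fill color does not look white."""
    return "".join(
        ch for ch, color in iter_chars_with_color(pdf_path) if not looks_white(color)
    )
```

Note that `visible_chars` drops line breaks, because only `LTChar` objects are yielded; it is meant for inspecting which characters survive the filter rather than as a full text extractor.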

Aug 09 '22 16:08 pietermarsman

My advice would be to convert the page to HTML and run your own logic on that to remove white text and text smaller than a specific size.
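
A rough sketch of the size part of that idea, using only the standard library, could look like this. It assumes the HTML has the shape shown earlier in the thread (text inside `<span style="... font-size:14px">` elements) and keeps only text from spans at or above a threshold. The `SizeFilter` class, the `visible_text` helper, and the 6px cutoff are hypothetical. Since the HTML output quoted above carries no color, filtering white text this way would need another source for the color.

```python
import re
from html.parser import HTMLParser
from typing import List

# Matches e.g. "font-size:14px" inside a style attribute.
FONT_SIZE = re.compile(r"font-size:\s*(\d+(?:\.\d+)?)px")


class SizeFilter(HTMLParser):
    """Collect text from <span> elements whose font size is large enough."""

    def __init__(self, min_px: float) -> None:
        super().__init__()
        self.min_px = min_px
        self.keep_stack: List[bool] = []  # one keep/drop flag per open span
        self.parts: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        style = dict(attrs).get("style") or ""
        match = FONT_SIZE.search(style)
        # Spans without a font size are kept; small ones are dropped.
        keep = match is None or float(match.group(1)) >= self.min_px
        self.keep_stack.append(keep)

    def handle_endtag(self, tag):
        if tag == "span" and self.keep_stack:
            self.keep_stack.pop()

    def handle_data(self, data):
        # Text outside any span (e.g. page anchors) is ignored.
        if self.keep_stack and self.keep_stack[-1]:
            self.parts.append(data)


def visible_text(html: str, min_px: float = 6.0) -> str:
    """Return the text of all spans with font size >= min_px."""
    parser = SizeFilter(min_px)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)
```

The input would be the output of `pdf2txt.py --output_type html`, read into a string and passed to `visible_text`.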

Aug 25 '22 14:08 pettzilla1
