pdfminer.six
Invisible character handling
Bug report
How can I handle invisible text in a PDF?
The attached PDF contains invisible/hidden text, but it still shows up in the extracted layout output.
My output is below:
Original:
@damo1808 I had this issue as well with PDF files. Couldn't find a workaround via pdfminer.
You can try running this through ghostscript to effectively remove the invisible text from the pdf.
Here's the command; you can read the Ghostscript documentation to learn more.
gswin64 -sDEVICE=pdfwrite -dBATCH -dNOPAUSE -dQUIET -dSAFER -sOutputFile=out.pdf backup.pdf
You'll need to install Ghostscript before running the command: https://www.ghostscript.com/
@damo1808 can you share the code and version of pdfminer.six you are using?
There were some fixes for this in https://github.com/pdfminer/pdfminer.six/pull/689.
I realize now that this issue is about actual text that is invisible due to the font color (or something else). The linked PR #689 is about whitespace text that is invisible due to its nature. So they are probably unrelated.
The screenshot shared by @damo1808 also suggests that this has to do with the HTML output. I can replicate this issue now.
$ python tools/pdf2txt.py ~/Downloads/backup.pdf --output_type html | grep "90 6"
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:332px; top:299px; width:35px; height:14px;"><span style="font-family: SW-Folder-Regular; font-size:14px">90 6
The output should also include the color of the text to fix this issue.
My advice would be to convert the page to HTML and run logic on that output to remove white text and text smaller than a specific size.
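A rough sketch of that filtering idea, using only the Python standard library. It assumes the HTML looks like the `pdf2txt.py --output_type html` output quoted above, where each run of text sits in a `<span>` whose `style` carries `font-size:…px`. The names `VisibleTextExtractor` and `visible_text` and the 4px threshold are made up for illustration. Note that the output shown above has no `color` in the span style, so the white-text check only helps if your HTML actually includes one.

```python
import re
from html.parser import HTMLParser

# Matches e.g. "font-size:14px" inside a style attribute.
FONT_SIZE_RE = re.compile(r"font-size:\s*(\d+(?:\.\d+)?)px")
# Matches a "color:" declaration but not "background-color:".
COLOR_RE = re.compile(r"(?<![-\w])color:\s*([^;]+)")
# Assumption: these spellings count as "white"; extend as needed.
WHITE_VALUES = {"white", "#fff", "#ffffff", "rgb(255,255,255)"}


class VisibleTextExtractor(HTMLParser):
    """Collects span text, skipping spans that look hidden."""

    def __init__(self, min_font_size=4.0):
        super().__init__()
        self.min_font_size = min_font_size  # hypothetical threshold, in px
        self._stack = []  # one flag per open <span>: True if hidden
        self.chunks = []

    def _is_hidden(self, style):
        size = FONT_SIZE_RE.search(style)
        if size and float(size.group(1)) < self.min_font_size:
            return True
        color = COLOR_RE.search(style)
        if color and color.group(1).replace(" ", "").lower() in WHITE_VALUES:
            return True
        return False

    def _inside_visible_span(self):
        return bool(self._stack) and not any(self._stack)

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            style = dict(attrs).get("style") or ""
            self._stack.append(self._is_hidden(style))
        elif tag == "br" and self._inside_visible_span():
            self.chunks.append("\n")

    def handle_endtag(self, tag):
        if tag == "span" and self._stack:
            self._stack.pop()

    def handle_data(self, data):
        # Text outside any span (page anchors etc.) is ignored.
        if self._inside_visible_span():
            self.chunks.append(data)


def visible_text(html, min_font_size=4.0):
    parser = VisibleTextExtractor(min_font_size)
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)
```

You would call it as `visible_text(open("out.html", encoding="utf-8").read())`, where `out.html` is whatever file you wrote the HTML output to. Text made invisible by other means (a text rendering mode, clipping, or being covered by an image) would not be caught by this.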