pdfplumber
how to ignore invisible characters?
In column 17 of white-X-redacted.pdf
we have
but if I click on the page and drag my mouse it highlights 5 invisible Xs:
I would like to ignore these invisible Xs but I have yet to find a way to detect that they are invisible.
- I tried exposing graphicstate, but it is <PDFGraphicState: linewidth=0, linecap=None, linejoin=None, miterlimit=None, dash=None, intent=None, flatness=None, stroking color=0, non stroking color=0> for every X
- ncs is <PDFColorSpace: DeviceGray, ncomponents=1> for every X
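One way to hunt for a distinguishing attribute is to tally the characters on the page by a few of the attributes pdfplumber reports, and check whether the suspect Xs land in a bucket of their own. A minimal sketch, assuming the character dictionaries from page.chars; the helper name tally_char_attributes and the default choice of keys are hypothetical, and for this particular PDF every bucket may well come out identical:

```python
from collections import Counter


def _hashable(value):
    # Colors may be reported as scalars, lists, or tuples; make them hashable.
    return tuple(value) if isinstance(value, (list, tuple)) else value


def tally_char_attributes(
    chars,
    keys=("fontname", "size", "stroking_color", "non_stroking_color"),
):
    """Count characters per combination of the given attribute keys."""
    counts = Counter()
    for ch in chars:
        # Missing keys are recorded as None rather than raising KeyError.
        counts[tuple(_hashable(ch.get(k)) for k in keys)] += 1
    return counts


# Possible usage (requires pdfplumber and the PDF from this issue):
#
#     import pdfplumber
#     with pdfplumber.open("white-X-redacted.pdf") as pdf:
#         tally = tally_char_attributes(pdf.pages[0].chars)
#         for attrs, n in tally.most_common():
#             print(n, attrs)
```

If the invisible Xs share every attribute with the visible text, as the output above suggests, this tally will not separate them and the difference has to be sought elsewhere.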
Any thoughts on how I can detect that these Xs are invisible?
At first inspection of the page content stream, it looks like those X glyphs are drawn outside the clip path that is in effect at the time they are drawn.
Is there a way for me to detect that so I can ignore them?
Did you find a way to ignore these invisible characters? I am having a similar issue on my side.
I have not :(
I'm a little late, but yesterday I ran across a similar issue and found a solution for my specific case, at least.
My PDF also contained invisible characters. However, the difference from @dwalton76's PDF is that the invisible characters had a different stroking color. So, for me the solution was:
import pdfplumber

def filterInvisibleObjects(obj):
    # 0 was the stroking color of the visible characters in my PDF
    return obj['stroking_color'] == 0

with pdfplumber.open('path/to/file.pdf') as pdf:
    page = pdf.pages[0]
    # Keep only the objects for which the filter returns True
    page = page.filter(filterInvisibleObjects)
    # Proceed with the content extraction. The page doesn't contain invisible characters anymore.
    ...
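A slightly more defensive variant of the filter above may be useful, since the color value can show up as a bare number or as a sequence such as (0,) depending on the PDF and the pdfplumber version, and since the filter is applied to every object on the page, not just characters. This is only a sketch under the same assumption as above (visible text has stroking color 0, in a gray or RGB color space); the names is_black and keep_visible are hypothetical:

```python
def is_black(color):
    """True for 0, (0,), [0], or (0, 0, 0); other color shapes are not handled."""
    if color is None:
        return False
    if isinstance(color, (int, float)):
        return color == 0
    # Only 1-component (gray) and 3-component (RGB) sequences are considered.
    return len(color) in (1, 3) and all(c == 0 for c in color)


def keep_visible(obj):
    # Leave lines, rects, images, etc. untouched; only test characters.
    if obj.get("object_type") != "char":
        return True
    return is_black(obj.get("stroking_color"))


# Possible usage:
#
#     page = page.filter(keep_visible)
```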
I know it doesn't solve @dwalton76's problem, but I'm leaving this comment here in case it helps someone else.
PS.: Unfortunately, I'm not allowed to post my PDF here because it contains sensitive information from a client.
Thanks for sharing, @matheusefagundes!