how to ignore invisible characters?

Open dwalton76 opened this issue 4 years ago • 6 comments

In column 17 of white-X-redacted.pdf we have this (image: white-X-01), but if I click on the page and drag my mouse, it highlights 5 invisible Xs (image: white-X-02).

I would like to ignore these invisible Xs but I have yet to find a way to detect that they are invisible.

  • I tried exposing graphicstate but it is <PDFGraphicState: linewidth=0, linecap=None, linejoin=None, miterlimit=None, dash=None, intent=None, flatness=None, stroking color=0, non stroking color=0> for every X
  • ncs is <PDFColorSpace: DeviceGray, ncomponents=1> for every X

Any thoughts on how I can detect that these Xs are invisible?

dwalton76 avatar Apr 28 '20 12:04 dwalton76

On first inspection of the page content stream, it looks like those X glyphs are drawn outside the clip path that is current at the time each one is drawn.
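
If the clip region in effect is known, the check itself comes down to a bounding-box test. Below is a minimal sketch, assuming the clip is a plain rectangle that has been read by hand from the content stream (as far as I know, pdfplumber does not report it). It uses plain dicts carrying the `x0`, `top`, `x1` and `bottom` keys that pdfplumber char objects have; `is_inside_clip` and all coordinates are hypothetical, made up for illustration.

```python
def is_inside_clip(obj, clip):
    """Return True if obj's bounding box overlaps the clip rectangle.

    clip is (x0, top, x1, bottom), in the same coordinates as obj.
    Assumption: the clip is a simple rectangle; real clip paths can be
    arbitrary shapes and can change many times within one page.
    """
    cx0, ctop, cx1, cbottom = clip
    return (
        obj["x0"] < cx1 and obj["x1"] > cx0
        and obj["top"] < cbottom and obj["bottom"] > ctop
    )


# Hypothetical clip rectangle and two hypothetical "X" characters.
clip = (100, 50, 110, 60)
visible_x = {"text": "X", "x0": 101, "top": 51, "x1": 108, "bottom": 59}
hidden_x = {"text": "X", "x0": 101, "top": 70, "x1": 108, "bottom": 78}

# Keep only the characters that fall inside the clip rectangle.
kept = [c for c in (visible_x, hidden_x) if is_inside_clip(c, clip)]
```

On a real page, a function like this could probably be handed to `page.filter(...)` through a lambda that supplies the rectangle, but that only helps when a single rectangle applies to all the characters in question.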

mkl-public avatar Apr 28 '20 14:04 mkl-public

Is there a way for me to detect that so I can ignore them?

dwalton76 avatar Apr 29 '20 15:04 dwalton76

Did you find a way to ignore these invisible characters? I am having a similar issue on my side.

Belval avatar May 14 '20 17:05 Belval

I have not :(

dwalton76 avatar May 14 '20 18:05 dwalton76

I'm a little late, but yesterday I ran across a similar issue and found a solution for my specific case, at least.

My PDF also contained invisible characters. However, unlike in @dwalton76's PDF, the invisible characters in mine had a different stroking color than the visible ones. So, for me the solution was:

import pdfplumber

def keep_visible_objects(obj):
  # Keep only objects whose stroking color is 0,
  # which was the color of the visible characters in my PDF
  return obj['stroking_color'] == 0

with pdfplumber.open('path/to/file.pdf') as pdf:
  page = pdf.pages[0]
  page = page.filter(keep_visible_objects)

  # Proceed with the content extraction. The page no longer contains invisible characters.
  ...

I know it doesn't solve @dwalton76's problem, but I'm leaving this comment here in case it helps someone else.
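
To check whether a color-based filter like this can work for a given PDF, it may help to first count which color values occur among the characters. A minimal sketch, using plain dicts in place of pdfplumber char objects; `color_counts` and the sample data are hypothetical:

```python
from collections import Counter


def color_counts(chars, key="stroking_color"):
    """Count how often each value of `key` occurs among the characters."""
    counts = Counter()
    for ch in chars:
        value = ch.get(key)
        # Colors may be given as lists; convert them so they can be counted.
        if isinstance(value, list):
            value = tuple(value)
        counts[value] += 1
    return counts


# Hypothetical sample: two visible characters and one with another color.
sample_chars = [
    {"text": "A", "stroking_color": 0},
    {"text": "B", "stroking_color": 0},
    {"text": "X", "stroking_color": 1},
]
print(color_counts(sample_chars))
```

With a real page, the same function could be called as `color_counts(page.chars)`, or with `key="non_stroking_color"` to look at the fill color instead. If every character reports the same value, as in @dwalton76's case, a color filter will not separate the invisible ones.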

PS.: Unfortunately, I'm not allowed to post my PDF here because it contains sensitive information from a client.

matheusefagundes avatar Jun 24 '21 20:06 matheusefagundes

Thanks for sharing, @matheusefagundes!

jsvine avatar Jun 25 '21 12:06 jsvine