Clipping paths implementation

Open kelvin0 opened this issue 5 years ago • 6 comments

Hi Everyone,

I've been using pdfminer for the last few months, and I really think it's a very helpful codebase.

But recently I noticed that clipping paths do not seem to be implemented. I inspected \pdfminer\pdfinterp.py:

# clip
def do_W(self):
    return

# clip-even-odd
def do_W_a(self):
    return

The effect of this is that ALL text is extracted from the PDF, even text that should not be visible (since it should be clipped).

I am not a PDF expert but I can surely help implement the following features:

  • Implement do_W and do_W_a
  • Add an option (laparams?) to force text extraction regardless of clipping paths (as occurs today)
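To make the two items above concrete, here is a minimal sketch of the filtering step, not pdfminer code. It assumes the clipping region can be approximated by a single axis-aligned rectangle, which real clipping paths need not be. Every name in it (`TextItem`, `intersects`, `visible_text`, `ignore_clipping`) is hypothetical.

```python
from typing import Iterable, List, NamedTuple, Optional, Tuple

# (x0, y0, x1, y1) in page coordinates; an assumption made for this sketch.
Rect = Tuple[float, float, float, float]


class TextItem(NamedTuple):
    """Hypothetical stand-in for a piece of extracted text and its bounding box."""
    text: str
    bbox: Rect


def intersects(a: Rect, b: Rect) -> bool:
    """Return True if the two rectangles overlap with non-zero area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def visible_text(
    items: Iterable[TextItem],
    clip: Optional[Rect],
    ignore_clipping: bool = False,
) -> List[str]:
    """Keep only text whose bounding box overlaps the clip rectangle.

    With ignore_clipping=True, or when no clip is set, all text is returned,
    which matches the behaviour described in this issue.
    """
    if ignore_clipping or clip is None:
        return [item.text for item in items]
    return [item.text for item in items if intersects(item.bbox, clip)]
```

Text that is only partly inside the clip rectangle is kept in this sketch; whether that is the right call for extraction would still need to be decided.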

I hope we can clarify this, and I'm happy to contribute to the project if needed.

kelvin0 avatar Apr 10 '20 14:04 kelvin0

Hi @kelvin0, are you experiencing problems due to this issue? I assume that the clipping operator is more often used to exclude parts of a drawing than to exclude parts of the text. Anyway, it would be nice to have a PDF to test this on.

If you want to start implementing this, have a look at section 4.4.3 of the PDF Reference manual.

You should also adjust the PDFGraphicsState class. I think it is wise to assess the impact that adding the clipping path to PDFGraphicsState could have on all the other graphics-state aware operators.
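As a rough illustration of why the graphics state matters here: in PDF the clipping path is part of the graphics state, so it is saved by `q`, restored by `Q`, and narrowed by `W` through intersection with the current path. The sketch below is not pdfminer's `PDFGraphicsState`. It is a standalone toy with hypothetical names (`SketchGraphicsState`, `SketchInterpreter`, `path_bbox`), and it assumes the current path is reduced to its bounding rectangle.

```python
from dataclasses import dataclass, replace
from typing import List, Optional, Tuple

Rect = Tuple[float, float, float, float]  # (x0, y0, x1, y1), assumed for this sketch


@dataclass(frozen=True)
class SketchGraphicsState:
    """Toy graphics state; `clip` is None until a clip operator is seen."""
    linewidth: float = 1.0
    clip: Optional[Rect] = None

    def clipped_to(self, path_bbox: Rect) -> "SketchGraphicsState":
        """Return a new state whose clip is intersected with path_bbox."""
        if self.clip is None:
            return replace(self, clip=path_bbox)
        x0 = max(self.clip[0], path_bbox[0])
        y0 = max(self.clip[1], path_bbox[1])
        x1 = min(self.clip[2], path_bbox[2])
        y1 = min(self.clip[3], path_bbox[3])
        # If the rectangles do not overlap, this yields a degenerate
        # rectangle (x0 > x1 or y0 > y1), i.e. nothing is visible.
        return replace(self, clip=(x0, y0, x1, y1))


class SketchInterpreter:
    """Toy interpreter handling only q, Q and a rectangle-based W."""

    def __init__(self) -> None:
        self.gstate = SketchGraphicsState()
        self._stack: List[SketchGraphicsState] = []

    def do_q(self) -> None:
        # Save the whole graphics state, clip included.
        self._stack.append(self.gstate)

    def do_Q(self) -> None:
        # Restore the most recently saved state, undoing later clips.
        if self._stack:
            self.gstate = self._stack.pop()

    def do_W(self, path_bbox: Rect) -> None:
        # Hypothetical signature: the bounding box of the current path is
        # passed in directly instead of being read from interpreter state.
        self.gstate = self.gstate.clipped_to(path_bbox)
```

Because the state is immutable, saving it with `q` needs no copy, and any operator that reads the state sees the clip that was active when it ran.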

pietermarsman avatar May 09 '20 12:05 pietermarsman

Clipping paths are indeed used to hide text in PDF documents. Here is an example that could be used as a starting point: https://mva.maryland.gov/Documents/VR-181.pdf

There is hidden text slightly above the "VR-181 (03-18)". I was able to extract it properly with PDFBox, but not with pdfminer, as path clipping is not supported.

Belval avatar May 15 '20 13:05 Belval

Feel free to create a PR. I can do reviews and merge it when ready.

I don't mind if the first implementation only focuses on adding clipping-path behavior and ignores additional top-level arguments for enabling/disabling it. We can create another issue for that, if needed.

pietermarsman avatar May 16 '20 14:05 pietermarsman

@kelvin0 Just a quick bump on this issue as we're trying to sort through them. Are you still willing to work on this? As commented above, a PR would be appreciated if you're still interested and able to.

jstockwin avatar Jul 09 '20 14:07 jstockwin