pdfplumber

Polygons other than rects for crop (etc)

Open pseudomonas opened this issue 9 months ago • 3 comments

I'm working with OCR'ed scans of historical documents where often the blocks of text have been rotated by a small amount (usually less than 5°) during the scanning process.

If the columns were originally printed straight, column detection along with rotation detection yields parallelograms. If the columns were printed wonky, then some other kind of polygon results from detecting the block of text.

So, what I'd like is to be able to specify (SVG-style) a list of coordinates [(x₀,y₀), (x₁,y₁), … (xₙ,yₙ)] that specify a closed polygon, and then to be able to select only the [characters|words] that fall [fully|partially] within that polygon as per pdfplumber's current tools for cropboxes.
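As a rough illustration of the request, the sketch below tests an object's bounding box against a closed polygon using the even-odd (ray casting) rule. It assumes objects are dicts with pdfplumber-style `x0`, `x1`, `top` and `bottom` keys; the function names are hypothetical and not part of pdfplumber. Checking only the four box corners is an approximation: "fully within" is exact for convex polygons such as parallelograms, and "partially within" can miss a polygon vertex that pokes into a box.

```python
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float]


def point_in_polygon(x: float, y: float, polygon: Sequence[Point]) -> bool:
    """Even-odd (ray casting) test; the polygon is implicitly closed."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x0, y0 = polygon[i]
        x1, y1 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through y?
        if (y0 > y) != (y1 > y):
            # x coordinate at which the edge crosses that line
            x_cross = x0 + (y - y0) * (x1 - x0) / (y1 - y0)
            if x < x_cross:
                inside = not inside
    return inside


def bbox_corners(obj: Dict[str, float]) -> List[Point]:
    """Four corners of an object's bounding box (pdfplumber-style keys)."""
    return [
        (obj["x0"], obj["top"]),
        (obj["x1"], obj["top"]),
        (obj["x1"], obj["bottom"]),
        (obj["x0"], obj["bottom"]),
    ]


def fully_within(obj: Dict[str, float], polygon: Sequence[Point]) -> bool:
    # Hypothetical helper: every corner of the box lies inside the polygon.
    return all(point_in_polygon(x, y, polygon) for x, y in bbox_corners(obj))


def partially_within(obj: Dict[str, float], polygon: Sequence[Point]) -> bool:
    # Hypothetical helper: at least one corner of the box lies inside the polygon.
    return any(point_in_polygon(x, y, polygon) for x, y in bbox_corners(obj))
```

With the parallelogram `[(0, 0), (100, 10), (100, 110), (0, 100)]`, a character box spanning x = 95 to 105 would count as partially but not fully within.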

An alternative might be providing a bitmap mask the same shape as the page - I think that I could reasonably easily use a third-party SVG-rendering package to generate such a thing.

pseudomonas avatar Sep 29 '23 08:09 pseudomonas

Hi @pseudomonas, and thanks for the intriguing suggestion. Do you have any interest in developing a PR for this feature? If so, I'd be happy to discuss a general strategy with you.

jsvine avatar Oct 04 '23 16:10 jsvine

I can give it a try. I see that there are various packages with "is point within polygon" functions, so I could probably hack something together using the .filter method that tests each candidate object against the polygon. I'm not sure what performance would be like, or what you think about extra dependencies.
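A dependency-free sketch of that idea: build a predicate once per polygon and hand it to the page's `.filter` method. `make_polygon_test` is a hypothetical name, and the sketch assumes every object passed to the predicate carries `x0`, `x1`, `top` and `bottom` keys. It treats an object as inside when the centre of its bounding box is inside, and rejects anything outside the polygon's own bounding box before running the edge loop, which is one cheap way to address the performance worry.

```python
from typing import Callable, Dict, Sequence, Tuple

Point = Tuple[float, float]


def make_polygon_test(polygon: Sequence[Point]) -> Callable[[Dict], bool]:
    """Build a predicate intended for use as page.filter(predicate)."""
    xs = [p[0] for p in polygon]
    ys = [p[1] for p in polygon]
    min_x, max_x, min_y, max_y = min(xs), max(xs), min(ys), max(ys)
    # Precompute the edges once, closing the polygon back to its first vertex.
    edges = list(zip(polygon, list(polygon[1:]) + [polygon[0]]))

    def test(obj: Dict) -> bool:
        # Use the centre of the object's bounding box as its position.
        cx = (obj["x0"] + obj["x1"]) / 2
        cy = (obj["top"] + obj["bottom"]) / 2
        # Cheap rejection: skip the edge loop for anything outside the
        # polygon's axis-aligned bounding box.
        if not (min_x <= cx <= max_x and min_y <= cy <= max_y):
            return False
        # Even-odd rule: count edge crossings of a ray going right from the centre.
        inside = False
        for (x0, y0), (x1, y1) in edges:
            if (y0 > cy) != (y1 > cy):
                if cx < x0 + (cy - y0) * (x1 - x0) / (y1 - y0):
                    inside = not inside
        return inside

    return test
```

Usage would then look something like `page.filter(make_polygon_test(polygon)).extract_text()`, with the polygon given in the same coordinate system as the page objects.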

For my initial project, I found I could get away with just enlarging the boxes slightly to allow for rotation, and then filtering any stray characters out of the output afterwards.
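For reference, the amount of enlargement can be worked out rather than guessed. The sketch below (a hypothetical helper, not part of pdfplumber) returns the axis-aligned box that contains a rectangle rotated about its centre by any angle from 0 up to `max_degrees`, assuming boxes are `(x0, top, x1, bottom)` tuples and the angle is between 0° and 90°.

```python
import math
from typing import Tuple

BBox = Tuple[float, float, float, float]  # (x0, top, x1, bottom)


def _max_extent(a: float, b: float, theta: float) -> float:
    """Largest value of a*cos(phi) + b*sin(phi) for phi in [0, theta]."""
    # The expression rises until phi = atan2(b, a) and falls afterwards.
    phi = min(theta, math.atan2(b, a))
    return a * math.cos(phi) + b * math.sin(phi)


def expand_for_rotation(bbox: BBox, max_degrees: float) -> BBox:
    """Box containing `bbox` rotated about its centre by up to max_degrees."""
    if not 0 <= max_degrees <= 90:
        raise ValueError("max_degrees must be between 0 and 90")
    x0, top, x1, bottom = bbox
    w, h = x1 - x0, bottom - top
    cx, cy = (x0 + x1) / 2, (top + bottom) / 2
    theta = math.radians(max_degrees)
    new_w = _max_extent(w, h, theta)  # horizontal extent after rotation
    new_h = _max_extent(h, w, theta)  # vertical extent after rotation
    return (cx - new_w / 2, cy - new_h / 2, cx + new_w / 2, cy + new_h / 2)
```

For a 200 × 600 column and 5° of rotation this widens the box by roughly 50 units in total and heightens it by roughly 15. The result may need clamping to the page bounds before it is used as a crop box.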

pseudomonas avatar Oct 04 '23 16:10 pseudomonas

Thanks, @pseudomonas! Given the niche-ness of this feature, I'm reluctant to add another required dependency, but I could see adding an optional dependency for this — something like:

import sys

def within_path(self, svg_style_path: list[tuple[int, int]]) -> DerivedPage:
  # Import lazily so that the dependency stays optional
  try:
    import name_of_dependency
  except ImportError:
    sys.stderr.write("Please install name_of_dependency to use .within_path; exiting.\n")
    sys.exit(1)
  ...  # [actual logic]

jsvine avatar Oct 06 '23 12:10 jsvine