pdfminer.six

How to extract text from a user-defined region?

Open RitchieP opened this issue 1 year ago • 6 comments

I'm currently working on a project that requires me to extract text from a region of a page defined by the user. The lack of documentation around the code makes this really challenging. I hope some experts in this library can help guide me through it.

RitchieP avatar Apr 20 '23 07:04 RitchieP

I suggest you look at PDFPlumber, which is built on pdfminer.six, actively maintained, and well documented.

QuentinAndre11 avatar Apr 20 '23 13:04 QuentinAndre11

Thanks for the suggestion! I also came across a library called PyMuPDF. I'd like to hear thoughts about it as well, and how it compares with PDFPlumber.

RitchieP avatar Apr 21 '23 13:04 RitchieP

We are working on a PDF extraction tool called hotpdf, under the MIT license. It's written on top of pdfminer.six :)

https://github.com/weareprestatech/hotpdf

@RitchieP

krishnasism avatar Feb 22 '24 11:02 krishnasism

@krishnasism I'm actually interested in a memory-optimized tool for larger PDFs built on top of pdfminer.six. I have a question though: I saw that you can extract text based on a bounding box and that you use a memory map to do so. Is it spatially indexed, or is it a simple loop like PDFPlumber?

QuentinAndre11 avatar Feb 22 '24 14:02 QuentinAndre11

It uses an indexed sparse matrix. When retrieving, it loops over the range of the query.
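
For readers unfamiliar with the idea, here is a toy sketch of a sparse, coordinate-indexed character store whose lookup loops only over the queried range. It is an illustration of the general technique and not hotpdf's actual code; the class name `SparseCharGrid` and its methods are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


class SparseCharGrid:
    """Toy sparse store: only occupied (row, col) cells are kept in memory."""

    def __init__(self) -> None:
        # row index -> {column index -> character}
        self._rows: Dict[int, Dict[int, str]] = defaultdict(dict)

    def insert(self, row: int, col: int, char: str) -> None:
        self._rows[row][col] = char

    def query(self, row0: int, col0: int, row1: int, col1: int) -> List[str]:
        """Return one string per non-empty row inside the inclusive range."""
        lines = []
        for row in range(row0, row1 + 1):
            cells = self._rows.get(row)  # .get() avoids creating empty rows
            if not cells:
                continue
            text = "".join(cells[c] for c in range(col0, col1 + 1) if c in cells)
            if text:
                lines.append(text)
        return lines


grid = SparseCharGrid()
grid.insert(2, 5, "H")
grid.insert(2, 6, "i")
grid.insert(10, 0, "X")
hits = grid.query(0, 0, 5, 10)  # only rows 0..5 and columns 0..10 are visited
```

In this sketch the lookup cost depends on the size of the queried range, whereas a simple loop checks every object on the page regardless of where the query box is.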

krishnasism avatar Feb 22 '24 14:02 krishnasism

Hm, unfortunately it doesn't fit my use case (I also need to retrieve the curves), but that's interesting

QuentinAndre11 avatar Feb 22 '24 16:02 QuentinAndre11

It's been quite some time since I last visited this page, and the project was long ago. Thanks @krishnasism for bringing this back up, and for such an awesome library!

I thought I might as well provide the solution to the issue I originally opened. (I'm sorry if my solution is a bit wrong or outdated, since I finished the project long ago.)

The steps are below:

  1. I believe I used PyPDF (or some other PDF library) to convert the PDF page into an image.
  2. Pass that image to OpenCV and draw the regions on it. (In my case, they were just rectangles.)
  3. Once the rectangles are drawn, read back their coordinates.
  4. Pass those coordinates back to pdfminer.six, which is where I could then extract text based on a region.
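
Steps 3 and 4 can be sketched roughly as follows. This is not the original project code, just a minimal reconstruction under some assumptions: the rectangle comes from OpenCV as `(x, y, w, h)` in pixels with the origin at the top-left, the whole unrotated page was rendered at a known DPI, and pdfminer.six reports layout boxes in points with the origin at the bottom-left. The helper names (`pixel_rect_to_pdf_bbox`, `overlaps`, `extract_text_in_region`) are made up for this example; `extract_pages` and `LTTextContainer` are the pdfminer.six pieces used.

```python
from typing import Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def pixel_rect_to_pdf_bbox(x: float, y: float, w: float, h: float,
                           page_height_pt: float, dpi: float = 72.0) -> BBox:
    """Convert an image rectangle (top-left origin, pixels) into a
    PDF bounding box (bottom-left origin, points)."""
    scale = 72.0 / dpi  # one PDF point is 1/72 inch
    x0, x1 = x * scale, (x + w) * scale
    y1 = page_height_pt - y * scale        # top edge of the rectangle
    y0 = page_height_pt - (y + h) * scale  # bottom edge of the rectangle
    return (x0, y0, x1, y1)


def overlaps(a: BBox, b: BBox) -> bool:
    """True if the two boxes share any area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def extract_text_in_region(pdf_path: str, page_index: int, region: BBox) -> str:
    """Collect the text of every layout text box that overlaps the region."""
    # Imported here so the helpers above work without pdfminer.six installed.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    parts = []
    for page in extract_pages(pdf_path, page_numbers=[page_index]):
        for element in page:
            if isinstance(element, LTTextContainer) and overlaps(element.bbox, region):
                parts.append(element.get_text())
    return "".join(parts)


# A 200x100 px rectangle at (100, 50) on a US Letter page rendered at 72 DPI.
region = pixel_rect_to_pdf_bbox(100, 50, 200, 100, page_height_pt=792)
# text = extract_text_in_region("example.pdf", 0, region)  # hypothetical file
```

Note that this keeps whole text boxes: a box that only partly overlaps the region is returned in full. For a tighter cut, one could descend into the individual lines or characters of each box and apply the same overlap test.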

RitchieP avatar Feb 26 '24 02:02 RitchieP