pdfminer.six
How to extract text based on a defined region by the user?
I'm currently working on a project that requires me to extract text from a user-defined region of a page. The lack of documentation around the code makes this really challenging. I'm hoping some experts in this library can help guide me through it.
I suggest you look at PDFPlumber, which is built on pdfminer.six, actively maintained, and well documented.
Thanks for the suggestion! I also came across a library called PyMuPDF. I would like to hear thoughts about that library as well, and how it compares with PDFPlumber.
We are working on a pdf extraction tool called hotpdf with full MIT license. It's written on top of pdfminer.six :)
https://github.com/weareprestatech/hotpdf
@RitchieP
@krishnasism I'm actually interested in a memory-optimized tool for larger PDFs built on top of pdfminer.six. I have a question though: I saw that you can extract text based on a bounding box and that you use a memory map to do so. Is it spatially indexed, or is it a simple loop like in PDFPlumber?
It uses an indexed sparse matrix. When retrieving, it loops only over the range covered by the query.
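For readers unfamiliar with the idea, here is a minimal sketch of a sparse grid with a range-limited lookup. It is for illustration only and is not hotpdf's actual code: the `SparseCharGrid` class, its method names, and the integer row/column layout are all assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


class SparseCharGrid:
    """Hypothetical sparse grid: only occupied (row, col) cells are stored."""

    def __init__(self) -> None:
        # Outer key is the row, inner key is the column.
        self._rows: Dict[int, Dict[int, str]] = defaultdict(dict)

    def insert(self, row: int, col: int, char: str) -> None:
        self._rows[row][col] = char

    def query(self, row0: int, col0: int, row1: int, col1: int) -> List[Tuple[int, int, str]]:
        """Return occupied cells inside the inclusive range, in reading order."""
        hits: List[Tuple[int, int, str]] = []
        for row in range(row0, row1 + 1):
            cells = self._rows.get(row)  # .get() avoids creating empty rows
            if not cells:
                continue
            for col in range(col0, col1 + 1):
                if col in cells:
                    hits.append((row, col, cells[col]))
        return hits


grid = SparseCharGrid()
grid.insert(10, 5, "a")
grid.insert(10, 6, "b")
grid.insert(50, 5, "z")
result = grid.query(0, 0, 20, 10)
```

The cost of a lookup here depends on the size of the queried range, not on the total number of characters stored for the page.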
Hm, unfortunately it doesn't fit my use case (I also need to retrieve the curves), but that's interesting
Been quite some time since I last visited this page and the project was long ago. Thanks @krishnasism for bringing this back up and such an awesome library!
I thought I might as well provide the solution to this issue I originally opened. (I'm sorry if my solution is a bit wrong or outdated since I finished the project long ago)
The steps are below:
- I believe I used PyPDF (or some other PDF library) to convert the PDF page into an image.
- Pass that image to OpenCV, and draw regions with it. (In my case, it was just rectangles)
- After I've drawn the rectangles, I could get their coordinates.
- Coordinates are then passed back to pdfminer.six, which is where I can extract text based on a region.
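The last two steps above could be sketched roughly as follows. This is a reconstruction under assumptions, not the original project code: the helper names (`pixel_rect_to_pdf_bbox`, `bbox_inside`, `extract_text_in_region`) are made up for illustration, the page is assumed to be unrotated with its origin at the bottom-left corner, and `scale` is assumed to be the render resolution in pixels per PDF point (for example DPI / 72). Only `extract_pages`, `LTTextContainer`, and `LTTextLine` come from pdfminer.six.

```python
from typing import Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in PDF points


def pixel_rect_to_pdf_bbox(x: float, y: float, w: float, h: float,
                           scale: float, page_height: float) -> BBox:
    """Convert an image rectangle (top-left origin, pixels) to a PDF bbox
    (bottom-left origin, points). `scale` is pixels per point (assumed)."""
    x0 = x / scale
    x1 = (x + w) / scale
    y1 = page_height - y / scale          # top edge of the rectangle
    y0 = page_height - (y + h) / scale    # bottom edge of the rectangle
    return (x0, y0, x1, y1)


def bbox_inside(inner: BBox, outer: BBox) -> bool:
    """True if `inner` lies entirely within `outer`."""
    return (inner[0] >= outer[0] and inner[1] >= outer[1]
            and inner[2] <= outer[2] and inner[3] <= outer[3])


def extract_text_in_region(pdf_path: str, page_number: int, region: BBox) -> str:
    """Collect text lines on one page (zero-based index) that fall inside `region`."""
    # Imported here so the coordinate helpers work without pdfminer.six installed.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer, LTTextLine

    lines = []
    for page in extract_pages(pdf_path, page_numbers=[page_number]):
        for element in page:
            if not isinstance(element, LTTextContainer):
                continue
            for line in element:
                if isinstance(line, LTTextLine) and bbox_inside(line.bbox, region):
                    lines.append(line.get_text())
    return "".join(lines)


# A rectangle drawn on a page rendered at 144 DPI (scale 2.0), US Letter height.
region = pixel_rect_to_pdf_bbox(100, 200, 300, 50, scale=2.0, page_height=792)
```

Calling `extract_text_in_region("example.pdf", 0, region)` would then return the text of the lines fully contained in the rectangle on the first page ("example.pdf" is a placeholder path). Requiring full containment is a design choice. Testing for overlap instead would also pick up lines that are only partly covered by the rectangle.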