Use PLAYA instead of pdfminer

Open dhdaines opened this issue 1 year ago • 6 comments

So... I went ahead and rewrote large parts of pdfminer.six, because I kept having nightmares about being back in Software Engineering 101 every time I looked at its code. The result is PLAYA, which does less stuff than pdfminer.six but I believe does it somewhat better (and about 20% faster).

This PR uses it, and also as a consequence fixes a few longstanding issues due to pdfminer's quirks. Some of these quirks have not been fixed yet (e.g. the placement of things relative to the MediaBox, lack of actual support for pattern color spaces) but should be soon.

On the downside, LAParams no longer exists and thus cannot be used. What it actually did was mostly change the ordering of items on the page and heuristically detect whitespace in text, replicating things that pdfplumber was already doing. (In general, this is true of all the "layout analysis" pdfminer did.)
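
To make the whitespace heuristic concrete, here is a minimal sketch of the general idea. It is not pdfminer's or pdfplumber's actual code: characters on one line are sorted by horizontal position, and a space is inferred wherever the gap between neighbours exceeds a tolerance. The `join_chars` helper, the `(text, x0, x1)` tuples, and the tolerance value are all assumptions made for illustration.

```python
# Illustrative sketch only: a gap-based whitespace heuristic.
# `join_chars` and the (text, x0, x1) tuples are hypothetical, not a real API.

def join_chars(chars, x_tolerance=3.0):
    """Join characters on one line, inserting a space at large gaps."""
    out = []
    prev_x1 = None
    # Process characters left to right, regardless of their input order.
    for text, x0, x1 in sorted(chars, key=lambda c: c[1]):
        # A gap wider than the tolerance is treated as a word boundary.
        if prev_x1 is not None and x0 - prev_x1 > x_tolerance:
            out.append(" ")
        out.append(text)
        prev_x1 = x1
    return "".join(out)


line = [("H", 0, 5), ("i", 5, 7), ("y", 15, 20), ("o", 20, 25), ("u", 25, 30)]
print(join_chars(line))  # prints "Hi you"
```

Because the input is sorted before joining, the same function also covers the "change the ordering of items" part for a single line of text.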

I have tried to keep the API reasonable and compact so that it could ultimately be reimplemented on some other PDF parser. Note however that the API is subject to change - this PR is using the "eager" API which is kind of custom made for pdfplumber and also retains some pdfplumber quirks, and thus might not stick around.

Do not merge this, for obvious reasons! It's here in case you or anyone somehow feel the desire to play with it.

dhdaines, Nov 20 '24 05:11

Fascinating! Thank you for sharing. An idle thought: What if pdfplumber could allow users to choose their parsing backend? Would require pdfplumber to develop some additional abstractions, but might be a neat way to support more experimentation like this.

jsvine, Nov 20 '24 12:11

> Fascinating! Thank you for sharing. An idle thought: What if pdfplumber could allow users to choose their parsing backend? Would require pdfplumber to develop some additional abstractions, but might be a neat way to support more experimentation like this.

This wouldn't be terribly hard to do - it would be a useful exercise as some of the representations used by pdfplumber are inadvertently specific to pdfminer.six. I think it would be worthwhile for pdfplumber to explicitly define its data models, whether it's with pydantic or something else (you could just make a JSON Schema for instance).
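
As a rough illustration of what an explicit data model plus a swappable parsing backend could look like, here is a sketch using only the standard library (`dataclasses` standing in for pydantic or a JSON Schema). Every name in it (`CharRecord`, `Backend`, `FakeBackend`, `extract_text`) and the choice of fields are hypothetical; none of this is existing pdfplumber or PLAYA API.

```python
# Hypothetical sketch: none of these names exist in pdfplumber or PLAYA.
from dataclasses import asdict, dataclass
from typing import Iterator, Protocol


@dataclass(frozen=True)
class CharRecord:
    """An explicit, backend-neutral model of one character on a page."""
    text: str
    x0: float
    x1: float
    top: float
    bottom: float
    fontname: str
    size: float


class Backend(Protocol):
    """Anything that can yield CharRecord objects for a given page."""
    def chars(self, page_number: int) -> Iterator[CharRecord]: ...


class FakeBackend:
    """Stand-in parser returning fixed data, for demonstration only."""
    def chars(self, page_number: int) -> Iterator[CharRecord]:
        yield CharRecord("A", 10.0, 16.0, 20.0, 32.0, "Helvetica", 12.0)


def extract_text(backend: Backend, page_number: int) -> str:
    # Depends only on the model, not on which parser produced it.
    return "".join(c.text for c in backend.chars(page_number))


print(extract_text(FakeBackend(), 1))
print(asdict(next(FakeBackend().chars(1))))
```

The point of the design is that downstream code such as `extract_text` only sees the declared model, so a second backend would just need to produce the same records.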

The goal of PLAYA is just to be a Pythonic and lazy wrapper around the internals of PDF; obviously pdfplumber (and Camelot, and unstructured.io, and so on) is what you want for actual information extraction.

(I will probably change the recursive acronym to "PLAYA is a LAzY Analyzer for PDF" 🤣)

dhdaines, Nov 20 '24 14:11

I may wish to promote this to a real PR shortly (awaiting a release of PLAYA that will fix a couple of important bugs).

PLAYA is much more robust to broken PDFs than pdfminer.six, supports color spaces and patterns more correctly, and is also significantly faster.

For a 486-page PDF document, running extract_words using pdfminer takes 1:46 (minutes:seconds) on my (old) computer.

With PLAYA it takes 1:16 ... about 28% less time!
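
For readers checking the numbers, the 28% figure is the reduction in wall-clock time, which can be reproduced from the two timings quoted above. The variable names below are only for illustration.

```python
# Worked arithmetic for the timings quoted above (1:46 vs 1:16).
pdfminer_s = 1 * 60 + 46   # 106 seconds
playa_s = 1 * 60 + 16      # 76 seconds

time_saved = (pdfminer_s - playa_s) / pdfminer_s  # fraction of runtime removed
throughput_ratio = pdfminer_s / playa_s           # how many times faster

print(f"runtime reduction: {time_saved:.1%}")        # runtime reduction: 28.3%
print(f"throughput ratio: {throughput_ratio:.2f}x")  # throughput ratio: 1.39x
```

Expressed as throughput rather than time saved, the same measurement is roughly a 1.39x improvement.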

dhdaines, Dec 13 '24 14:12

Really neat to see you developing this so rapidly, and great to hear about that speedup.

jsvine, Dec 16 '24 03:12

> Really neat to see you developing this so rapidly, and great to hear about that speedup.

Thanks! I keep on finding interesting bugs in pdfminer.six, unfortunately... these ones are fixed in PLAYA:

https://github.com/pdfminer/pdfminer.six/issues/1065
https://github.com/pdfminer/pdfminer.six/issues/1067

This one isn't yet (and it's kind of nasty since it causes text extraction to simply fail silently on some files):

https://github.com/pdfminer/pdfminer.six/issues/1072

dhdaines, Dec 16 '24 04:12

> Really neat to see you developing this so rapidly, and great to hear about that speedup.

> Thanks! I keep on finding interesting bugs in pdfminer.six, unfortunately...

they're all fixed now :) but you should look at #1272 instead!

dhdaines, Feb 07 '25 19:02