pdfplumber Use PAVÉS instead of pdfminer

This supersedes #1226 (alas, all was in vain). PAVÉS is a library that (among other things) uses PLAYA-PDF to provide a mostly drop-in replacement for pdfminer.six, minus a number of bugs and limitations.

It is somewhat faster and can also use multiple CPUs. This PR doesn't do that, since it isn't totally clear how to fit that into pdfplumber, but I will take a look at it when I get a minute.

In contrast to #1226, this means that you can still use custom LAParams, for instance, while still getting marked content sections, color spaces that make sense, etc.

Feb 07 '25 18:02 dhdaines

This is unfortunately a bit slower than pdfminer.six, in part because of the overhead of making a zillion useless LTChar and other objects before creating the final pdfplumber objects, but also because it adds some extra information that pdfminer.six was incapable of supplying.

Running `time pdfplumber ../PDF32000_2008.pdf >/dev/null` (that's the 756-page PDF 1.7 standard) on a fairly slow computer (Core i7-860 circa 2012), I get these results.

Using pdfminer.six (current develop branch):

real    5m5.912s
user    4m59.665s
sys     0m6.212s

Using PLAYA (branch in #1226):

real    4m32.255s
user    4m26.192s
sys     0m6.023s

Using PAVÉS (this branch):

real    5m20.015s
user    5m13.360s
sys     0m6.607s

We could definitely optimize this by using the code from #1226 in the case where there are no custom LAParams to worry about (since we have PLAYA already anyway).
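A minimal sketch of that dispatch, under a few assumptions: `extract_fast` stands in for the direct PLAYA code from #1226 and `extract_with_layout` stands in for the PAVÉS path. Both names are hypothetical and exist in none of pdfplumber, PLAYA, or PAVÉS. It also assumes that `laparams=None` means "no layout analysis requested", which is how I understand pdfplumber's `open(..., laparams=...)` to behave.

```python
from typing import Any, Dict, List, Optional


def extract_fast(page: Any) -> List[dict]:
    """Hypothetical stand-in for the direct PLAYA path (no layout analysis)."""
    return [{"engine": "playa", "page": page}]


def extract_with_layout(page: Any, laparams: Dict[str, Any]) -> List[dict]:
    """Hypothetical stand-in for the PAVÉS path, which honors custom LAParams."""
    return [{"engine": "paves", "page": page, "laparams": dict(laparams)}]


def extract_objects(page: Any, laparams: Optional[Dict[str, Any]] = None) -> List[dict]:
    """Take the cheaper path unless the caller asked for layout analysis."""
    if laparams is None:
        # Nothing to feed a layout analyzer, so skip building LTChar-style objects.
        return extract_fast(page)
    # An explicit dict (even an empty one) is treated as a request for layout analysis.
    return extract_with_layout(page, laparams)
```

The check is `is None` rather than a truthiness test so that an empty dict still selects the layout-aware path with default parameters.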

Sadly, I checked, and there is no easy way to support the parallelism of PLAYA and PAVÉS with the current pdfplumber interface; otherwise it could be 2-3x faster.

Feb 07 '25 20:02 dhdaines

> This is unfortunately a bit slower than pdfminer.six ... We could definitely optimize this by using the code from #1226 in the case where there are no custom LAParams to worry about (since we have PLAYA already anyway).

Okay, I did that and now it is faster again:

real    4m34.727s
user    4m28.572s
sys     0m6.123s

I should however mention that running the pdfplumber CLI on a big document like that is horrendously memory-inefficient since it processes all the pages at once before printing any results. I might make another PR for that (which could, potentially, use the parallelism in PLAYA).
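To illustrate the difference, here is a rough sketch of eager versus streaming page processing. It is not the pdfplumber CLI's actual code: `fake_process` is a hypothetical stand-in for extracting and serializing one page, and pages are represented by plain integers.

```python
from typing import Callable, Iterable, Iterator, List


def process_all_at_once(pages: Iterable[int], process: Callable[[int], str]) -> List[str]:
    """Eager version: every page's result is held in memory before any is printed."""
    return [process(p) for p in pages]


def process_streaming(pages: Iterable[int], process: Callable[[int], str]) -> Iterator[str]:
    """Lazy version: yield each page's result as soon as it is ready."""
    for p in pages:
        yield process(p)


def fake_process(page_number: int) -> str:
    # Hypothetical stand-in for the real per-page extraction work.
    return f"page {page_number}"


if __name__ == "__main__":
    # Each line is printed before the next page is processed.
    for line in process_streaming(range(1, 4), fake_process):
        print(line)
```

With the generator, peak memory is bounded by one page's output rather than the whole document's, and the same loop shape would leave room for handing pages to worker processes later.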

Feb 07 '25 21:02 dhdaines

Impressive! Do you think there's an approach to implementing this where pdfminer.six and paves could be interchangeable? I.e., the user could select which engine to use?

Feb 11 '25 03:02 jsvine

> Impressive! Do you think there's an approach to implementing this where pdfminer.six and paves could be interchangeable? I.e., the user could select which engine to use?

Hmm, this wouldn't be terribly complicated given that the API is nearly identical - if you want to guarantee backward compatibility this would be the right way to go for the moment.
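Roughly, the selection could look like the sketch below. Everything in it is an assumption for illustration: the `engine` parameter, the `ENGINES` registry, `open_pdf`, and the two backend functions are hypothetical and are not part of pdfplumber's current API.

```python
from typing import Callable, Dict, List

# A backend takes a path and returns extracted objects (plain strings here).
Engine = Callable[[str], List[str]]


def _pdfminer_engine(path: str) -> List[str]:
    """Hypothetical stand-in for extraction via pdfminer.six."""
    return [f"pdfminer.six objects from {path}"]


def _paves_engine(path: str) -> List[str]:
    """Hypothetical stand-in for extraction via PAVÉS."""
    return [f"PAVÉS objects from {path}"]


ENGINES: Dict[str, Engine] = {
    "pdfminer": _pdfminer_engine,
    "paves": _paves_engine,
}


def open_pdf(path: str, engine: str = "pdfminer") -> List[str]:
    """Dispatch to the chosen backend; the default preserves existing behavior."""
    try:
        backend = ENGINES[engine]
    except KeyError:
        raise ValueError(
            f"unknown engine {engine!r}; expected one of {sorted(ENGINES)}"
        ) from None
    return backend(path)
```

Keeping pdfminer.six as the default is what would guarantee backward compatibility, with the other engine strictly opt-in.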

Ultimately I think PAVÉS could support the same subset of the pdfminer.six API using something else (something faster) under the hood such as pypdfium2 - I think this might be possible with the get_objects method: https://pypdfium2.readthedocs.io/en/v4/python_api.html#pypdfium2._helpers.page.PdfPage.get_objects

Feb 13 '25 14:02 dhdaines