Use PAVÉS instead of pdfminer
This supersedes #1226 (alas, all was in vain). PAVÉS is a library that (among other things) uses PLAYA-PDF to provide a mostly drop-in replacement for pdfminer.six, minus a number of bugs and limitations.
It is somewhat faster and can also use multiple CPUs. This PR doesn't do that, since it isn't totally clear how to fit it into pdfplumber, but I will take a look at it when I get a minute.
In contrast to #1226, this means that you can still use custom LAParams, for instance. And you still get marked content sections, color spaces that make sense, etc.
This is unfortunately a bit slower than pdfminer.six, in part because of the overhead of making a zillion useless LTChar and other objects before creating the final pdfplumber objects, but also because it adds some extra information that pdfminer.six was incapable of supplying.
Running time pdfplumber ../PDF32000_2008.pdf >/dev/null (that's the 756-page PDF 1.7 standard) on a fairly slow computer (Core i7-860 circa 2012), I get these results.
Using pdfminer.six (current develop branch):
real 5m5.912s
user 4m59.665s
sys 0m6.212s
Using PLAYA (branch in #1226):
real 4m32.255s
user 4m26.192s
sys 0m6.023s
Using PAVÉS (this branch):
real 5m20.015s
user 5m13.360s
sys 0m6.607s
We could definitely optimize this by using the code from #1226 in the case where there are no custom LAParams to worry about (since we have PLAYA already anyway).
Sadly, I checked, and there is no easy way to support the parallelism of PLAYA and PAVÉS with the current pdfplumber interface. Otherwise it could be 2-3x faster.
This is unfortunately a bit slower than pdfminer.six... We could definitely optimize this by using the code from #1226 in the case where there are no custom LAParams to worry about (since we have PLAYA already anyway).
Okay, I did that and now it is faster again:
real 4m34.727s
user 4m28.572s
sys 0m6.123s
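For anyone following along, the fast path amounts to a dispatch on whether custom LAParams were given. Here is a rough sketch of the idea, not the actual code in this branch: `extract_objects`, `fast_path` and `layout_path` are hypothetical names, and I'm assuming "no custom LAParams" means `laparams is None`.

```python
from typing import Any, Callable, Optional


def extract_objects(
    page: Any,
    laparams: Optional[dict] = None,
    *,
    fast_path: Callable[[Any], list],
    layout_path: Callable[[Any, dict], list],
) -> list:
    """Pick an extraction path for one page (hypothetical helper).

    fast_path stands in for the direct PLAYA-based code from #1226;
    layout_path stands in for the PAVÉS pdfminer-compatible route.
    """
    if laparams is None:
        # No layout analysis requested, so the intermediate LTChar-style
        # objects are not needed and we can skip creating them.
        return fast_path(page)
    # Custom LAParams need the pdfminer-style layout machinery.
    return layout_path(page, laparams)
```

The two callables are passed in only to keep the sketch self-contained. In practice they would be whatever functions build pdfplumber objects from each backend.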
I should however mention that running the pdfplumber CLI on a big document like that is horrendously memory-inefficient since it processes all the pages at once before printing any results. I might make another PR for that (which could, potentially, use the parallelism in PLAYA).
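To illustrate what I mean, here is a minimal sketch of emitting output page by page instead of collecting everything first. It is not the current CLI code: `iter_page_output`, `write_streaming` and the `render` callable are hypothetical names, and `pages` is just any iterable of page objects.

```python
import sys
from typing import Any, Callable, Iterable, Iterator, TextIO


def iter_page_output(
    pages: Iterable[Any], render: Callable[[Any], str]
) -> Iterator[str]:
    """Lazily yield the rendered output of each page, one at a time."""
    for page in pages:
        # Only one page's output exists at a time; nothing is accumulated.
        yield render(page)


def write_streaming(
    pages: Iterable[Any], render: Callable[[Any], str], out: TextIO = sys.stdout
) -> int:
    """Write each page's output as soon as it is ready; return the page count."""
    count = 0
    for chunk in iter_page_output(pages, render):
        out.write(chunk)
        count += 1
    return count
```

Because the generator only touches a page when its output is requested, peak memory is bounded by a single page's results rather than the whole document, provided the page objects themselves are released once written.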
Impressive! Do you think there's an approach to implementing this where pdfminer.six and paves could be interchangeable? I.e., the user could select which engine to use?
Hmm, this wouldn't be terribly complicated, given that the API is nearly identical. If you want to guarantee backward compatibility, this would be the right way to go for the moment.
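Something along these lines could work as a starting point. This is only a sketch under assumptions: the `engine` names, `ENGINE_MODULES`, `resolve_engine` and `load_engine` are all hypothetical, and I'm assuming the pdfminer-compatible entry points live in `pdfminer.high_level` and `paves.miner` respectively.

```python
import importlib
from types import ModuleType

# Hypothetical mapping from a user-facing engine name to the module that
# provides the pdfminer.six-style API. The module paths are assumptions.
ENGINE_MODULES = {
    "pdfminer": "pdfminer.high_level",
    "paves": "paves.miner",
}


def resolve_engine(name: str = "pdfminer") -> str:
    """Return the module path for an engine name, defaulting to pdfminer."""
    try:
        return ENGINE_MODULES[name]
    except KeyError:
        known = ", ".join(sorted(ENGINE_MODULES))
        raise ValueError(f"unknown engine {name!r}; expected one of: {known}") from None


def load_engine(name: str = "pdfminer") -> ModuleType:
    """Import the selected engine lazily, so only the chosen one must be installed."""
    return importlib.import_module(resolve_engine(name))
```

Keeping pdfminer.six as the default would preserve existing behaviour, and the lazy import means PAVÉS would stay an optional dependency.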
Ultimately, I think PAVÉS could support the same subset of the pdfminer.six API using something else (something faster) under the hood, such as pypdfium2. I think this might be possible with the get_objects method: https://pypdfium2.readthedocs.io/en/v4/python_api.html#pypdfium2._helpers.page.PdfPage.get_objects