pdfannots

Use PLAYA and PAVÉS instead of pdfminer.six

Open: dhdaines opened this issue 10 months ago • 3 comments

As the subject says! This helped me fix a very annoying bug, so I'm glad I did it.

It's only marginally faster, but you should also get extra robustness against broken PDFs.

Getting it to support the parallel processing that PLAYA can do is a bit more work, but I might give it a try another day...

dhdaines, Feb 28 '25 13:02

Also note that there are a number of deprecation warnings; that's what the TODO comments are about. Making the required changes is probably pretty straightforward and should simplify the code, so let me know if you want me to do that (though I may also move the deprecated APIs into PAVÉS so that it can more properly emulate pdfminer.six).

dhdaines, Feb 28 '25 13:02

Thanks for the PR! This is certainly interesting, given that it's unclear whether anyone is maintaining pdfminer, but I won't rush it. I'll try this out for my own use in the next month to get some confidence with it.

Personally I'm not super interested in parallel analysis -- most of the PDFs I use this with are 10-15 pages long.

0xabu, Mar 02 '25 18:03

> Thanks for the PR! This is certainly interesting, given that it's unclear whether anyone is maintaining pdfminer, but I won't rush it. I'll try this out for my own use in the next month to get some confidence with it.

Sure, take your time! I'm glad I did this anyway, as it helped me find and fix some bugs. If you find some PDFs that fail with pdfminer it would be interesting to try them here.

> Personally I'm not super interested in parallel analysis -- most of the PDFs I use this with are 10-15 pages long.

Yeah, after looking at it I realized that the use case of pdfannots isn't at all the kind of huge PDFs that I'm used to dealing with! If somebody had manually annotated hundreds of pages I would be really surprised :-)

I notice there are some build failures, which are probably just a configuration issue in the workflows: the dependencies are not being pulled in properly for mypy.

dhdaines, Mar 02 '25 19:03