BUG: sweep indirect references when cloning from PdfReader object

Open mtd91429 opened this issue 1 year ago • 4 comments

Addresses #1185

Jul 31 '22 20:07 mtd91429

Codecov Report

Merging #1191 (2242dca) into main (0a6676f) will increase coverage by 0.00%. The diff coverage is 100.00%.

@@           Coverage Diff           @@
##             main    #1191   +/-   ##
=======================================
  Coverage   92.10%   92.10%           
=======================================
  Files          24       24           
  Lines        4913     4914    +1     
  Branches     1017     1017           
=======================================
+ Hits         4525     4526    +1     
  Misses        244      244           
  Partials      144      144

Impacted Files	Coverage Δ
PyPDF2/_writer.py	`88.18% <100.00%> (+0.02%)`	:arrow_up:

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 0a6676f...2242dca. Read the comment docs.

Jul 31 '22 20:07 codecov[bot]

Hey, just a short heads-up: I haven't forgotten this PR, but it is one of the more complex ones for me to merge. I'm uncertain about the implications (e.g. on performance) it has. Also, I don't yet understand the problem it solves.

I hope I can take some time on the weekend to review + merge + release it :-)

Aug 04 '22 20:08 MartinThoma

I changed this to a draft.

To assess the impacet, the following code was run off the current main branch (not the current pull request) and on my 12 year old laptop (so YMMV) in Jupyter Lab: timeit_tika-924546

The above screenshot is run on the test PDF tika-924546.pdf.

Using a much larger and complicated document (personal, cannot share) which is ~112 MB and ~4,000 pages, results in significant delays with the _sweep_indirect_references method. Simply executing it once on my device took ~13 minutes (based off the notebook metadata). Again, this is a super old computer, and the simple execution time was too long to bother using the timeit function. However, I'd guess even on current top-of-the-line hardware, there would still be a notable performance lag. In comparison to commercial software on the same device, saving the document after making small annotations took ~5 seconds with PDFXchange viewer and ~10 seconds for Adobe Acrobat (based on me counting in my head, lol). It's hard to know how much is due to performance benefits with pre-compiled code vs algorithm efficiency differences.

So adding _sweep_indirect_references to the clone_document_from_reader process will likely result in performance hits which become more significant with increasingly large and complex documents. The _sweep_indirect_references method is still used for every PdfWrier.write() call, so the performance hit is always present; but adding it here could potentially double the cost.

Aug 05 '22 17:08 mtd91429

Because of the performance issues, it may not be desirable to add _sweep_indirect_methods to the clone_document_from_reader method. Rather, I'll probably generate helper methods similar to the append_pages_from_reader method which addresses the named destinations and outline items and edit this PR.

Aug 05 '22 17:08 mtd91429

How should we continue with this PR?

Dec 10 '22 18:12 MartinThoma

I'm closing this PR now. We had issues with _sweep_indirect_references before and I'm not sure if this PR improves anything.

I'm open to re-opening it again if anybody disagrees with this decision.

Jan 06 '23 18:01 MartinThoma

pypdf pypdf copied to clipboard

BUG: sweep indirect references when cloning from PdfReader object

Codecov Report

pypdf
pypdf copied to clipboard