pypdf
pypdf copied to clipboard
BUG: sweep indirect references when cloning from PdfReader object
Addresses #1185
Codecov Report
Merging #1191 (2242dca) into main (0a6676f) will increase coverage by
0.00%
. The diff coverage is100.00%
.
@@ Coverage Diff @@
## main #1191 +/- ##
=======================================
Coverage 92.10% 92.10%
=======================================
Files 24 24
Lines 4913 4914 +1
Branches 1017 1017
=======================================
+ Hits 4525 4526 +1
Misses 244 244
Partials 144 144
Impacted Files | Coverage Δ | |
---|---|---|
PyPDF2/_writer.py | 88.18% <100.00%> (+0.02%) |
:arrow_up: |
Continue to review full report at Codecov.
Legend - Click here to learn more
Δ = absolute <relative> (impact)
,ø = not affected
,? = missing data
Powered by Codecov. Last update 0a6676f...2242dca. Read the comment docs.
Hey, just a short heads-up: I haven't forgotten this PR, but it is one of the more complex ones for me to merge. I'm uncertain about the implications (e.g. on performance) it has. Also, I don't yet understand the problem it solves.
I hope I can take some time on the weekend to review + merge + release it :-)
I changed this to a draft.
To assess the impacet, the following code was run off the current main branch (not the current pull request) and on my 12 year old laptop (so YMMV) in Jupyter Lab:
The above screenshot is run on the test PDF tika-924546.pdf.
Using a much larger and complicated document (personal, cannot share) which is ~112 MB and ~4,000 pages, results in significant delays with the _sweep_indirect_references
method. Simply executing it once on my device took ~13 minutes (based off the notebook metadata). Again, this is a super old computer, and the simple execution time was too long to bother using the timeit
function. However, I'd guess even on current top-of-the-line hardware, there would still be a notable performance lag. In comparison to commercial software on the same device, saving the document after making small annotations took ~5 seconds with PDFXchange viewer and ~10 seconds for Adobe Acrobat (based on me counting in my head, lol). It's hard to know how much is due to performance benefits with pre-compiled code vs algorithm efficiency differences.
So adding _sweep_indirect_references
to the clone_document_from_reader
process will likely result in performance hits which become more significant with increasingly large and complex documents. The _sweep_indirect_references
method is still used for every PdfWrier.write()
call, so the performance hit is always present; but adding it here could potentially double the cost.
Because of the performance issues, it may not be desirable to add _sweep_indirect_methods
to the clone_document_from_reader
method. Rather, I'll probably generate helper methods similar to the append_pages_from_reader
method which addresses the named destinations and outline items and edit this PR.
How should we continue with this PR?
I'm closing this PR now. We had issues with _sweep_indirect_references
before and I'm not sure if this PR improves anything.
I'm open to re-opening it again if anybody disagrees with this decision.