pypdf icon indicating copy to clipboard operation
pypdf copied to clipboard

ENH: Optimize getting named destinations

Open larsga opened this issue 6 months ago • 9 comments

This optimization solves the performance issue introduced by PR #3400.

Without this optimization test_merger.py takes 124 seconds on my laptop after PR 3400. With this optimization it takes 6.6 seconds.

Very likely this optimization will benefit other code as well.

larsga avatar Aug 21 '25 08:08 larsga

@stefan6419846 I need some help with this. ruff and Python both are complaining about Dict in my added code, claiming it's not declared. It is imported on line 39, though. And other code uses it in exactly the same way, without ruff or Python complaining. It works for me in both Python and ruff locally. What is going on?

larsga avatar Aug 21 '25 08:08 larsga

Please try to rebase on the latest main - your base appears to be outdated and the corresponding import is gone in main.

stefan6419846 avatar Aug 21 '25 19:08 stefan6419846

Codecov Report

:white_check_mark: All modified and coverable lines are covered by tests. :white_check_mark: Project coverage is 96.98%. Comparing base (bc318d7) to head (8465c0d). :warning: Report is 5 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #3442   +/-   ##
=======================================
  Coverage   96.97%   96.98%           
=======================================
  Files          54       54           
  Lines        9337     9344    +7     
  Branches     1711     1713    +2     
=======================================
+ Hits         9055     9062    +7     
  Misses        168      168           
  Partials      114      114           

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

:rocket: New features to boost your workflow:
  • :snowflake: Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

codecov[bot] avatar Aug 21 '25 19:08 codecov[bot]

That worked. Thank you! The PR should be ready now.

larsga avatar Aug 21 '25 19:08 larsga

Thanks for the PR. This code seems to assume that the named destinations never change. Is this really a valid assumption or will this create unexpected behavior in specific cases?

stefan6419846 avatar Sep 08 '25 09:09 stefan6419846

It does assume the destinations never change. Originally I put this cache on the shared superclass for readers and writers, but then had the same realization as you. So now this code is only in the PdfReader class. That works for the purposes of the link rewriting code etc, but it does mean that getting named destinations from ´PdfWriter` is as slow as it used to be.

larsga avatar Sep 08 '25 09:09 larsga

Thanks for the explanation - indeed, in the reader it should not be an issue. Are we able to add a test for this?

stefan6419846 avatar Sep 11 '25 11:09 stefan6419846

Hmmm. What sort of tests are you looking for? Tests of changing the named destinations in the reader? I guess not, since that's not supposed to be possible. Tests of changing the named destinations in the writer? It doesn't use this code, but I guess I could add. Or just tests of reading named destinations at all? Those might already exist for all I know, but I could check, once I know what you're thinking.

larsga avatar Sep 11 '25 11:09 larsga

Sorry for not being explicit enough - I am referring to the fact that these are cached. One possible approach would be to check the time for handling many destinations the first and the second time, although I am open for other tests as well.

stefan6419846 avatar Sep 11 '25 11:09 stefan6419846