pypdf
MAINT: Simplify file identifiers generation
Codecov Report
Attention: 1 lines in your changes are missing coverage. Please review.
Comparison is base (ec85a27) 94.54% compared to head (40bb17f) 94.52%. Report is 1 commits behind head on main.
Files | Patch % | Lines |
---|---|---|
pypdf/_writer.py | 90.00% | 0 Missing and 1 partial :warning: |
Additional details and impacted files
@@ Coverage Diff @@
## main #2003 +/- ##
==========================================
- Coverage 94.54% 94.52% -0.02%
==========================================
Files 43 43
Lines 7549 7549
Branches 1490 1491 +1
==========================================
- Hits 7137 7136 -1
Misses 253 253
- Partials 159 160 +1
What impact do the file identifiers have? Who/what makes use of them?
The PDF standard says:
The calculation of the file identifier need not be reproducible; all that matters is that the identifier is likely to be unique. For example, two implementations of the preceding algorithm might use different formats for the current time, causing them to produce different file identifiers for the same file created at the same time, but the uniqueness of the identifier is not affected.
The identifiers are also used for encryption.
@MartinThoma So I think it's OK to keep it simple.
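To make the trade-off concrete, here is a minimal sketch of the two approaches being discussed. It is not pypdf's actual code: both function names are hypothetical, and the 16-byte length is an assumption chosen to match the size of an MD5 digest.

```python
import hashlib
import secrets


def random_file_identifier() -> bytes:
    """Hypothetical helper: 16 random bytes.

    Very likely unique, cheap to compute, but not reproducible.
    """
    return secrets.token_bytes(16)


def content_file_identifier(pdf_bytes: bytes) -> bytes:
    """Hypothetical helper: MD5 digest of the whole document.

    Reproducible for identical input, but the cost grows with file size.
    """
    return hashlib.md5(pdf_bytes).digest()
```

The first variant satisfies the "likely to be unique" requirement quoted above; the second is the kind of deterministic identifier that reproducible-build users care about.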
Having a deterministic way to generate PDFs is valuable to several developers. Does the current deterministic identifier generation cause any issues?
First of all, it costs too much for big PDF files.
And for AES-encrypted PDFs, it's not deterministic.
When PdfWriter.encrypt is called, the identifiers are generated from the unencrypted PDF stream. Then, when PdfWriter.write is called, the content of the PDF file is encrypted, so the hash changes.
For encrypted PDFs, the identifiers must be generated before writing to the stream, since the identifier is used to calculate the key, so the identifiers cannot be a hash of the PDF stream content.
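The ordering problem can be illustrated with a toy model. This is only a sketch: `derive_key` and `toy_encrypt` are hypothetical stand-ins, not the key derivation or ciphers defined by the PDF standard, and not pypdf's implementation. It only shows that once the key depends on the identifier, the identifier cannot be a hash of the bytes that finally get written.

```python
import hashlib


def derive_key(password: bytes, file_id: bytes) -> bytes:
    # Toy key derivation (assumption): the key depends on the file identifier.
    return hashlib.md5(password + file_id).digest()


def toy_encrypt(key: bytes, data: bytes) -> bytes:
    # Toy XOR keystream cipher for illustration only; it is not secure
    # and is not the cipher a real PDF writer uses.
    stream = b""
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + counter.to_bytes(4, "big")).digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))


plain = b"%PDF-1.7 example content"

# Step 1 (encrypt time): the identifier is fixed from the unencrypted content.
file_id = hashlib.md5(plain).digest()
key = derive_key(b"secret", file_id)

# Step 2 (write time): the content is encrypted with a key derived from file_id.
cipher = toy_encrypt(key, plain)

# The hash of the written bytes no longer matches the stored identifier.
id_matches_output = hashlib.md5(cipher).digest() == file_id
```

Computing the identifier from `cipher` instead would require the key, which in turn requires the identifier, so the dependency is circular.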