pypdf icon indicating copy to clipboard operation
pypdf copied to clipboard

MAINT: Simplify file identifiers generation

Open exiledkingcc opened this issue 11 months ago • 5 comments

exiledkingcc avatar Jul 22 '23 16:07 exiledkingcc

Codecov Report

Attention: 1 lines in your changes are missing coverage. Please review.

Comparison is base (ec85a27) 94.54% compared to head (40bb17f) 94.52%. Report is 1 commits behind head on main.

Files Patch % Lines
pypdf/_writer.py 90.00% 0 Missing and 1 partial :warning:
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2003      +/-   ##
==========================================
- Coverage   94.54%   94.52%   -0.02%     
==========================================
  Files          43       43              
  Lines        7549     7549              
  Branches     1490     1491       +1     
==========================================
- Hits         7137     7136       -1     
  Misses        253      253              
- Partials      159      160       +1     

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

codecov[bot] avatar Jul 22 '23 16:07 codecov[bot]

What impact do the file identifiers have? Who/what makes use of them?

MartinThoma avatar Jul 29 '23 09:07 MartinThoma

the PDF standard says:

The calculation of the file identifier need not be reproducible; all that matters is that the identifier is likely to be unique. For example, two implementations of the preceding algorithm might use different formats for the current time, causing them to produce different file identifiers for the same file created at the same time, but the uniqueness of the identifier is not affected.

the identifiers are also be used for encryption.

@MartinThoma so i think it's ok to make it simple.

exiledkingcc avatar Jul 29 '23 14:07 exiledkingcc

Having a deterministic way to generate PDFs is valuable to several developers. Does the current deterministic identifier generation cause any issues?

MartinThoma avatar Aug 13 '23 07:08 MartinThoma

first of all, it cost too much for big pdf files. and for aes encrypted pdf, it's not deterministic. when PdfWriter.encrypt called, the identifiers are genearated by uncrypted pdf stream, then PdfWriter.write called, the content of pdf file is encrypted, so the hash changed. for encrypted pdf, identifiers must be generated before write to stream, since the identifier will be used to calculate the key, so the identifiers cannot be the hash of pdf stream content.

exiledkingcc avatar Aug 14 '23 05:08 exiledkingcc