pypdf
pypdf copied to clipboard
BUG/ENH: make attachements compatible with kids, and allow list in RF
closes #2087 closes #2090
add also compatibility with RF (adding list)
still in progress
Codecov Report
Attention: 170 lines
in your changes are missing coverage. Please review.
Comparison is base (
5a2dd75
) 94.45% compared to head (ab96331
) 92.53%.
Additional details and impacted files
@@ Coverage Diff @@
## main #2197 +/- ##
==========================================
- Coverage 94.45% 92.53% -1.93%
==========================================
Files 43 43
Lines 7650 7930 +280
Branches 1511 1576 +65
==========================================
+ Hits 7226 7338 +112
- Misses 262 393 +131
- Partials 162 199 +37
Files | Coverage Δ | |
---|---|---|
pypdf/_protocols.py | 100.00% <ø> (ø) |
|
pypdf/_reader.py | 91.53% <100.00%> (-0.17%) |
:arrow_down: |
pypdf/constants.py | 100.00% <100.00%> (ø) |
|
pypdf/generic/__init__.py | 100.00% <ø> (ø) |
|
pypdf/generic/_base.py | 99.33% <71.42%> (-0.67%) |
:arrow_down: |
pypdf/_writer.py | 86.91% <58.62%> (-1.47%) |
:arrow_down: |
pypdf/generic/_data_structures.py | 80.62% <46.46%> (-11.29%) |
:arrow_down: |
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
@MartinThoma progressing well on this PR (still some test to ensure proper coverage) 😋
@MartinThoma I have a dilemma, The way attachments are handled has been misunterstood (and I think I have to plead guilty) for reference, information are mainly available in Pdf 1.7 reference §3.10 (p178) ; the attachements are stored in a name tree : there can not be any duplicates in most of the attachement is stored in EF key : in such case there is only one file/data if RF key are used, there is a list of file/data indexed by filenames (and in such cases, an EF file is also required)
therefore attachements should return a dictionary where keys are the names in the dictionnary the value should be an list of a single extended bytes with the following functions added: indirect_reference /property/ : will point to the File Specification Object name /property/ : returns the preferred file name. get_file_stream( subfile:str="") : return the embedded file stream structure associated with the subfile name (default is EF entry) within_page: (reserved for extension) return the page object containing the attachment if within an annotation else returns None
the point to output the return within a list is here to not change the interface. based on the better understanding of the spec, a list can not have more than on entry.
I'm sorry, I don't understand the dilemma. What are the options we have?
a list can not have more than on entry.
I don't understand. Can you give me a pseudo-code example?
I'm sorry, I don't understand the dilemma. What are the options we have?
a list can not have more than on entry.
I don't understand. Can you give me a pseudo-code example?
As the key in the name tree are uniques, the attachements can only return one file specification stored in /EF
field. therefore the list always have only one item which would be a bytes
object with some extended properties/functions.
/RF
entries (not originally adressed) should return a dictionary where the filenames are the keys. list are meaningless here too. this cannot be easily stored in a extended bytes
object.
my dilemma is: a) is it ok to change the API to return either bytes or dictionnary for the RF entries? a) should take opportunity to remove the overall useless list
we will get the following interface:
class AttachmentBytes(bytes):
...
class PdfReader: # the same for PdfWriter
...
@property
def attachments(self) -> Mapping[str, Union[AttachmentBytes, Dict[str, AttachmentBytes]]]:
...
The current signature is
def attachments(self) -> Mapping[str, List[bytes]]
What is AttachmentBytes
? In which way would you extend the bytes
object?
Changing the signature to def attachments(self) -> Mapping[str, bytes]
would be a breaking change and thus we cannot do this for a longer time. We want to have a 4.0 release next year, hence we could add an info-message right now + do the change with pypdf==4.0.0. I don't see a way to do this over multiple releases. Would that be your a)
?
@MartinThoma I've got a version available. can you give me your opinion before I complete coverage
@MartinThoma
+1 ?
@MartinThoma +1?
Sorry, I forgot about this :see_no_evil:
A bit of notes for myself: RF is short for "Related Files" and "EF" is short for "Embedded File"; see Table 44 – Entries in a file specification dictionary. It is an optional entry with a dictionary:
A dictionary with the same structure as the EF dictionary, which shall be present. Each key in the RF dictionary shall also be present in the EF dictionary. Each value shall be a related files array (see 7.11.4.2, "Related Files Arrays") identifying files that are related to the corresponding file in the EF dictionary. If this entry is present, the Type entry is required and the file specification dictionary shall be indirectly referenced.
This PR attempts to solve two issues:
- #2087:
PdfReader._list_attachments
only considers/Names -> /EmbeddedFiles -> /Names
, but needs to consider/Kids
as well - #2090: The order in which files are added via
add_attachment
is not following the specs.
I'd prefer to have a PR + tests which deal with #2087 first before we fix #2090 or do bigger refactorings. First dealing with #2087 would make the current PR smaller, right?