Improvements to attachment functions
From https://github.com/py-pdf/PyPDF2/discussions/1046, the following may be useful to cherry-pick if possible:
Improved embedded file handling (Rüdiger Jungbeck, rjungbeck):
- Allow attachment of more than 1 file with PdfFileWriter.addAttachment()
- Allow listing of attachments in PdfFileReader.listAttachments()
- Allow retrieval of attachment PdfFileReader.getAttachment()
Looking at the docs here, I think all of this would be new functionality, with probably the most useful being listAttachments and getAttachment? I'm not sure about the alteration to addAttachment, but perhaps make it a new function called add_attachments?
Could we make an attachment property exposing a dictionary (or something that looks like one)? What does the signature of getAttachments look like?
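To make the "something that looks like a dictionary" idea concrete, here is a rough sketch of a read-only mapping over the flat name/file-spec array. It is only an illustration: AttachmentsView is a hypothetical name, and plain Python dicts and bytes stand in for the PDF objects, so none of this is existing PyPDF2 API.

```python
from collections.abc import Mapping


class AttachmentsView(Mapping):
    """Read-only, dict-like view over a flat [name, filespec, ...] array.

    Hypothetical helper: plain Python dicts stand in for the PDF objects
    found under /Root -> /Names -> /EmbeddedFiles -> /Names.
    """

    def __init__(self, names_array):
        # Even positions hold names, odd positions hold file specifications.
        self._pairs = list(zip(names_array[0::2], names_array[1::2]))

    def __getitem__(self, name):
        for key, filespec in self._pairs:
            if key == name:
                # Stand-in for filespec["/EF"]["/F"].getData() in the reader.
                return filespec["/EF"]["/F"]
        raise KeyError(name)

    def __iter__(self):
        return (key for key, _ in self._pairs)

    def __len__(self):
        return len(self._pairs)


# Stand-in data; a real reader would pass the array from the PDF catalog.
names = [
    "a.txt", {"/F": "a.txt", "/EF": {"/F": b"alpha"}},
    "b.txt", {"/F": "b.txt", "/EF": {"/F": b"beta"}},
]
attachments = AttachmentsView(names)
print(list(attachments))
print(attachments["b.txt"])
```

Deriving from Mapping gives `in`, `.keys()`, `.items()` and `.get()` for free, so a property returning such a view could cover both listing and retrieval without two separate methods.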
The relevant code:
def getAttachment(self, name):
    for i in range (0, len(self.trailer["/Root"]["/Names"]["/EmbeddedFiles"]["/Names"]), 2):
        if self.trailer["/Root"]["/Names"]["/EmbeddedFiles"]["/Names"][i+1].getObject()["/F"]==name:
            return self.trailer["/Root"]["/Names"]["/EmbeddedFiles"]["/Names"][i+1].getObject()["/EF"]["/F"].getObject().getData()

def listAttachments(self):
    for i in range (0, len(self.trailer["/Root"]["/Names"]["/EmbeddedFiles"]["/Names"]), 2):
        yield self.trailer["/Root"]["/Names"]["/EmbeddedFiles"]["/Names"][i+1].getObject()["/F"]
https://github.com/Tristan79/PyPDF2/commit/02e234253d8da97946c8f5168ed339e791d9621d seems to be the commit. I'll ask that user if they want to open a PR (so that we attribute this well in the commit history)
@kevinl95 Would you be interested in leading the implementation of this change? Your PR #440 already goes in that direction, but because of the changes we have made since, you would need to rebase (and use other method names, or maybe a property).
This is the part that I refer to:
def getAttachments(self):
    """
    Retrieves the file attachments of the PDF as a dictionary of file names
    and the file data as a bytestring.
    :return: dictionary of filenames and bytestrings
    """
    catalog = self.trailer["/Root"]
    # From the catalog get the embedded file names
    fileNames = catalog['/Names']['/EmbeddedFiles']['/Names']
    attachments = {}
    # Loop through attachments
    for f in fileNames:
        if isinstance(f, str):
            name = f
            dataIndex = fileNames.index(f) + 1
            fDict = fileNames[dataIndex].getObject()
            fData = fDict['/EF']['/F'].getData()
            attachments[name] = fData
    return attachments
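One detail worth checking before adopting this: fileNames.index(f) returns the first position of a value, so if a file ever carried the same name twice, both iterations would read the first entry's data. The sketch below compares that lookup with a pairwise walk. Plain lists and bytes stand in for the PDF objects, and both function names are hypothetical, not part of PyPDF2.

```python
def collect_with_index(names_array):
    """Mimics the index-based lookup from the snippet above."""
    attachments = {}
    for item in names_array:
        if isinstance(item, str):
            # list.index always finds the FIRST occurrence of the name.
            attachments[item] = names_array[names_array.index(item) + 1]
    return attachments


def collect_attachments(names_array):
    """Map each name to a list of payloads, walking the array pairwise.

    Hypothetical helper; plain bytes stand in for the file data that the
    real code reads via fDict['/EF']['/F'].getData().
    """
    attachments = {}
    for name, data in zip(names_array[0::2], names_array[1::2]):
        # Keep every payload, so a repeated name does not drop data.
        attachments.setdefault(name, []).append(data)
    return attachments


# Stand-in array with a repeated name.
names = ["a.txt", b"first", "a.txt", b"second", "b.txt", b"third"]

by_index = collect_with_index(names)
by_pairs = collect_attachments(names)
print(by_index)
print(by_pairs)
```

With the repeated name, the index-based version never sees b"second", while the pairwise version keeps both payloads. Whether a list per name is the right return type is a design question for the PR; the sketch only shows the difference in behaviour.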