
Improvements to attachment functions

Open MasterOdin opened this issue 2 years ago • 4 comments

From https://github.com/py-pdf/PyPDF2/discussions/1046, the following may be useful to cherry-pick if possible:

Improved embedded file handling (Rüdiger Jungbeck, rjungbeck):

- Allow attachment of more than 1 file with PdfFileWriter.addAttachment()
- Allow listing of attachments in PdfFileReader.listAttachments()
- Allow retrieval of an attachment with PdfFileReader.getAttachment()

Looking at the docs here, I think all of this would be new functionality, with probably the most useful being listAttachments and getAttachment? I'm not sure about the alteration to addAttachment, but perhaps make it a new function called add_attachments?

MasterOdin, Jul 01 '22 04:07

Could we make an attachment property exposing a dictionary (or something that looks like one)? What does the signature of getAttachments look like?
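
To make the "something that looks like a dictionary" idea concrete, here is a rough sketch of a read-only mapping over the embedded-files name array. It is not pypdf's API: `AttachmentMapping` and the sample data are hypothetical, plain dicts and bytes stand in for the PDF objects, and attachment names are assumed to be unique.

```python
from collections.abc import Mapping


class AttachmentMapping(Mapping):
    """Read-only, dict-like view over a flat [name, filespec, ...] list.

    Hypothetical sketch: plain dicts and bytes stand in for PDF objects.
    """

    def __init__(self, names):
        self._names = names

    def _pairs(self):
        # Even positions hold names, odd positions hold file specs.
        return zip(self._names[0::2], self._names[1::2])

    def __getitem__(self, name):
        for key, filespec in self._pairs():
            if key == name:
                # In a real reader this would be the embedded stream's data.
                return filespec["/EF"]["/F"]
        raise KeyError(name)

    def __iter__(self):
        for key, _ in self._pairs():
            yield key

    def __len__(self):
        return len(self._names) // 2


sample = AttachmentMapping([
    "notes.txt", {"/EF": {"/F": b"hello"}},
    "data.bin", {"/EF": {"/F": b"\x00\x01"}},
])
```

Subclassing `Mapping` gives `in`, `.get()`, `.keys()` and `.items()` for free, so a reader property could return such an object and callers would treat it like a dict without the data being loaded up front.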

MartinThoma, Jul 01 '22 11:07

The relevant code:

    def getAttachment(self, name):
        # The /Names array alternates name, file spec, name, file spec, ...
        names = self.trailer["/Root"]["/Names"]["/EmbeddedFiles"]["/Names"]
        for i in range(0, len(names), 2):
            fileSpec = names[i + 1].getObject()
            if fileSpec["/F"] == name:
                return fileSpec["/EF"]["/F"].getObject().getData()
        # No attachment with that name
        return None

    def listAttachments(self):
        names = self.trailer["/Root"]["/Names"]["/EmbeddedFiles"]["/Names"]
        for i in range(0, len(names), 2):
            yield names[i + 1].getObject()["/F"]
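
For context, both loops step by two because the embedded-files name array is flat and alternates names with file specifications. A minimal sketch of that pairing, using plain lists and dicts as stand-ins for the PDF objects (so no `getObject()` calls are needed); the helper name `iter_name_pairs` and the sample data are made up for illustration:

```python
def iter_name_pairs(names):
    """Yield (key, value) pairs from a flat [k0, v0, k1, v1, ...] list."""
    if len(names) % 2 != 0:
        raise ValueError("expected an even number of entries")
    # Even positions hold the keys, odd positions hold the values.
    yield from zip(names[0::2], names[1::2])


# Stand-in for trailer["/Root"]["/Names"]["/EmbeddedFiles"]["/Names"]
sample_names = [
    "a.txt", {"/F": "a.txt", "/EF": {"/F": b"alpha"}},
    "b.csv", {"/F": "b.csv", "/EF": {"/F": b"x,y\n1,2\n"}},
]

# Same result listAttachments() would produce for this stand-in data.
listed = [spec["/F"] for _, spec in iter_name_pairs(sample_names)]
```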

MasterOdin, Jul 02 '22 15:07

https://github.com/Tristan79/PyPDF2/commit/02e234253d8da97946c8f5168ed339e791d9621d seems to be the commit. I'll ask that user if they want to open a PR (so that we attribute this well in the commit history)

MartinThoma, Jul 03 '22 19:07

@kevinl95 Would you be interested in leading the implementation of this change? Your PR #440 already goes in that direction, but due to the changes we made you would need to rebase (and use other method names / maybe a property).

This is the part that I refer to:

    def getAttachments(self):
        """
        Retrieves the file attachments of the PDF as a dictionary of file names
        and the file data as a bytestring.
        :return: dictionary of filenames and bytestrings
        """
        catalog = self.trailer["/Root"]
        # From the catalog get the embedded file names
        fileNames = catalog['/Names']['/EmbeddedFiles']['/Names']
        attachments = {}
        # Loop through attachments; each name is followed by its file spec.
        # enumerate() avoids list.index(), which returns the first match
        # and so picks the wrong entry when a name appears twice.
        for dataIndex, f in enumerate(fileNames, start=1):
            if isinstance(f, str):
                name = f
                fDict = fileNames[dataIndex].getObject()
                fData = fDict['/EF']['/F'].getData()
                attachments[name] = fData
        return attachments

MartinThoma, Jul 05 '22 20:07