pypdf icon indicating copy to clipboard operation
pypdf copied to clipboard

cloneDocumentFromReader produces a document with all blank pages

Open whitemice opened this issue 9 years ago • 10 comments

Perhaps I am misunderstanding what cloneDocumentFromReader does... but the following code produces a document of all blank pages.

    signal.signal(signal.SIGALRM, timeout_alarm_handler)
    signal.alarm(15)

    reader = PdfFileReader(self.rfile, strict=False, )
    writer = PdfFileWriter()

    #writer.cloneReaderDocumentRoot(reader)
    try:
        writer.cloneDocumentFromReader(
            reader
        )
    except TimeOutAlarm:
        raise BurstingTimeOutException
    else:
        writer.write(self.wfile)
    writer = None

whitemice avatar Aug 07 '15 14:08 whitemice

Example document @ http://www.wmmi.net/documents/Bugs/PyPDF-114xSizeAfterRotation-Example.pdf

Cannot upload a document of PDF to issue

whitemice avatar Aug 07 '15 14:08 whitemice

For the time being, when you need to clone the document down to the page structure (any page trees are preserved), use cloneReaderDocumentRoot. If keeping the underlying page structure is not important, use appendPagesFromReader to compact the page structure to a single page list with many page references.

I think that cloneDocumentFromReader does not work like expected because cloneDocumentFromReader uses both of the above methods, which causes a conflict when trying to write the document. This conflict occurs because the writer, when looking for all of the external references gets caught up in two different versions of the same object with the same version history, tricking the writer into thinking that the content has already been loaded when it has not been.

If you open the PDF output of your program in a text editor, you will notice that there are many references to indirect objects that are never defined within the document.

dhudson1 avatar Aug 13 '15 22:08 dhudson1

I have a potential solution for cloneReaderDocumentRoot (type, or copy-and-paste, this in under the definition for cloneReaderDocumentRoot and see if it does what you want):

        '''
        Create a copy (clone) of a document from a PDF file reader

        :param reader: PDF file reader instance from which the clone
            should be created.
        :callback after_page_append (function): Callback function that is invoked after
            each page is appended to the writer. Signature includes a reference to the
            appended page (delegates to appendPagesFromReader). Callback signature:

            :param writer_pageref (PDF page reference): Reference to the page just
                appended to the document.
        '''
        debug = False
        if debug:
            print("Number of Objects: %d" % len(self._objects))
            for obj in self._objects:
                print("\tObject is %r" % obj)
                if hasattr(obj, "indirectRef") and obj.indirectRef != None:
                    print("\t\tObject's reference is %r %r, at PDF %r" % (obj.indirectRef.idnum, obj.indirectRef.generation, obj.indirectRef.pdf))

        # Variables used for after cloning the root to
        # improve pre- and post- cloning experience

        mustAddTogether = False
        newInfoRef = self._info
        oldPagesRef = self._pages
        oldPages = self.getObject(self._pages)

        # If there have already been any number of pages added

        if oldPages[NameObject("/Count")] > 0:

            # Keep them

            mustAddTogether = True
        else:

            # Through the page object out

            if oldPages in self._objects:
                newInfoRef = self._pages
                self._objects.remove(oldPages)

        # Clone the reader's root document

        self.cloneReaderDocumentRoot(reader)
        if not self._root:
            self._root = self._addObject(self._root_object)

        # Sweep for all indirect references

        externalReferenceMap = {}
        self.stack = []
        newRootRef = self._sweepIndirectReferences(externalReferenceMap, self._root)

        # Delete the stack to reset

        del self.stack

        #Clean-Up Time!!!

        # Get the new root of the PDF

        realRoot = self.getObject(newRootRef)

        # Get the new pages tree root and its ID Number

        tmpPages = realRoot[NameObject("/Pages")]
        newIdNumForPages = 1 + self._objects.index(tmpPages)

        # Make an IndirectObject just for the new Pages

        self._pages = IndirectObject(newIdNumForPages, 0, self)

        # If there are any pages to add back in

        if mustAddTogether:

            # Set the new page's root's parent to the old
            # page's root's reference

            tmpPages[NameObject("/Parent")] = oldPagesRef

            # Add the reference to the new page's root in
            # the old page's kids array

            newPagesRef = self._pages
            oldPages[NameObject("/Kids")].append(newPagesRef)

            # Set all references to the root of the old/new
            # page's root

            self._pages = oldPagesRef
            realRoot[NameObject("/Pages")] = oldPagesRef

            # Update the count attribute of the page's root

            oldPages[NameObject("/Count")] = NumberObject(oldPages[NameObject("/Count")] + tmpPages[NameObject("/Count")])

        else:

            # Bump up the info's reference b/c the old
            # page's tree was bumped off

            self._info = newInfoRef

dhudson1 avatar Aug 14 '15 22:08 dhudson1

I think dhudson1's code post was meant to be cloneDocumentFromReader and not cloneReaderDocumentRoot . I tried it there and it seemed to work perfectly, but it alerted me to another problem that I am having.

LightningMan711 avatar May 23 '16 16:05 LightningMan711

Unfortunately this solution only clones the presentation of the document. The document is rebuilt, loosing the past document versions. A very useful clone will be a raw clone byte by byte of the document, and make that subsequent modifications generate a new version.

That will also bring the library the power to preserve the digital signatures for past versions.

Also this kind of problems would be avoided.

This would also make the library stronger, because it could preserve even objects that it can not parse totally. As long as I have seen, it seems that all this kind of libraries work parsing all the content with the reader to its own DOM and then the writer knows how to write them.

I mean the writer doesn't need to handle the semantic of all the existent objects in past versions, it can just include references to them, even when the reader doesn't know the semantic of the object. The object and past versions would be truly cloned.

Am I missing something ?

yucer avatar Sep 22 '16 16:09 yucer

Anyway, now it works. Thanks. :-)

Step by step...

Any plan to review / merge this change ?

yucer avatar Sep 22 '16 17:09 yucer

I also see this problem with cloneDocumentFromReader. When I use cloneReaderDocumentRoot instead, I get a PDF with No pages.

benwiggy avatar Apr 21 '19 15:04 benwiggy

I also see this problem with cloneDocumentFromReader. When I use cloneReaderDocumentRoot instead, I get a PDF with No pages.

so the same

livizy avatar Jan 12 '21 01:01 livizy

@chickendiver I've never used that function. Could you give a minimal example on how you would like to use it?

MartinThoma avatar Jul 05 '22 19:07 MartinThoma

Interesting: https://stackoverflow.com/search?q=cloneDocumentFromReader

So we have cloneDocumentFromReader, appendPagesFromReader and cloneReaderDocumentRoot and all 3 don't work properly?

MartinThoma avatar Jul 05 '22 20:07 MartinThoma

I've just re-tried with the dev version (2.10.5?)

from PyPDF2 import PdfReader, PdfWriter

reader = PdfReader("c:/PyPDF-114xSizeAfterRotation-Example.pdf")
writer = PdfWriter()

with open("e:/tt.pdf", "wb") as f:
    writer.write(f)

and the PDF file generated is good. The problem seems solved.

pubpub-zz avatar Sep 03 '22 12:09 pubpub-zz

Amazing :rocket: Thank you for letting me know @pubpub-zz :heart:

MartinThoma avatar Sep 06 '22 19:09 MartinThoma

The solution for me needed one extra line compared to the one given by @pubpub-zz

from PyPDF2 import PdfReader, PdfWriter

reader = PdfReader("c:/PyPDF-114xSizeAfterRotation-Example.pdf")
writer = PdfWriter()

writer.clone_document_from_reader(reader)

with open("e:/tt.pdf", "wb") as f:
    writer.write(f)

johnf1004 avatar Sep 15 '22 17:09 johnf1004

oups! But still good, isn't-it ?

pubpub-zz avatar Sep 15 '22 18:09 pubpub-zz

For me, it still just outputs a blank deck... until I used the clone_document_from_reader

johnf1004 avatar Sep 15 '22 22:09 johnf1004