pypdf
pypdf copied to clipboard
cloneDocumentFromReader produces a document with all blank pages
Perhaps I am misunderstanding what cloneDocumentFromReader does... but the following code produces a document of all blank pages.
signal.signal(signal.SIGALRM, timeout_alarm_handler)
signal.alarm(15)
reader = PdfFileReader(self.rfile, strict=False, )
writer = PdfFileWriter()
#writer.cloneReaderDocumentRoot(reader)
try:
writer.cloneDocumentFromReader(
reader
)
except TimeOutAlarm:
raise BurstingTimeOutException
else:
writer.write(self.wfile)
writer = None
Example document @ http://www.wmmi.net/documents/Bugs/PyPDF-114xSizeAfterRotation-Example.pdf
Cannot upload a document of PDF to issue
For the time being, when you need to clone the document down to the page structure (any page trees are preserved), use cloneReaderDocumentRoot
. If keeping the underlying page structure is not important, use appendPagesFromReader
to compact the page structure to a single page list with many page references.
I think that cloneDocumentFromReader
does not work like expected because cloneDocumentFromReader
uses both of the above methods, which causes a conflict when trying to write the document. This conflict occurs because the writer, when looking for all of the external references gets caught up in two different versions of the same object with the same version history, tricking the writer into thinking that the content has already been loaded when it has not been.
If you open the PDF output of your program in a text editor, you will notice that there are many references to indirect objects that are never defined within the document.
I have a potential solution for cloneReaderDocumentRoot
(type, or copy-and-paste, this in under the definition for cloneReaderDocumentRoot
and see if it does what you want):
'''
Create a copy (clone) of a document from a PDF file reader
:param reader: PDF file reader instance from which the clone
should be created.
:callback after_page_append (function): Callback function that is invoked after
each page is appended to the writer. Signature includes a reference to the
appended page (delegates to appendPagesFromReader). Callback signature:
:param writer_pageref (PDF page reference): Reference to the page just
appended to the document.
'''
debug = False
if debug:
print("Number of Objects: %d" % len(self._objects))
for obj in self._objects:
print("\tObject is %r" % obj)
if hasattr(obj, "indirectRef") and obj.indirectRef != None:
print("\t\tObject's reference is %r %r, at PDF %r" % (obj.indirectRef.idnum, obj.indirectRef.generation, obj.indirectRef.pdf))
# Variables used for after cloning the root to
# improve pre- and post- cloning experience
mustAddTogether = False
newInfoRef = self._info
oldPagesRef = self._pages
oldPages = self.getObject(self._pages)
# If there have already been any number of pages added
if oldPages[NameObject("/Count")] > 0:
# Keep them
mustAddTogether = True
else:
# Through the page object out
if oldPages in self._objects:
newInfoRef = self._pages
self._objects.remove(oldPages)
# Clone the reader's root document
self.cloneReaderDocumentRoot(reader)
if not self._root:
self._root = self._addObject(self._root_object)
# Sweep for all indirect references
externalReferenceMap = {}
self.stack = []
newRootRef = self._sweepIndirectReferences(externalReferenceMap, self._root)
# Delete the stack to reset
del self.stack
#Clean-Up Time!!!
# Get the new root of the PDF
realRoot = self.getObject(newRootRef)
# Get the new pages tree root and its ID Number
tmpPages = realRoot[NameObject("/Pages")]
newIdNumForPages = 1 + self._objects.index(tmpPages)
# Make an IndirectObject just for the new Pages
self._pages = IndirectObject(newIdNumForPages, 0, self)
# If there are any pages to add back in
if mustAddTogether:
# Set the new page's root's parent to the old
# page's root's reference
tmpPages[NameObject("/Parent")] = oldPagesRef
# Add the reference to the new page's root in
# the old page's kids array
newPagesRef = self._pages
oldPages[NameObject("/Kids")].append(newPagesRef)
# Set all references to the root of the old/new
# page's root
self._pages = oldPagesRef
realRoot[NameObject("/Pages")] = oldPagesRef
# Update the count attribute of the page's root
oldPages[NameObject("/Count")] = NumberObject(oldPages[NameObject("/Count")] + tmpPages[NameObject("/Count")])
else:
# Bump up the info's reference b/c the old
# page's tree was bumped off
self._info = newInfoRef
I think dhudson1's code post was meant to be cloneDocumentFromReader and not cloneReaderDocumentRoot . I tried it there and it seemed to work perfectly, but it alerted me to another problem that I am having.
Unfortunately this solution only clones the presentation of the document. The document is rebuilt, loosing the past document versions. A very useful clone will be a raw clone byte by byte of the document, and make that subsequent modifications generate a new version.
That will also bring the library the power to preserve the digital signatures for past versions.
Also this kind of problems would be avoided.
This would also make the library stronger, because it could preserve even objects that it can not parse totally. As long as I have seen, it seems that all this kind of libraries work parsing all the content with the reader to its own DOM and then the writer knows how to write them.
I mean the writer doesn't need to handle the semantic of all the existent objects in past versions, it can just include references to them, even when the reader doesn't know the semantic of the object. The object and past versions would be truly cloned.
Am I missing something ?
Anyway, now it works. Thanks. :-)
Step by step...
Any plan to review / merge this change ?
I also see this problem with cloneDocumentFromReader
. When I use cloneReaderDocumentRoot
instead, I get a PDF with No pages.
I also see this problem with
cloneDocumentFromReader
. When I usecloneReaderDocumentRoot
instead, I get a PDF with No pages.
so the same
@chickendiver I've never used that function. Could you give a minimal example on how you would like to use it?
Interesting: https://stackoverflow.com/search?q=cloneDocumentFromReader
So we have cloneDocumentFromReader, appendPagesFromReader and cloneReaderDocumentRoot and all 3 don't work properly?
I've just re-tried with the dev version (2.10.5?)
from PyPDF2 import PdfReader, PdfWriter
reader = PdfReader("c:/PyPDF-114xSizeAfterRotation-Example.pdf")
writer = PdfWriter()
with open("e:/tt.pdf", "wb") as f:
writer.write(f)
and the PDF file generated is good. The problem seems solved.
Amazing :rocket: Thank you for letting me know @pubpub-zz :heart:
The solution for me needed one extra line compared to the one given by @pubpub-zz
from PyPDF2 import PdfReader, PdfWriter
reader = PdfReader("c:/PyPDF-114xSizeAfterRotation-Example.pdf")
writer = PdfWriter()
writer.clone_document_from_reader(reader)
with open("e:/tt.pdf", "wb") as f:
writer.write(f)
oups! But still good, isn't-it ?
For me, it still just outputs a blank deck... until I used the clone_document_from_reader