get_fields returns None on PDFs saved with PyPDF2

Open Vigrond opened this issue 1 year ago • 2 comments

After PyPDF2 writes a PDF that has form fields, reader.get_fields() no longer returns them. The newly saved PDF still shows the fields in a viewer such as Acrobat or a browser, but PyPDF2's get_fields() returns None.

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
Linux-5.15.0-41-generic-x86_64-with-glibc2.35

$ python -c "import PyPDF2;print(PyPDF2.__version__)"
2.10.3

Code + PDF

This is a minimal, complete example that shows the issue:

from PyPDF2 import PdfReader, PdfWriter

def form_has_fields(form):
    reader = PdfReader(form)
    fields = reader.get_fields()
    
    is_fields = fields is not None
    print(f'get_fields for {form} is not None? : {is_fields}')
    
    return is_fields

def write_pdf(form, destination=None):
    reader = PdfReader(form)
    writer = PdfWriter()
    
    for page in reader.pages:
        writer.add_page(page)

    if destination:
        with open(destination, "wb") as output_stream:
            writer.write(output_stream)

pdf = "./field_test/f1040.pdf"
pdf_saved = "./field_test/f1040-saved.pdf"

form_has_fields(pdf)
write_pdf(pdf, pdf_saved)
form_has_fields(pdf_saved)

OUTPUT:

get_fields for ./field_test/f1040.pdf is not None? : True
get_fields for ./field_test/f1040-saved.pdf is not None? : False

PDF used: https://www.irs.gov/pub/irs-prior/f1040--2021.pdf

Vigrond, Aug 24 '22 22:08

Shortly after creating this issue, I discovered the clone_document_from_reader method that solves this issue.

Instead of adding pages to a writer, you may do this:

 writer.clone_document_from_reader(reader)

OUTPUT:

get_fields for ./field_test/f1040.pdf is not None? : True
get_fields for ./field_test/f1040-saved.pdf is not None? : True
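
For reference, here is a minimal sketch of write_pdf rewritten around that call. The helper name clone_and_write is hypothetical; it assumes reader and writer are PyPDF2 PdfReader and PdfWriter instances, as in the example above.

```python
def clone_and_write(reader, writer, destination):
    """Clone the whole document into the writer, then save it.

    Assumption: `reader` is a PyPDF2 PdfReader and `writer` a PyPDF2
    PdfWriter. The helper name itself is hypothetical.
    """
    # Copy the document as a whole instead of adding pages one by one.
    writer.clone_document_from_reader(reader)
    with open(destination, "wb") as output_stream:
        writer.write(output_stream)
```

With the names from the example above, this would be called as clone_and_write(PdfReader(pdf), PdfWriter(), pdf_saved).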

Vigrond, Aug 24 '22 22:08

I would like to reopen this issue because some other unexpected behaviors happen:

  • clone_document_from_reader and clone_reader_document_root both, when written to a file, produce files where get_fields works. However, when such a file is opened in Adobe Reader, you get the message: "This document enabled extended features in Adobe Acrobat Reader. The document has been changed since it was created and use of extended features is no longer available." Adobe Reader then disables all fields in the document, making it uneditable. It also appears to load an older version of the document, or old data.
  • appendPagesFromReader, when written to a file, produces a PDF where get_fields returns None, but Adobe Reader works normally, displaying editable fields with the correct data and no error messages.

Vigrond, Aug 25 '22 02:08

I was tinkering with document cloning and this is what I found (I'm no expert so if I'm wrong, please correct me):

I found that it wasn't working at all. clone_reader_document_root() only copies a reference to the root node; the problem is that self._objects isn't copied (the reader doesn't load all the objects), so when the root is added as an indirect object (_add_object()) while writing the PDF, it reuses the wrong ID (one that is already in use by the reader). To clone the reader, one would also need to clone all of the reader's objects (modulo things like pages, ...) and store them in self._objects.
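
To illustrate one reading of that explanation, here is a toy model of the ID mix-up. It is only a sketch: ToyWriter, its methods, and the sample objects are hypothetical and are not PyPDF2's actual internals. It shows how an object number taken over from the reader can resolve to an unrelated object once the writer numbers its own objects.

```python
# Toy model only: hypothetical classes, not PyPDF2's real implementation.

class ToyWriter:
    def __init__(self):
        self._objects = []          # object number n lives at index n - 1

    def add_object(self, obj):
        self._objects.append(obj)
        return len(self._objects)   # number assigned by this writer

    def get_object(self, number):
        return self._objects[number - 1]


# Objects as numbered in the source file (what a reader would see).
reader_objects = {1: "reader catalog", 2: "reader pages", 3: "reader form"}
reader_root_number = 1

writer = ToyWriter()
writer.add_object("writer pages")   # becomes object 1 in the writer
writer.add_object("writer info")    # becomes object 2 in the writer

# Copying only the reference keeps the reader's number, which the writer
# has already handed out to a different object.
stale_root = reader_root_number
resolved_stale = writer.get_object(stale_root)

# Copying the object itself lets the writer assign a fresh number.
fresh_root = writer.add_object(reader_objects[reader_root_number])
resolved_fresh = writer.get_object(fresh_root)

print(resolved_stale)
print(resolved_fresh)
```

In this model the stale reference points at the writer's own pages object, while the copied object gets a new number and resolves to the catalog as intended.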

The problem with forms is that the form from the reader isn't copied to the writer (I think there is also a bug in https://github.com/py-pdf/PyPDF2/blob/main/PyPDF2/_writer.py#L242, where ._add_object() should be used; otherwise it would reference some "random" object).
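
A quick way to check whether the form made it into a saved file is to look at the document catalog. The sketch below assumes that get_fields() reads the form from the catalog's /AcroForm entry and that reader is a PyPDF2 PdfReader; the helper name catalog_has_acroform is hypothetical.

```python
def catalog_has_acroform(reader):
    """Return True when the document catalog carries an /AcroForm entry.

    Assumption: `reader` is a PyPDF2 PdfReader, whose trailer exposes the
    catalog under the "/Root" key.
    """
    return "/AcroForm" in reader.trailer["/Root"]
```

Running it on both the original and the saved PDF would show whether the entry was lost during writing.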

Trigve, Oct 26 '22 14:10

Talking about cloning, you might be interested in https://github.com/py-pdf/PyPDF2/pull/1371 by @pubpub-zz :-)

MartinThoma, Oct 26 '22 16:10

Cloning support would be really useful for cloning the form dictionary from the original PDF (or the whole PDF if needed).

I've looked at the PR and have some questions. Could I ask them directly in the PR, or in a discussion?

Trigve, Oct 26 '22 18:10