Updated pdf fields don't show up when page is written

Open segevmalool opened this issue 7 years ago • 51 comments

I'd like to use PyPDF2 to fill out a pdf form. So far, everything is going smoothly, including updating the field text. But when I write the pdf to a file, there is apparently no change in the form. Running this code:

import datetime as dt
from PyPDF2 import PdfFileReader, PdfFileWriter
import re

form701 = PdfFileReader('ABC701LG.pdf')
page = form701.getPage(0)
filled = PdfFileWriter()

#removing extraneous fields
r = re.compile('^[0-9]')
fields = sorted(list(filter(r.match, form701.getFields().keys())), key = lambda x: int(x[0:2]))

filled.addPage(page)
filled.updatePageFormFieldValues(filled.getPage(0), 
                                 {fields[0]: 'some filled in text'})

print(filled.getPage(0)['/Annots'][0].getObject()['/T'])
print(filled.getPage(0)['/Annots'][0].getObject()['/V'])

with open('test.pdf','wb') as fp:
    filled.write(fp)

prints text:

1 EFFECTIVE DATE OF THIS SCHEDULE <i.e. the field name> some filled in text

But when I open up test.pdf, there is no added text on the page! Help!

segevmalool avatar Jun 13 '17 17:06 segevmalool

I am having this same issue. The data does not show up in Adobe Reader unless you activate the field. The data does show up in Bluebeam but if you print, flatten, or push the pdf to a studio session all the data is lost.

When the file is opened in Bluebeam it automatically thinks that the user has made changes, denoted by the asterisk next to the file name in the tab.

If you export the fdf file from Bluebeam all the data is in the fdf file in the proper place.

If you change any attribute of the field in Bluebeam or Adobe, it will recognize the text in that field. It will print correctly and flatten correctly. I am not sure if it will push to the Bluebeam studio but I assume it will. You can also just copy and paste the text in the field back into that field and it will render correctly.

I have not found any help after googling around all day. I think it is an issue with PyPDF2 not "redrawing" the PDF correctly.

I have contacted Bluebeam support and they have returned saying essentially that it is not on their end.

mwhit74 avatar Aug 22 '17 20:08 mwhit74

Ok I think I have narrowed this down some by just comparing two different pdfs.

For reference, I am trying to read a pdf that was originally created by Bluebeam, use the updatePageFormFieldValues() function in PyPDF2 to push a bunch of data from a database into the form fields, and save. At some point we want to flatten these, and that is when it all goes wrong in Bluebeam. In Adobe it is messed up from the start, in that you don't see any values in the form fields until you scroll over them with the mouse.

It appears there is a problem with the stream object that follows the object(s) representing the text form field. See below.

This is a sample output from a pdf generated by PyPDF2 for a text form field:

26 0 obj<</Subtype/Widget/M(D:20160512102729-05'00')/NM(OEGVASQHFKGZPSZW)/MK<</IF<</A[0 0]>>>>/F 4/C[1 0 0]/Rect[227.157 346.3074 438.2147 380.0766]/V(Marshall CYG)/Type/Annot/FT/Tx/AP<</N 27 0 R>>/DA(0 0 0 rg /Helv 12 Tf)/T(Owner Group)/BS 29 0 R/Q 0/P 3 0 R>>
endobj
27 0 obj<</Type/XObject/Matrix[1 0 0 1 0 0]/Resources<</ProcSet[/PDF/Text]/Font<</Helv 28 0 R>>>>/Length 41/FormType 1/BBox[0 0 211.0577 33.76923]/Subtype/Form>>
stream
0 0 211.0577 33.76923 re W n /Tx BMC EMC 
endstream
endobj
28 0 

And if I back up and edit the same base file in Bluebeam, the output from that pdf for a text form field looks like this (I think the border object can be ignored):

16 0 obj<</Type/Annot/P 5 0 R/F 4/C[1 0 0]/Subtype/Widget/Q 0/FT/Tx/T(Owner Group)/MK<</IF<</A[0 0]>>>>/DA(0 0 0 rg /Helv 12 Tf)/AP<</N 18 0 R>>/M(D:20170906125217-05'00')/Rect[227.157 346.3074 438.2147 380.0766]/NM(OEGVASQHFKGZPSZW)/BS 17 0 R/V(Marshall CYG)>>
endobj
17 0 obj<</W 1/S/S/Type/Border>>
endobj
18 0 obj<</Type/XObject/Subtype/Form/FormType 1/BBox[0 0 211.0577 33.7692]/Resources<</ProcSet[/PDF/Text]/Font<</Helv 12 0 R>>>>/Matrix[1 0 0 1 0 0]/Length 106>>
stream
0 0 211.0577 33.7692 re W n /Tx BMC BT 0 0 0 rg /Helv 12 Tf 1 0 0 1 2 12.6486 Tm (Marshall CYG) Tj ET EMC 
endstream

Ok so the biggest difference here is the stream object at the end. The value /V(Marshall CYG) gets updated in the first object of each pdf, objects 26 and 16 respectively. However the stream object in the PyPDF2 generated pdf does not get updated and the stream object from Bluebeam does get updated.
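
The difference described above can be checked mechanically. The following is a minimal sketch, not part of PyPDF2: it assumes you have already extracted the decoded content of a field's appearance stream as bytes, and the function name text_appearance_is_empty is made up for illustration. It only looks at whether anything sits between /Tx BMC and EMC.

```python
import re

# Matches the marked-content section that holds a text field's drawn contents.
_TX_SECTION = re.compile(rb"/Tx\s+BMC(.*?)EMC", re.DOTALL)


def text_appearance_is_empty(stream_content: bytes) -> bool:
    """Return True if the /Tx BMC ... EMC section holds no drawing operators."""
    match = _TX_SECTION.search(stream_content)
    if match is None:
        # No text-field section at all, so no field text would be drawn either.
        return True
    return match.group(1).strip() == b""


# The two streams quoted above, copied verbatim.
pypdf2_stream = b"0 0 211.0577 33.76923 re W n /Tx BMC EMC "
bluebeam_stream = (
    b"0 0 211.0577 33.7692 re W n /Tx BMC BT 0 0 0 rg /Helv 12 Tf "
    b"1 0 0 1 2 12.6486 Tm (Marshall CYG) Tj ET EMC "
)

print(text_appearance_is_empty(pypdf2_stream))    # True
print(text_appearance_is_empty(bluebeam_stream))  # False
```

This is only a diagnostic: it would misfire on a field value that itself contains the text "EMC", and it does not handle compressed streams.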

In testing this theory I made a copy of the PyPDF2 pdf and manually edited the stream object in a text editor. I opened this new file in Bluebeam and flattened it. It worked. This also appears to work in Adobe Reader.
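
For anyone wanting to repeat the manual edit, here is a rough sketch of how the stream content could be rebuilt from the field value. It is an assumption-laden illustration: the helper names are hypothetical, the text offset (2, 12.6486) is simply the one Bluebeam wrote in the dump above rather than something computed from font metrics, and the default appearance string is the /DA value from the same dump.

```python
def pdf_number(value: float) -> str:
    """Format a number compactly, e.g. 12.0 -> '12', 33.7692 -> '33.7692'."""
    return f"{value:.4f}".rstrip("0").rstrip(".")


def escape_pdf_string(text: str) -> str:
    """Escape the characters that are special inside a PDF literal string."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def build_text_appearance(value, bbox, default_appearance="0 0 0 rg /Helv 12 Tf",
                          x_offset=2.0, baseline=12.6486):
    """Build appearance-stream content that clips to bbox and draws value."""
    left, bottom, right, top = bbox
    # The "re" operator takes x, y, width, height.
    rect = " ".join(pdf_number(n) for n in (left, bottom, right - left, top - bottom))
    return (
        f"{rect} re W n /Tx BMC BT {default_appearance} "
        f"1 0 0 1 {pdf_number(x_offset)} {pdf_number(baseline)} Tm "
        f"({escape_pdf_string(value)}) Tj ET EMC"
    )


stream = build_text_appearance("Marshall CYG", [0, 0, 211.0577, 33.7692])
print(stream)
```

With the values from the dump this reproduces the Bluebeam stream text. If you paste such content into a file by hand, the stream's /Length entry has to match the new content as well.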

Now how to fix....

mwhit74 avatar Sep 06 '17 18:09 mwhit74

A potential solution seems to be setting the Need Appearances flag. Not yet sure how to implement in pypdf2 but these 2 links may provide some clues: https://stackoverflow.com/questions/12198742/pdf-form-text-hidden-unless-clicked https://forums.adobe.com/thread/305250

ademidun avatar Dec 16 '17 04:12 ademidun

Okay, I think I have figured it out. If you read section 12.7.2 (page 431) of the PDF 1.7 specification, you will see that you need to set the NeedAppearances flag of the AcroForm.

reader = PdfFileReader(open(infile, "rb"), strict=False)

if "/AcroForm" in reader.trailer["/Root"]:
    reader.trailer["/Root"]["/AcroForm"].update(
        {NameObject("/NeedAppearances"): BooleanObject(True)}
    )
writer = PdfFileWriter()

if "/AcroForm" in writer._root_object:
    writer._root_object["/AcroForm"].update(
        {NameObject("/NeedAppearances"): BooleanObject(True)}
    )
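
To make the logic of the snippet above easier to follow, here is a library-free sketch that uses plain dicts in place of PyPDF2 objects (so the keys are ordinary strings rather than NameObject, and True stands in for BooleanObject(True)). The function name set_need_appearances is hypothetical. The one difference from the guarded `if "/AcroForm" in ...` form is that this variant creates the /AcroForm dictionary when it is missing instead of silently doing nothing.

```python
def set_need_appearances(catalog: dict) -> dict:
    """Set /NeedAppearances on the catalog's /AcroForm, creating it if absent."""
    acro_form = catalog.setdefault("/AcroForm", {})
    acro_form["/NeedAppearances"] = True
    return catalog


# A catalog that already carries a form dictionary ...
reader_catalog = {"/Type": "/Catalog", "/AcroForm": {"/Fields": []}}
# ... and one that does not.
writer_catalog = {"/Type": "/Catalog"}

set_need_appearances(reader_catalog)
set_need_appearances(writer_catalog)
print(writer_catalog["/AcroForm"])  # {'/NeedAppearances': True}
```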

ademidun avatar Dec 20 '17 23:12 ademidun

ademidun - Can you elaborate on your suggested solution above? I too am having problems with pdf forms, edited with PyPDF2, not showing field values without clicking in the field. With the code example below, how do you "set the NeedAppearances flag of the Acroform"?

from PyPDF2 import PdfFileWriter, PdfFileReader

output = PdfFileWriter()
input = PdfFileReader(open("myInputPdf.pdf", "rb"))

field_dictionary = {'Make': 'Toyota', 'Model': 'Tacoma'}

for pageNum in range(input.numPages):
    pageObj = input.getPage(pageNum)
    output.addPage(pageObj)
    output.updatePageFormFieldValues(pageObj, field_dictionary)

outputStream = open("myOutputPdf.pdf", "wb")
output.write(outputStream)

I tried adding in your IF statements but two problems arise: 1) NameObject and BooleanObject are not defined within my PdfFileReader "input" variable (I do not know how to do that) and 2) "/AcroForm" is not found within the PdfFileWriter object (my "output" variable).

Thanks for any help!

Tromar44 avatar Jan 24 '18 23:01 Tromar44

@Tromar44 First, make sure your form is interactive, i.e. the PDF must already have editable fields.

  1. Sorry, I forgot to mention that you will have to import them: from PyPDF2.generic import BooleanObject, NameObject, IndirectObject
  2. Are you sure you are using output._root_object["/AcroForm"] or output.trailer["/Root"]["/AcroForm"] to access the "/AcroForm" key, and not just output["/AcroForm"]?

ademidun avatar Jan 25 '18 00:01 ademidun

@ademidun I thank you very much for your help but unfortunately I'm still not having any luck. To be clear, my simple test pdf form does have two editable fields and the script will populate them with "Toyota" and "Tacoma" respectively but those values are not visible unless I click on the field in the form (they become invisible again after the field loses focus). Here is the rewritten code that includes your suggestions and the results of running the code in inline comments.

from PyPDF2 import PdfFileWriter, PdfFileReader
from PyPDF2.generic import BooleanObject, NameObject, IndirectObject

infile = "myInputPdf.pdf"
outfile = "myOutputPdf.pdf"

reader = PdfFileReader(open(infile, "rb"), strict=False)
if "/AcroForm" in reader.trailer["/Root"]: # result: following "IF code is executed
    print(True)
    reader.trailer["/Root"]["/AcroForm"].update(
        {NameObject("/NeedAppearances"): BooleanObject(True)})

writer = PdfFileWriter()
if "/AcroForm" in writer._root_object: # result: False - following "IF" code is NOT executed
    print(True)
    writer._root_object["/AcroForm"].update(
        {NameObject("/NeedAppearances"): BooleanObject(True)})

if "/AcroForm" in writer._root_object["/AcroForm"]: # result: "KeyError: '/AcroForm'
    print(True)
    writer._root_object["/AcroForm"].update(
        {NameObject("/NeedAppearances"): BooleanObject(True)})

if "/AcroForm" in writer.trailer["/Root"]["/AcroForm"]:  # result: AttributeError: 'PdfFileWriter' object has no attribute 'trailer'
    print(True)
    writer._root_object["/AcroForm"].update(
        {NameObject("/NeedAppearances"): BooleanObject(True)})

field_dictionary = {"Make": "Toyota", "Model": "Tacoma"}

writer.addPage(reader.getPage(0))
writer.updatePageFormFieldValues(writer.getPage(0), field_dictionary)

outputStream = open(outfile, "wb")
writer.write(outputStream)

I would definitely appreciate any more suggestions that you may have! Thank you very much!

Tromar44 avatar Jan 25 '18 18:01 Tromar44

It may also be a browser issue. I don't have the links anymore, but I remember reading about some issues when opening/creating a PDF in Preview on Mac or viewing it in the browser vs. using an Adobe app etc. Maybe try googling things like "form fields only showing on click" or "form fields only active on click using preview mac".

I also recommend reading the PDF spec link I posted. It's a bit dense, but a combination of all these should get you going in the right direction.

ademidun avatar Jan 25 '18 19:01 ademidun

@Tromar44 Okay, I also found this snippet from my code, maybe it will help:

def set_need_appearances_writer(writer: PdfFileWriter):
    # See 12.7.2 and 7.7.2 for more information: http://www.adobe.com/content/dam/acom/en/devnet/acrobat/pdfs/PDF32000_2008.pdf
    try:
        catalog = writer._root_object
        # get the AcroForm tree
        if "/AcroForm" not in catalog:
            writer._root_object.update({
                NameObject("/AcroForm"): IndirectObject(len(writer._objects), 0, writer)
            })

        need_appearances = NameObject("/NeedAppearances")
        writer._root_object["/AcroForm"][need_appearances] = BooleanObject(True)
        # del writer._root_object["/AcroForm"]['NeedAppearances']
        return writer

    except Exception as e:
        print('set_need_appearances_writer() catch : ', repr(e))
        return writer

ademidun avatar Jan 25 '18 19:01 ademidun

@ademidun That worked perfectly (I'd high five you right now if I could)! Thank you very much! For anyone else interested, the following worked for me:

from PyPDF2 import PdfFileWriter, PdfFileReader
from PyPDF2.generic import BooleanObject, NameObject, IndirectObject

def set_need_appearances_writer(writer: PdfFileWriter):
    # See 12.7.2 and 7.7.2 for more information: http://www.adobe.com/content/dam/acom/en/devnet/acrobat/pdfs/PDF32000_2008.pdf
    try:
        catalog = writer._root_object
        # get the AcroForm tree
        if "/AcroForm" not in catalog:
            writer._root_object.update({
                NameObject("/AcroForm"): IndirectObject(len(writer._objects), 0, writer)})

        need_appearances = NameObject("/NeedAppearances")
        writer._root_object["/AcroForm"][need_appearances] = BooleanObject(True)
        return writer

    except Exception as e:
        print('set_need_appearances_writer() catch : ', repr(e))
        return writer

infile = "input.pdf"
outfile = "output.pdf"

reader = PdfFileReader(open(infile, "rb"), strict=False)
if "/AcroForm" in reader.trailer["/Root"]:
    reader.trailer["/Root"]["/AcroForm"].update(
        {NameObject("/NeedAppearances"): BooleanObject(True)})

writer = PdfFileWriter()
set_need_appearances_writer(writer)
if "/AcroForm" in writer._root_object:
    writer._root_object["/AcroForm"].update(
        {NameObject("/NeedAppearances"): BooleanObject(True)})

field_dictionary = {"Make": "Toyota", "Model": "Tacoma"}

writer.addPage(reader.getPage(0))
writer.updatePageFormFieldValues(writer.getPage(0), field_dictionary)

with open(outfile, "wb") as fp:
    writer.write(fp)

Tromar44 avatar Jan 25 '18 19:01 Tromar44

@ademidun you are great!!!

kissmett avatar Feb 02 '18 08:02 kissmett

Just stumbled upon this solution - great work! A couple of issues I noticed - can you reproduce them? - won't have time to send test case details for a couple of days yet if you need them; we had been using the good-ol fdfgen-then-pdftk-subprocess-call method but would like to get away from the external pdftk dependency so pypdf2 is great:

  • text field values show on the generated pdf, but checkbox field values (populated with True or False) don't seem to show up
  • there are some vertical shifting issues in pypdf2 output as compared to pdftk output - a few of the fields get bumped up or down on the generated pdf

caver456 avatar Feb 08 '18 14:02 caver456

output.pdf The fix does not work for all fields in this file. For example, the first phone field does not show its value, while for some reason the second one and a few more fields do, so the fix is not fully working.

shurshilov avatar Apr 09 '18 12:04 shurshilov

Hi, I am facing the same issue. I have also tried setting NeedAppearances to true. When I edit a PDF using PyPDF2, some fields display correctly and some display only after I click on the field. Please help me out with this issue, as it is blocking my work. Thank you

saipawan999 avatar Oct 28 '18 11:10 saipawan999

The code works great, but only for PDFs with one page. I tried splitting my PDF into several one-page files and looped through them. This worked great, but when I merged them back together, the click-to-reveal-text problem reemerged. The problem lies in the .addPage command of the PdfFileWriter.

for page_number in range(pdf.total_pages):
    pdf2.addPage(pdf.getPage(page_number))
    pdf2.updatePageFormFieldValues(pdf2.getPage(page_number), field_dictionary)

When I enter this and try to save, I get an error message: "TypeError: argument should be integer or None, not 'NullObject'". It seems that .addPage does not append to the file writer but treats each page as a separate object. Does someone have a solution for this?

Problem solved: I figured out the problem was that I was working with a protected PDF. I manually split the PDF and manually recombined it, and now it works great. The solution is often right in front of your nose.

fvw222 avatar Feb 15 '19 12:02 fvw222

Hi All,

Thanks for your help.

I was able to view the text fields of the PDF form using PyPDF2, but I still could not figure out how to make the checkboxes of the PDF form visible (NeedAppearances).

Tried with this logic:

catalog = writer._root_object
if '/AcroForm' in catalog:
    writer._root_object["/AcroForm"].update(
        {NameObject("/NeedAppearances"): BooleanObject(True)})

Thanks in advance.

aatish29 avatar Feb 19 '19 10:02 aatish29

I found answer for checkboxes issue at https://stackoverflow.com/questions/35538851/how-to-check-uncheck-checkboxes-in-a-pdf-with-python-preferably-pypdf2.

def updateCheckboxValues(page, fields):

    for j in range(0, len(page['/Annots'])):
        writer_annot = page['/Annots'][j].getObject()
        for field in fields:
            if writer_annot.get('/T') == field:
                writer_annot.update({
                    NameObject("/V"): NameObject(fields[field]),
                    NameObject("/AS"): NameObject(fields[field])
                })

As the comment there says, the checked value could be anything, depending on how the form was created. For me it was present in '/AP', and I extracted it using list(writer_annot.get('/AP').get('/N').keys())[0].
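
Taking the first key works when the "on" state happens to come first, but /AP /N normally also contains an /Off entry, and its position is not guaranteed. A slightly more defensive sketch, again using plain dicts as stand-ins for the annotation objects (the function name and the field name "agree" are made up for illustration):

```python
def checkbox_on_state(annotation: dict):
    """Return the appearance state name that means 'checked', or None."""
    normal = annotation.get("/AP", {}).get("/N", {})
    for state in normal:
        # "/Off" is the conventional name of the unchecked state.
        if state != "/Off":
            return state
    return None


widget = {"/T": "agree", "/AP": {"/N": {"/Off": "<stream>", "/1": "<stream>"}}}
print(checkbox_on_state(widget))  # /1
```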

karnh avatar Mar 25 '19 15:03 karnh

OK, I have implemented the above and it works on my PDF forms. However, once the form has been updated by the Python code, it can't be run through the code a second time, as getFormFields returns an empty list. If I open the updated PDF in Adobe, add a space to the end of a form field value, save, and run the code on the form again, getFormFields returns the correct list.

madornetto avatar Mar 26 '19 18:03 madornetto

I am having the same problem: fields not visible fixed by above-mentioned set_need_appearances_writer() approach but getFormFields/pdftk dump_data_fields does not see them.

In addition, it looks like my fonts somehow get messed up: one of the fields is actually a barcode font. But, after going through PyPDF2 to make a copy with updated fields, the field that uses the barcode font in the original copy now uses one of the other fonts.
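
One thing that may be worth checking here (an assumption on my part, not a confirmed cause): when /NeedAppearances is set, the viewer rebuilds the field's appearance itself, and as far as I understand it takes the font from the field's /DA string, like the /DA(0 0 0 rg /Helv 12 Tf) in the dumps earlier in this thread. A small sketch for pulling the font name and size out of such a string, with a hypothetical helper name:

```python
import re

# "/FontName size Tf" is the font-selection operator inside a /DA string.
_DA_FONT = re.compile(r"/(\S+)\s+([0-9.]+)\s+Tf")


def parse_default_appearance(da: str):
    """Return (font_name, font_size) from a /DA string, or None if absent."""
    match = _DA_FONT.search(da)
    if match is None:
        return None
    return match.group(1), float(match.group(2))


print(parse_default_appearance("0 0 0 rg /Helv 12 Tf"))  # ('Helv', 12.0)
```

Comparing this value for the barcode field in the original and in the copy might show whether the font reference itself changed or only the rendering did.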

ghost avatar Apr 11 '19 21:04 ghost

I'm experiencing the same click-to-reveal-text issue. Here are a few interesting things I have noticed.

  • When using some of the irs forms e.g. https://www.irs.gov/pub/irs-pdf/f1095c.pdf, the issue doesn't happen.
  • When creating forms with PDFElement's 'Form Field Recognition' feature, the issue doesn't happen.
  • When manually adding fields using PDFElement, the issue happens sometimes.

willingham avatar Oct 03 '19 16:10 willingham

it can't be run through the code a second time, as getFormFields returns an empty list.

For reference, I just stumbled on the same issue. The problem is that the generated pdf does not have an /AcroForm, and the easiest solution is probably to copy it over from the source file like this:

trailer = reader.trailer["/Root"]["/AcroForm"]
writer._root_object.update({
        NameObject('/AcroForm'): trailer
    })
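
To illustrate why a missing /AcroForm makes the field list come back empty, here is a toy model with plain dicts in place of the PyPDF2 catalog objects. It is only a sketch: list_field_names is a hypothetical stand-in for whatever lists the fields, and it ignores nested fields under /Kids.

```python
def list_field_names(catalog: dict) -> list:
    """Collect /T names from /AcroForm /Fields; empty if there is no form."""
    fields = catalog.get("/AcroForm", {}).get("/Fields", [])
    return [field["/T"] for field in fields if "/T" in field]


source_catalog = {"/AcroForm": {"/Fields": [{"/T": "Make"}, {"/T": "Model"}]}}
output_catalog = {}  # pages were copied, but not the form dictionary

print(list_field_names(output_catalog))  # []

# Mirrors the two lines above: reuse the source's /AcroForm in the output.
output_catalog["/AcroForm"] = source_catalog["/AcroForm"]
print(list_field_names(output_catalog))  # ['Make', 'Model']
```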

mjl avatar Oct 08 '19 14:10 mjl

@mjl can you elaborate how to implement those lines?

Nivatius avatar Oct 19 '19 13:10 Nivatius

Has anyone figured out a solution to set /NeedAppearances for a pdf with multiple pages?

zoiiieee avatar Jan 19 '20 11:01 zoiiieee

To include multiple pages in the output PDF, I added the pages from the template to the output file....

if "/AcroForm" in pdf2._root_object:
    pdf2._root_object["/AcroForm"].update(
        {NameObject("/NeedAppearances"): BooleanObject(True)})
    pdf2.addPage(pdf.getPage(0))
    pdf2.updatePageFormFieldValues(pdf2.getPage(0), student_data)
    # Added: copy the remaining template pages into the output.
    pdf2.addPage(pdf.getPage(1))
    pdf2.addPage(pdf.getPage(2))
    with open(cs_output, "wb") as outputStream:
        pdf2.write(outputStream)
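
Note that the snippet above fills fields on page 0 only and copies the other pages unchanged. If fields on later pages also need values, every page has to be visited. A minimal sketch of that loop, with plain dicts standing in for page and annotation objects (the function name is hypothetical, and this is not a claim about what fixes the NeedAppearances behaviour on later pages):

```python
def update_field_values(pages: list, values: dict) -> int:
    """Set /V on every annotation whose /T matches; return how many changed."""
    changed = 0
    for page in pages:
        # Pages without form fields may have no /Annots entry at all.
        for annotation in page.get("/Annots", []):
            name = annotation.get("/T")
            if name in values:
                annotation["/V"] = values[name]
                changed += 1
    return changed


pages = [
    {"/Annots": [{"/T": "Make"}]},
    {"/Annots": [{"/T": "Model"}]},
    {},  # a page without form fields
]
print(update_field_values(pages, {"Make": "Toyota", "Model": "Tacoma"}))  # 2
```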

sstamand avatar Jan 30 '20 16:01 sstamand

To include multiple pages in the output PDF, I added the pages from the template to the output file....

I tried the same thing but Need Appearances seems to apply only to the first page. All the fields on the second page are hidden until focused.

zoiiieee avatar Jan 30 '20 16:01 zoiiieee

Does anyone have a working fix for this issue for multi-page PDFs?

jeffneuen avatar Feb 28 '20 20:02 jeffneuen

@mjl can you elaborate how to implement those lines?

You will have a pdf-reader reading in the origin file and a pdf-writer, creating the new pdf (see code of @Tromar44 above). Now you simply need to "copy" over the AcroForm with the lines @mjl presented.

brunnurs avatar Mar 23 '20 17:03 brunnurs

From all those explanations I arrived (as brunnurs stated) at this code. It works for me. It fills text entries and checkboxes in a multi-page pdf, and all changes can be seen using any simple pdf reader.

from PyPDF2 import PdfFileReader, PdfFileWriter
from PyPDF2.generic import BooleanObject, NameObject, IndirectObject, TextStringObject


def set_need_appearances_writer(writer):
    try:
        catalog = writer._root_object
        # get the AcroForm tree and add the "/NeedAppearances" attribute
        if "/AcroForm" not in catalog:
            writer._root_object.update({
                NameObject("/AcroForm"): IndirectObject(len(writer._objects), 0, writer)})

        need_appearances = NameObject("/NeedAppearances")
        writer._root_object["/AcroForm"][need_appearances] = BooleanObject(True)
        return writer

    except Exception as e:
        print('set_need_appearances_writer() catch : ', repr(e))
        return writer


class PdfFileFiller(object):

    def __init__(self, infile):
        self.pdf = PdfFileReader(open(infile, "rb"), strict=False)
        if "/AcroForm" in self.pdf.trailer["/Root"]:
            self.pdf.trailer["/Root"]["/AcroForm"].update(
                {NameObject("/NeedAppearances"): BooleanObject(True)})

    def update_form_values(self, outfile, newvals=None, newchecks=None):
        self.pdf2 = MyPdfFileWriter()

        # copy the AcroForm over from the source file
        trailer = self.pdf.trailer["/Root"]["/AcroForm"]
        self.pdf2._root_object.update({
            NameObject('/AcroForm'): trailer})

        set_need_appearances_writer(self.pdf2)
        if "/AcroForm" in self.pdf2._root_object:
            self.pdf2._root_object["/AcroForm"].update(
                {NameObject("/NeedAppearances"): BooleanObject(True)})

        for i in range(self.pdf.getNumPages()):
            self.pdf2.addPage(self.pdf.getPage(i))
            self.pdf2.updatePageFormFieldValues(self.pdf2.getPage(i), newvals)
            self.pdf2.updatePageFormCheckboxValues(self.pdf2.getPage(i), newchecks)

        with open(outfile, 'wb') as out:
            self.pdf2.write(out)


class MyPdfFileWriter(PdfFileWriter):

    def __init__(self):
        super().__init__()

    def updatePageFormCheckboxValues(self, page, fields):
        for j in range(0, len(page['/Annots'])):
            writer_annot = page['/Annots'][j].getObject()
            for field in fields:
                if writer_annot.get('/T') == field:
                    #print('-------------------------------------')
                    #print('     FOUND', field)
                    #print(writer_annot.get('/V'))
                    writer_annot.update({
                        NameObject("/V"): NameObject(fields[field]),
                        NameObject("/AS"): NameObject(fields[field])
                    })


if __name__ == '__main__':

    origin = '900in.pdf'
    destination = '900out.pdf'
    newvals = {"IDETNCON[0]": "A123456T",
               "NOMSOL[0]": "ARTICA S.L."}
    newchecks = {"periodeliq1[0]": "/1"}

    c = PdfFileFiller(origin)
    c.update_form_values(outfile=destination,
                         newvals=newvals,
                         newchecks=newchecks)

hchillon avatar Apr 17 '20 21:04 hchillon

The last code fails for checkboxes in some pdf readers. I modified my MyPdfFileWriter class:

def updatePageFormCheckboxValues(self, page, fields):

    for j in range(0, len(page['/Annots'])):
        writer_annot = page['/Annots'][j].getObject()
        for field in fields:
            if writer_annot.get('/T') == field:
                if fields[field] in ('/1', '/Yes'): # You choose which values use in your code
                    writer_annot.update({
                        NameObject("/V"): NameObject(fields[field]),
                        NameObject("/AS"): NameObject(fields[field])
                    })
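
Instead of hard-coding which values count as "checked", the allowed state could be read from each checkbox's own /AP /N entry. The following is a sketch under that assumption, with plain dicts in place of PyPDF2 annotation objects and a hypothetical function name; in PyPDF2 the assigned values would be NameObject instances as in the method above.

```python
def set_checkbox(annotation: dict, checked: bool) -> bool:
    """Set /V and /AS to the widget's own on-state, or to /Off."""
    states = annotation.get("/AP", {}).get("/N", {})
    on_states = [state for state in states if state != "/Off"]
    if checked and not on_states:
        # No known on-state: leave the annotation untouched.
        return False
    state = on_states[0] if checked else "/Off"
    annotation["/V"] = state
    annotation["/AS"] = state
    return True


widget = {"/T": "agree", "/AP": {"/N": {"/Off": "<stream>", "/Yes": "<stream>"}}}
print(set_checkbox(widget, True), widget["/AS"])   # True /Yes
print(set_checkbox(widget, False), widget["/AS"])  # True /Off
```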

hchillon avatar Apr 18 '20 09:04 hchillon

I am still having issues with filled boxes not showing outside of Adobe Acrobat.

from PyPDF2 import PdfFileWriter, PdfFileReader
from PyPDF2.generic import BooleanObject, NameObject, IndirectObject

def set_need_appearances_writer(writer: PdfFileWriter):
    # See 12.7.2 and 7.7.2 for more information: http://www.adobe.com/content/dam/acom/en/devnet/acrobat/pdfs/PDF32000_2008.pdf
    try:
        catalog = writer._root_object
        # get the AcroForm tree
        if "/AcroForm" not in catalog:
            writer._root_object.update({
                NameObject("/AcroForm"): IndirectObject(len(writer._objects), 0, writer)})

        need_appearances = NameObject("/NeedAppearances")
        writer._root_object["/AcroForm"][need_appearances] = BooleanObject(True)
        return writer

    except Exception as e:
        print('set_need_appearances_writer() catch : ', repr(e))
        return writer

infile = "input.pdf"
outfile = "output.pdf"

pdf = PdfFileReader(open(infile, "rb"), strict=False)
if "/AcroForm" in pdf.trailer["/Root"]:
    pdf.trailer["/Root"]["/AcroForm"].update(
        {NameObject("/NeedAppearances"): BooleanObject(True)})

pdf2 = PdfFileWriter()
set_need_appearances_writer(pdf2)
if "/AcroForm" in pdf2._root_object:
    pdf2._root_object["/AcroForm"].update(
        {NameObject("/NeedAppearances"): BooleanObject(True)})

field_dictionary = {"iban1_part1": "DE", "Model": "Tacoma"}

pdf2.addPage(pdf.getPage(0))
pdf2.updatePageFormFieldValues(pdf2.getPage(0), field_dictionary)

outputStream = open(outfile, "wb")
pdf2.write(outputStream)

Some boxes are showing properly, some are not - when outside of Acrobat and I need to click on them to show the content.

I also did the same using pdfrw but I got stuck exactly at the same problem.

giorgio-pap avatar May 20 '20 10:05 giorgio-pap