pypdf

XFA fields not updated when using update_page_form_field_values()

Open pubpub-zz opened this issue 1 year ago • 7 comments

Environment

Python 3.10, pypdf 4.3.1+dev (as of September 1st)

Code + PDF

See #2780. When a PDF containing an XFA form is modified, the fields in the XFA dataset are not updated.

pubpub-zz avatar Sep 01 '24 14:09 pubpub-zz

For my use case I found a solution by "just" parsing the xfa:dataset XML, setting the values, and saving the XML string back. The question is: is that a valid approach for every XFA form? If it is, I'll gladly write a PR that enhances the update_page_form_field_values method or adds a separate method for this. But I'm not sure whether my approach is more than a shortcut.
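A minimal sketch of the "parse, set, save back" step, using only the standard library. It assumes the datasets packet is already available as an XML string; how it is read from and written back to the PDF is left out. The `SAMPLE` dataset and the helper name `set_xfa_dataset_values` are hypothetical, and matching elements by tag name alone is a simplification that breaks down if the same name occurs more than once.

```python
import xml.etree.ElementTree as ET

# Namespace of the XFA datasets packet.
XFA_DATA_NS = "http://www.xfa.org/schema/xfa-data/1.0/"

# Hypothetical dataset for illustration; real forms will be structured differently.
SAMPLE = (
    '<xfa:datasets xmlns:xfa="http://www.xfa.org/schema/xfa-data/1.0/">'
    "<xfa:data><form1><topform><F1/></topform></form1></xfa:data>"
    "</xfa:datasets>"
)


def set_xfa_dataset_values(datasets_xml: str, values: dict[str, str]) -> str:
    """Return the datasets XML with the text of matching elements replaced.

    Elements under xfa:data are matched by tag name only (an assumption),
    so every element sharing a name receives the same value.
    """
    # Keep the conventional "xfa" prefix when serializing.
    ET.register_namespace("xfa", XFA_DATA_NS)
    root = ET.fromstring(datasets_xml)
    data = root.find(f"{{{XFA_DATA_NS}}}data")
    if data is None:
        raise ValueError("no xfa:data element found")
    for element in data.iter():
        if element.tag in values:
            element.text = values[element.tag]
    return ET.tostring(root, encoding="unicode")


updated = set_xfa_dataset_values(SAMPLE, {"F1": "hello"})
print(updated)
```

Whether this is sufficient for every XFA form is exactly the open question: it only touches the dataset and leaves the AcroForm field values untouched.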

ljbergmann avatar Sep 02 '24 07:09 ljbergmann

Working only on the XFA will not allow standard tools to extract data from the field information. My idea is simply to extend the existing update_form_fields to also update the XFA dataset if it exists.

pubpub-zz avatar Sep 02 '24 11:09 pubpub-zz

I identified something very interesting during the implementation of the proposed extension of update_form_fields.

The XFA "keys" of fields are different from the names pypdf uses in the AcroForm. To verify this, I created pypdf_field_name_test.pdf. As the screenshot clearly shows, the field is called F1.

If you check the key provided by pypdf you can see that it is 'F1[0]'. You can check with the code below.

from pypdf import PdfReader

reader = PdfReader("pypdf_field_name_test.pdf")
fields = reader.get_form_text_fields()

print(fields)

{'F1[0]': None}

If you look at the XFA template / dataset XML, the field is named F1.

<template xmlns="http://www.xfa.org/schema/xfa-template/3.3/"><?formServer defaultPDFRenderFormat acrobat10.0dynamic?>
	<subform name="form1" layout="tb" locale="de_DE" restoreState="auto">
		<pageSet>
			<pageArea name="Page1" id="Page1">
				<contentArea x="0.25in" y="0.25in" w="197.3mm" h="284.3mm"/>
				<medium stock="a4" short="210mm" long="297mm"/><?templateDesigner expand 1?>
			</pageArea><?templateDesigner expand 1?>
		</pageSet>
		<subform w="197.3mm" h="284.3mm" name="topform">
			<field name="F1" y="12.7mm" x="41.275mm" w="130.175mm" h="9mm">
				<ui>
					<textEdit>
						<border>
							<edge stroke="lowered"/>
						</border>
						<margin/>
					</textEdit>
				</ui>
				<font typeface="Arial"/>
				<para vAlign="middle"/>
				<caption>
					<para vAlign="middle"/>
					<value>
						<text>This is test of pypdf field names</text>
					</value>
				</caption>
			</field><?templateDesigner expand 1?>
		</subform>
		<proto/>
		<desc>
			<text name="version">11.0.9.20240701.1.52.2</text>
		</desc><?templateDesigner expand 1?><?renderCache.subset "Arial" 0 0 ISO-8859-1 4 72 18 0003002900370044004700480049004B004C004F005000510052005300560057005B005C FTadefhilmnopstxy?>
	</subform><?templateDesigner DefaultPreviewDynamic 1?><?templateDesigner DefaultRunAt client?><?templateDesigner FormTargetVersion 33?><?templateDesigner DefaultCaptionFontSettings face:Arial;size:10;weight:normal;style:normal?><?templateDesigner DefaultValueFontSettings face:Arial;size:10;weight:normal;style:normal?><?templateDesigner DefaultLanguage JavaScript?><?acrobat JavaScript strictScoping?><?templateDesigner Rulers horizontal:1, vertical:1, guidelines:1, crosshairs:0?><?templateDesigner Zoom 190?><?templateDesigner WidowOrphanControl 0?><?templateDesigner SaveTaggedPDF 1?><?templateDesigner SavePDFWithEmbeddedFonts 1?><?templateDesigner Grid show:1, snap:1, units:0, color:ff8080, origin:(0,0), interval:(125000,125000), objsnap:0, guidesnap:0, pagecentersnap:0?>
</template>

I suspect that the naming of the fields with [0] was a deliberate choice made in the implementation.

The question that arises now: shouldn't the names in the XFA and the AcroForm be identical, and if not, would removing the [0] to update the XFA be a valid approach?

In my opinion the names of fields should be consistent, and therefore the AcroForm names should not contain [0].
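To make the question concrete, here is a small sketch of what "removing the [0]" could look like, together with a check for when doing so loses information. The helper names and the `Row`/`Name` field names are hypothetical; only `F1[0]` comes from the test file above. It assumes the bracketed suffix is an occurrence index that distinguishes same-named siblings.

```python
import re

# Matches a bracketed occurrence index such as "[0]" or "[12]".
_INDEX = re.compile(r"\[\d+\]")


def strip_som_indices(name: str) -> str:
    """Remove every "[n]" suffix from a (possibly dotted) field name."""
    return _INDEX.sub("", name)


def is_unambiguous(names: list[str]) -> bool:
    """Return True if stripping the indices keeps all names distinct."""
    stripped = [strip_som_indices(n) for n in names]
    return len(set(stripped)) == len(stripped)


print(strip_som_indices("F1[0]"))
print(is_unambiguous(["F1[0]"]))
# Hypothetical repeated rows: both names collapse to "Row.Name".
print(is_unambiguous(["Row[0].Name[0]", "Row[1].Name[0]"]))
```

For a form with a single F1 field the stripped name is enough, but once a name repeats, dropping the index makes the fields indistinguishable, so stripping may not be valid in general.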

Best regards, Leon

ljbergmann avatar Sep 04 '24 13:09 ljbergmann

Some information is provided in https://pdfa.org/norm-refs/XFA-3_3.pdf

See "Field names", page 72 onward.

pubpub-zz avatar Sep 04 '24 17:09 pubpub-zz

Hi @pubpub-zz / @ljbergmann, have we come to a conclusion on how to update field values in a PDF that uses XFA fields?

In the end, I only care about a final PDF that can be opened by end users with the fields populated.


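from pypdf import PdfReader, PdfWriter

# ONTARIO_TENANCY_AGREEMENT_PDF_TEMPLATE_PATH and OUTPUT_PATH are defined elsewhere.
if __name__ == "__main__":
    reader = PdfReader(ONTARIO_TENANCY_AGREEMENT_PDF_TEMPLATE_PATH)

    writer = PdfWriter()
    writer.append(reader)

    with open(OUTPUT_PATH, "wb") as output_stream:
        writer.write(output_stream)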

The code above just gives me a PDF with a single page that says:

If this message is not eventually replaced by the proper contents of the document, your PDF
viewer may not be able to display this type of document.

Seems like my XFA gets erased.

ericxiao251 avatar Sep 14 '25 22:09 ericxiao251

As this issue is still open and does not have a PR, there is no final solution/conclusion on this.

You are of course invited to investigate this issue yourself, document your findings and propose a PR afterwards.

stefan6419846 avatar Sep 15 '25 06:09 stefan6419846

@ericxiao251 XFA forms are part of the document as a whole, not of individual pages. .append() cannot copy them (doing so would erase the writer's existing entry, even if it is empty).

Try something like this:

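from pypdf import PdfReader, PdfWriter

if __name__ == "__main__":
    reader = PdfReader(ONTARIO_TENANCY_AGREEMENT_PDF_TEMPLATE_PATH)

    # Clone the whole document (including the document-level form) instead of appending pages.
    writer = PdfWriter(clone_from=reader)

    with open(OUTPUT_PATH, "wb") as output_stream:
        writer.write(output_stream)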

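Or, more concisely:

from pypdf import PdfWriter

if __name__ == "__main__":
    # clone_from also accepts a file path, and write() accepts an output path.
    writer = PdfWriter(clone_from=ONTARIO_TENANCY_AGREEMENT_PDF_TEMPLATE_PATH)
    writer.write(OUTPUT_PATH)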

pubpub-zz avatar Sep 16 '25 16:09 pubpub-zz