pypdf
Getting the page number of a field
Explanation
I am trying to update each field of a PDF file with some data, but I don't know which PageObject to use after getting the fields from PdfReader.get_fields().
Code Example
How would your feature be used? (Remove this if it is not applicable.)
Maybe with a new function:
PdfReader.get_form_page(self, field: Field) -> tuple[int, PageObject]
Return:
- the page number
- the page object
Example:
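from pypdf import PdfReader

reader = PdfReader("form.pdf")
fields = reader.get_fields()  # dict mapping field names to field objects
first_field = next(iter(fields.values()))
# get_form_page is the proposed method; it does not exist in pypdf yet
page_number, page = reader.get_form_page(first_field)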
Thanks for the remark. The forms documentation should probably be expanded to better explain the relationship between AcroForm and annotations. I'm not an expert there myself, but here is what I understood:
AcroForm and Annotation
/AcroForm is defined in the document catalog (the /Root object referenced from the trailer) and looks like this:
/Fields [ 15 0 R 16 0 R 17 0 R ]
/DR <<
/Font <<
/ZaDb 5 0 R
/Helv 6 0 R
>>
>>
/DA (/Helv 10 Tf 0 g)
/NeedAppearances true
This form has 3 fields. The fields have corresponding annotations which are defined on the pages on which they appear. In my case:
15 0 obj
<<
/Type /Annot
/Rect [ 182.198 650.66 269.23 668.194 ]
/Subtype /Widget
/F 4
/T <FEFF004E0061006D0065>
/FT /Tx
/Q 0
/BS <<
/W 1
/S /S
>>
/MK <<
/BC [ 1 0 0 ]
/BG [ 1 1 1 ]
>>
/DA (/Helv 10 Tf 0 0 0 rg)
/DV ()
/V ()
>>
endobj
16 0 obj
<<
/Type /Annot
/Rect [ 183.582 623.163 195.537 640.697 ]
/Subtype /Widget
/F 4
/T <FEFF0043006800650063006B>
/FT /Btn
/Q 0
/BS <<
/W 1
/S /S
>>
/AP <<
/N <<
/Yes <<
>>
>>
>>
/MK <<
/BC [ 1 0 0 ]
/BG [ 1 1 1 ]
/CA (4)
>>
/DA (/ZaDb 10 Tf 0 0 0 rg)
/H /P
/V /Off
/AS /Off
>>
endobj
17 0 obj
<<
/Type /Annot
/Rect [ 153.694 598.703 189.235 613.2 ]
/Subtype /Widget
/F 4
/T (Submit)
/FT /Btn
/Ff 65540
/H /P
/BS <<
/W 1
/S /S
>>
/MK <<
/BC [ 1 0 0 ]
>>
/A <<
/S /SubmitForm
/F <<
/FS /URL
/F (http://exampe.com)
>>
>>
/AP <<
/N 10 0 R
/D 13 0 R
>>
>>
endobj
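A side note on the /T entries of objects 15 and 16: they are hex strings. Assuming they are UTF-16BE text with a leading byte order mark (FEFF), which is what they look like here, the standard library is enough to decode them. This is only a sketch to show where the field names reported by pypdf come from; decode_pdf_hex_string is a made-up helper, not a pypdf function.

```python
def decode_pdf_hex_string(hex_digits: str) -> str:
    """Decode the hex digits of a PDF hex string such as <FEFF004E...>.

    Assumption: the string starts with the UTF-16 byte order mark FEFF.
    Strings without that mark use another encoding and are rejected here.
    """
    raw = bytes.fromhex(hex_digits)
    if not raw.startswith(b"\xfe\xff"):
        raise ValueError("expected a UTF-16BE byte order mark (FEFF)")
    # Drop the two BOM bytes, then decode the rest as big-endian UTF-16.
    return raw[2:].decode("utf-16-be")


print(decode_pdf_hex_string("FEFF004E0061006D0065"))      # Name
print(decode_pdf_hex_string("FEFF0043006800650063006B"))  # Check
```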
What pypdf does: The current interface
For the PDF above, pypdf would give:
>>> from pypdf import PdfReader
>>> reader = PdfReader("pdflatex-forms.pdf")
>>> fields = reader.get_fields()
>>> fields
{'Name': {'/T': 'Name', '/FT': '/Tx', '/V': '', '/DV': ''},
'Check': {'/T': 'Check', '/FT': '/Btn', '/V': '/Off', '/_States_': ['/Yes', '/Off']},
'Submit': {'/T': 'Submit', '/FT': '/Btn', '/Ff': 65540, '/_States_': ['/Type', '/Subtype', '/BBox', '/FormType', '/Matrix', '/Resources', '/Filter', '/Off']}}
and
>>> from pypdf import PdfReader
>>> reader = PdfReader("pdflatex-forms.pdf")
>>> for page in reader.pages:
...     for annot in page.annotations:
...         print(annot.get_object())
...
{'/Type': '/Annot', '/Rect': [182.198, 650.66, 269.23, 668.194], '/Subtype': '/Widget', '/F': 4, '/T': 'Name', '/FT': '/Tx', '/Q': 0, '/BS': {'/W': 1, '/S': '/S'}, '/MK': {'/BC': [1, 0, 0], '/BG': [1, 1, 1]}, '/DA': '/Helv 10 Tf 0 0 0 rg', '/DV': '', '/V': ''}
{'/Type': '/Annot', '/Rect': [183.582, 623.163, 195.537, 640.697], '/Subtype': '/Widget', '/F': 4, '/T': 'Check', '/FT': '/Btn', '/Q': 0, '/BS': {'/W': 1, '/S': '/S'}, '/AP': {'/N': {'/Yes': {}}}, '/MK': {'/BC': [1, 0, 0], '/BG': [1, 1, 1], '/CA': '4'}, '/DA': '/ZaDb 10 Tf 0 0 0 rg', '/H': '/P', '/V': '/Off', '/AS': '/Off'}
{'/Type': '/Annot', '/Rect': [153.694, 598.703, 189.235, 613.2], '/Subtype': '/Widget', '/F': 4, '/T': 'Submit', '/FT': '/Btn', '/Ff': 65540, '/H': '/P', '/BS': {'/W': 1, '/S': '/S'}, '/MK': {'/BC': [1, 0, 0]}, '/A': {'/S': '/SubmitForm', '/F': {'/FS': '/URL', '/F': 'http://exampe.com'}}, '/AP': {'/N': IndirectObject(10, 0, 140368148709328), '/D': IndirectObject(13, 0, 140368148709328)}}
What pypdf does: Under the hood
Within _reader.py there is the part:
if "/Fields" in tree:
    fields = cast(ArrayObject, tree["/Fields"])
    for f in fields:
        field = f.get_object()  # <--- if you print "f" here, you get the indirect object ID
        self._build_field(field, retval, fileobj, field_attributes)
Given that, you can print:
from pypdf import PdfReader
reader = PdfReader("pdflatex-forms.pdf")
for page in reader.pages:
    for annot in page.annotations:
        print(annot)
which gives
IndirectObject(15, 0, 140141046791376)
IndirectObject(16, 0, 140141046791376)
IndirectObject(17, 0, 140141046791376)
The 3 numbers are:
- First (15/16/17): idnum - the Identifier for that indirect object
- Second (0 / 0 / 0): generation - the "version" of that object. Should be 0 in most PDFs.
- Third ( 140141046791376 / 140141046791376 / 140141046791376): A reference back to the PdfReader / PdfWriter
Some tinkering
For the widgets you can do something like this:
from pypdf import PdfReader
reader = PdfReader("pdflatex-forms.pdf")
widget_id2page_object = {}
for page_i, page in enumerate(reader.pages):
    for annot in page.annotations:
        widget_id2page_object[annot.idnum] = (page_i, annot)
print(widget_id2page_object)
giving
{15: (0, IndirectObject(15, 0, 140141046951824)), 16: (0, IndirectObject(16, 0, 140141046951824)), 17: (0, IndirectObject(17, 0, 140141046951824))}
Then you can adjust _build_field to pass the f.idnum and add:
# the new line:
retval[key][NameObject("idnum")] = NumberObject(idnum)
# those are old:
if obj.get(FA.FT, "") == "/Ch":
    retval[key][NameObject("/_States_")] = obj[NameObject(FA.Opt)]
Now, if you call get_fields you get:
{'Name': {'/T': 'Name', '/FT': '/Tx', '/V': '', '/DV': '', 'idnum': 15},
 'Check': {'/T': 'Check',
           '/FT': '/Btn',
           '/V': '/Off',
           'idnum': 16,
           '/_States_': ['/Yes', '/Off']},
 'Submit': {'/T': 'Submit',
            '/FT': '/Btn',
            '/Ff': 65540,
            'idnum': 17,
            '/_States_': ['/Type',
                          '/Subtype',
                          '/BBox',
                          '/FormType',
                          '/Matrix',
                          '/Resources',
                          '/Filter',
                          '/Off']}}
This allows you to use idnum to get to the annotation and thus to the page.
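For comparison, here is a rough sketch of the same lookup that does not require patching pypdf. It assumes, as in the PDF above, that each entry of /AcroForm /Fields is itself the widget annotation listed on a page. The function name field_page_indices is made up. It only relies on reader.pages, page.annotations and reader.trailer, so pypdf is not imported in the block itself.

```python
def field_page_indices(reader):
    """Map each top-level field name to the index of the page showing it.

    `reader` is expected to behave like a pypdf PdfReader. Assumption: a
    field and its widget are the same object (no /Kids), as in the PDF above.
    Fields that are not found on any page map to None.
    """
    # Object number of every annotation -> zero-based page index.
    page_of_idnum = {}
    for page_index, page in enumerate(reader.pages):
        for annot in page.annotations or []:  # None when a page has no /Annots
            idnum = getattr(annot, "idnum", None)  # direct objects have no id
            if idnum is not None:
                page_of_idnum[idnum] = page_index

    root = reader.trailer["/Root"]
    if "/AcroForm" not in root or "/Fields" not in root["/AcroForm"]:
        return {}

    result = {}
    for field_ref in root["/AcroForm"]["/Fields"]:
        name = field_ref.get_object().get("/T")
        result[name] = page_of_idnum.get(getattr(field_ref, "idnum", None))
    return result
```

With the example file, calling field_page_indices(PdfReader("pdflatex-forms.pdf")) should map all three field names to page index 0, which matches the widget_id2page_object output above.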
How do we continue?
Now the questions are:
- Is there already a simpler way?
- How/for what do people use get_fields at the moment? Could we simply add this type of information?
- What is a clean way to add this feature?
Thank you Martin for this nice clarification.
According to the PDF specification (Document management — Portable document format — Part 1: PDF 1.7), if I understand correctly, an interactive form field does not always have a widget annotation.
Source: 12.7 Interactive Forms, in 12.7.1 General
A field’s children in the hierarchy may also include widget annotations (see 12.5.6.19, “Widget Annotations”) that define its appearance on the page
Maybe I should not rely on annotation to find the page number of a field?
- What is a clean way to add this feature?
The PDF Format page explains that each object has a numerical ID, which certainly also includes field objects, so maybe we could add a function that returns the PageObject of a PdfObject?
def get_page_object(obj: PdfObject) -> Optional[PageObject]
Things are a little more complex in PDFs. There are the fields, which are objects organized hierarchically under /AcroForm /Fields and which are page-independent, and there are the widget annotations, which are the actual visual objects on the pages. The widgets are listed in the page annotations and therefore belong to a page. A field and a widget can be linked in two ways:
a) the two objects are merged, with all widget and field properties mixed together in one dictionary
b) the two objects are linked: the "/Parent" property of the widget points (as an indirect object) to the field object, and the widget is referenced in the "/Kids" array inside the field object.
This last case is standard for radio buttons, where one field only stores the chosen value and multiple visual objects (radio toggles) are attached through the /Kids property. I see no objection to having the radio toggles on different pages (for example, in a form where the different pages correspond to mutually exclusive subforms). Fields repeated on multiple pages also use a common parent field.
Also, I see no reason (but maybe I'm wrong) to have a field without any widget; such fields would be purely page-independent.
Therefore, if we implement a get_page function, from my point of view it should return a tuple of pages:
a) () => no page linked
b) (page_obj0,) => 1 page
c) (page_obj0, page_obj1, page_obj2) => for multiple widgets
Do we agree?
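To make the three cases concrete, here is a minimal sketch of how such a tuple could be computed. The names widget_refs and pages_showing_field are hypothetical, and this is not how pypdf implements anything. It assumes that the objects passed in behave like pypdf's IndirectObject (with idnum and get_object()), that pages expose annotations like PageObject does, and that widgets are stored as indirect objects.

```python
def widget_refs(field_ref):
    """Collect references to the terminal objects below a field.

    A field without /Kids is treated as its own widget (merged case a);
    otherwise /Kids is followed recursively (linked case b).
    """
    field = field_ref.get_object()
    if "/Kids" not in field:
        return [field_ref]
    refs = []
    for kid in field["/Kids"]:
        refs.extend(widget_refs(kid))
    return refs


def pages_showing_field(pages, field_ref):
    """Return a tuple of the pages whose annotations contain a widget of the field."""
    # Assumption: widgets are indirect objects, so they carry an idnum.
    wanted = {ref.idnum for ref in widget_refs(field_ref) if hasattr(ref, "idnum")}
    found = []
    for page in pages:
        annots = page.annotations or []  # None when the page has no /Annots
        if any(getattr(annot, "idnum", None) in wanted for annot in annots):
            found.append(page)
    return tuple(found)
```

A field whose widgets sit on the first and third page would give a two-element tuple, and a field that never appears in any page's annotations would give ().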
Complement: the standard provides an optional /P property in widgets that should point to the page. A lookup to find the page the annotation belongs to should only be used if /P is missing.
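A possible sketch of that order of preference, for a single widget: first try the optional page reference stored in the widget, then fall back to scanning the page annotations. The helper names are made up. The code assumes pypdf-like objects (references with idnum, pages with indirect_reference and annotations) and compares object numbers only, ignoring the generation.

```python
def _idnum(obj):
    """Object number of a reference, or of an object that remembers its reference."""
    if obj is None:
        return None
    idnum = getattr(obj, "idnum", None)
    if idnum is None:
        idnum = getattr(getattr(obj, "indirect_reference", None), "idnum", None)
    return idnum


def page_index_of_widget(pages, widget_ref):
    """Return the zero-based index of the page holding a widget, or None."""
    widget = widget_ref.get_object()
    page_idnums = [_idnum(page) for page in pages]

    # 1) Use the optional page entry of the annotation when it is usable.
    target = _idnum(widget.get("/P"))
    if target is not None and target in page_idnums:
        return page_idnums.index(target)

    # 2) Otherwise look for the widget in each page's annotations.
    wanted = _idnum(widget_ref)
    if wanted is not None:
        for index, page in enumerate(pages):
            for annot in page.annotations or []:
                if _idnum(annot) == wanted:
                    return index
    return None
```

The scan in step 2 is linear in the number of annotations, so for large documents it may be worth building the idnum-to-page mapping once instead of calling this per widget.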
edit: this PDF https://github.com/py-pdf/pypdf/files/14031491/Form_Structure_v50.pdf from #2425 shows some good examples
@MartinThoma, @stefan6419846, @sbourlon Do you agree with this proposal?
If I understand it correctly, the proposed implementation is to enrich the form fields class with a method get_page_indices? Or do we want to implement this on the PdfObject base class?
I am against using get_pages if we indeed just return (zero-based) page indices and thus propose def get_page_indices(self: PdfObject) -> List[int] instead.
Can you clarify what you dislike about get_pages returning a list of page objects? The function should be in PdfReader: def get_page_using_field(self, fld: PdfObject) -> List[PageObject]:
I misread your message and thought we would just return the page indices. That would have been a misleading name, while it is completely fine for PageObjects.
def get_page_using_field(self, field: PdfObject) -> List[PageObject] would indeed work, although I am not sure whether "using field" makes sense in the name if we can pass all(?) types of PdfObjects.
We can include some code to confirm it is a field (checking for the expected properties).
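Such a check could look roughly like the sketch below. It is only a heuristic under my own assumptions about which keys to look at (/FT, /Subtype /Widget, or a named node with /Kids), and looks_like_field is a hypothetical name. Note that /Parent and /Kids alone are not enough, since page tree nodes use them too.

```python
def looks_like_field(obj) -> bool:
    """Heuristic check that a dictionary-like PDF object is a form field.

    Assumption: this is a guess based on typical keys, not a full validation.
    """
    if not hasattr(obj, "get"):
        return False
    if obj.get("/Type") in ("/Page", "/Pages"):
        return False  # page tree nodes also have /Parent and /Kids
    if "/FT" in obj:
        return True  # explicit field type
    if obj.get("/Subtype") == "/Widget":
        return True  # widget that may inherit /FT from its parent
    # Non-terminal field: named, has children, and is not an annotation.
    return "/T" in obj and "/Kids" in obj and "/Subtype" not in obj
```

For the three objects of the example PDF this returns True, since each of them carries /FT.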