
Getting the page number of a field

Open sbourlon opened this issue 4 months ago • 5 comments

Explanation

I am trying to update each field of a PDF file with some data, but I don't know which PageObject to use after getting the fields from the function PdfReader.get_fields().

Code Example

How would your feature be used? (Remove this if it is not applicable.)

Maybe with a new function:

PdfReader.get_form_page(self, field: Field) -> tuple[int, PageObject]

Return:

  • the page number
  • the page object

Example:

from pypdf import PdfReader

reader = PdfReader("form.pdf")
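fields = reader.get_fields()

# get_fields() returns a dict keyed by field name, so take the first value
first_field = next(iter(fields.values()))
page_number, page = reader.get_form_page(first_field)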

sbourlon avatar Feb 10 '24 00:02 sbourlon

Thanks for the remark. The forms documentation should probably be expanded to better explain the relationship between AcroForm and annotations. I'm not an expert there myself, but here is what I understood:

AcroForm and Annotation

/AcroForm is defined in the document catalog (the /Root entry of the trailer) and looks like this:

  /Fields [ 15 0 R 16 0 R 17 0 R ]
  /DR <<
    /Font <<
      /ZaDb 5 0 R
      /Helv 6 0 R
    >>
  >>
  /DA (/Helv 10 Tf 0 g)
  /NeedAppearances true

This form has 3 fields. The fields have corresponding annotations which are defined on the pages on which they appear. In my case:

15 0 obj
<<
  /Type /Annot
  /Rect [ 182.198 650.66 269.23 668.194 ]
  /Subtype /Widget
  /F 4
  /T <FEFF004E0061006D0065>
  /FT /Tx
  /Q 0
  /BS <<
    /W 1
    /S /S
  >>
  /MK <<
    /BC [ 1 0 0 ]
    /BG [ 1 1 1 ]
  >>
  /DA (/Helv 10 Tf 0 0 0 rg)
  /DV ()
  /V ()
>>
endobj

16 0 obj
<<
  /Type /Annot
  /Rect [ 183.582 623.163 195.537 640.697 ]
  /Subtype /Widget
  /F 4
  /T <FEFF0043006800650063006B>
  /FT /Btn
  /Q 0
  /BS <<
    /W 1
    /S /S
  >>
  /AP <<
    /N <<
      /Yes <<
      >>
    >>
  >>
  /MK <<
    /BC [ 1 0 0 ]
    /BG [ 1 1 1 ]
    /CA (4)
  >>
  /DA (/ZaDb 10 Tf 0 0 0 rg)
  /H /P
  /V /Off
  /AS /Off
>>
endobj

17 0 obj
<<
  /Type /Annot
  /Rect [ 153.694 598.703 189.235 613.2 ]
  /Subtype /Widget
  /F 4
  /T (Submit)
  /FT /Btn
  /Ff 65540
  /H /P
  /BS <<
    /W 1
    /S /S
  >>
  /MK <<
    /BC [ 1 0 0 ]
  >>
  /A <<
    /S /SubmitForm
    /F <<
      /FS /URL
      /F (http://exampe.com)
    >>
  >>
  /AP <<
    /N 10 0 R
    /D 13 0 R
  >>
>>
endobj
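
The link between the two structures is the object number: each reference in /Fields (15, 16, 17) is also the number of a widget annotation object. A minimal sketch for listing those numbers follows. The helper name acroform_field_ids is made up for illustration and is not part of pypdf; it only assumes a reader object that exposes trailer the way PdfReader does.

```python
from typing import Any, List


def acroform_field_ids(reader: Any) -> List[int]:
    """Return the object numbers listed in /AcroForm /Fields (empty if no form).

    Hypothetical helper, not a pypdf API. Assumes every entry in /Fields is an
    indirect reference, as in the example above.
    """
    catalog = reader.trailer["/Root"]
    if "/AcroForm" not in catalog:
        return []
    acro_form = catalog["/AcroForm"]
    if "/Fields" not in acro_form:
        return []
    # Iterating the array yields the raw references, so idnum is available.
    return [ref.idnum for ref in acro_form["/Fields"]]
```

Called as acroform_field_ids(PdfReader("pdflatex-forms.pdf")), this would be expected to give the three numbers from the /Fields array above.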

What pypdf does: The current interface

For the PDF above, pypdf would give:

>>> from pypdf import PdfReader
>>> reader = PdfReader("pdflatex-forms.pdf")
>>> fields = reader.get_fields()
>>> fields
{'Name': {'/T': 'Name', '/FT': '/Tx', '/V': '', '/DV': ''},
'Check': {'/T': 'Check', '/FT': '/Btn', '/V': '/Off', '/_States_': ['/Yes', '/Off']},
'Submit': {'/T': 'Submit', '/FT': '/Btn', '/Ff': 65540, '/_States_': ['/Type', '/Subtype', '/BBox', '/FormType', '/Matrix', '/Resources', '/Filter', '/Off']}}

and

>>> from pypdf import PdfReader
>>> reader = PdfReader("pdflatex-forms.pdf")
>>> for page in reader.pages:
...     for annot in page.annotations:
...         print(annot.get_object())
... 
{'/Type': '/Annot', '/Rect': [182.198, 650.66, 269.23, 668.194], '/Subtype': '/Widget', '/F': 4, '/T': 'Name', '/FT': '/Tx', '/Q': 0, '/BS': {'/W': 1, '/S': '/S'}, '/MK': {'/BC': [1, 0, 0], '/BG': [1, 1, 1]}, '/DA': '/Helv 10 Tf 0 0 0 rg', '/DV': '', '/V': ''}
{'/Type': '/Annot', '/Rect': [183.582, 623.163, 195.537, 640.697], '/Subtype': '/Widget', '/F': 4, '/T': 'Check', '/FT': '/Btn', '/Q': 0, '/BS': {'/W': 1, '/S': '/S'}, '/AP': {'/N': {'/Yes': {}}}, '/MK': {'/BC': [1, 0, 0], '/BG': [1, 1, 1], '/CA': '4'}, '/DA': '/ZaDb 10 Tf 0 0 0 rg', '/H': '/P', '/V': '/Off', '/AS': '/Off'}
{'/Type': '/Annot', '/Rect': [153.694, 598.703, 189.235, 613.2], '/Subtype': '/Widget', '/F': 4, '/T': 'Submit', '/FT': '/Btn', '/Ff': 65540, '/H': '/P', '/BS': {'/W': 1, '/S': '/S'}, '/MK': {'/BC': [1, 0, 0]}, '/A': {'/S': '/SubmitForm', '/F': {'/FS': '/URL', '/F': 'http://exampe.com'}}, '/AP': {'/N': IndirectObject(10, 0, 140368148709328), '/D': IndirectObject(13, 0, 140368148709328)}}

What pypdf does: Under the hood

Within _reader.py there is the part:

        if "/Fields" in tree:
            fields = cast(ArrayObject, tree["/Fields"])
            for f in fields:
                field = f.get_object()  # <--- if you print "f" here, you get the indirect object ID
                self._build_field(field, retval, fileobj, field_attributes)

Given that, you can print:

from pypdf import PdfReader
reader = PdfReader("pdflatex-forms.pdf")
for page in reader.pages:
    for annot in page.annotations:
        print(annot)

which gives

IndirectObject(15, 0, 140141046791376)
IndirectObject(16, 0, 140141046791376)
IndirectObject(17, 0, 140141046791376)

The 3 numbers are:

  • First (15/16/17): idnum - the Identifier for that indirect object
  • Second (0 / 0 / 0): generation - the "version" of that object. Should be 0 in most PDFs.
  • Third (140141046791376 / 140141046791376 / 140141046791376): A reference back to the PdfReader / PdfWriter

Some tinkering

For the widgets you can do something like this:

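from pypdf import PdfReader

reader = PdfReader("pdflatex-forms.pdf")

# Map each annotation's object number to (page index, annotation reference)
widget_id2page_object = {}

for page_i, page in enumerate(reader.pages):
    # page.annotations is None for pages without annotations
    for annot in page.annotations or []:
        widget_id2page_object[annot.idnum] = (page_i, annot)

print(widget_id2page_object)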

giving

{15: (0, IndirectObject(15, 0, 140141046951824)), 16: (0, IndirectObject(16, 0, 140141046951824)), 17: (0, IndirectObject(17, 0, 140141046951824))}

then you can adjust _build_field to pass the f.idnum and add:

        # the new line:
        retval[key][NameObject("idnum")] = NumberObject(idnum)
        # those are old:
        if obj.get(FA.FT, "") == "/Ch":
            retval[key][NameObject("/_States_")] = obj[NameObject(FA.Opt)]

Now, if you call get_fields you get:

{'Name': {'/T': 'Name', '/FT': '/Tx', '/V': '', '/DV': '', 'idnum': 15},
 'Check': {'/T': 'Check',
  '/FT': '/Btn',
  '/V': '/Off',
  'idnum': 16,
  '/_States_': ['/Yes', '/Off']},
 'Submit': {'/T': 'Submit',
  '/FT': '/Btn',
  '/Ff': 65540,
  'idnum': 17,
  '/_States_': ['/Type',
   '/Subtype',
   '/BBox',
   '/FormType',
   '/Matrix',
   '/Resources',
   '/Filter',
   '/Off']}}

Allowing you to take idnum to get to the annotation and thus to the page.
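
That last step can be sketched by joining the two dictionaries. This is only an illustration: the 'idnum' key exists only with the patched _build_field, field_pages is a made-up helper name, and the sample data is copied from the outputs above with the annotation references replaced by placeholder strings.

```python
from typing import Any, Dict, Optional, Tuple


def field_pages(
    fields: Dict[str, Dict[str, Any]],
    widget_id2page_object: Dict[int, Tuple[int, Any]],
) -> Dict[str, Optional[int]]:
    """Map each field name to the zero-based index of the page holding its widget.

    Assumes each field dict carries the non-standard 'idnum' key added by the
    patched _build_field; fields without a matching widget map to None.
    """
    result: Dict[str, Optional[int]] = {}
    for name, field in fields.items():
        entry = widget_id2page_object.get(field.get("idnum"))
        result[name] = entry[0] if entry is not None else None
    return result


# Sample data taken from the outputs above (annotation objects replaced by strings).
fields = {
    "Name": {"/T": "Name", "/FT": "/Tx", "idnum": 15},
    "Check": {"/T": "Check", "/FT": "/Btn", "idnum": 16},
    "Submit": {"/T": "Submit", "/FT": "/Btn", "idnum": 17},
}
widget_id2page_object = {15: (0, "annot 15"), 16: (0, "annot 16"), 17: (0, "annot 17")}

print(field_pages(fields, widget_id2page_object))
# → {'Name': 0, 'Check': 0, 'Submit': 0}
```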

How do we continue?

Now the questions are:

  1. Is there already a simpler way?
  2. How/for what do people use get_fields at the moment? Could we simply add this type of information?
  3. What is a clean way to add this feature?

MartinThoma avatar Feb 10 '24 09:02 MartinThoma

Thank you Martin for this nice clarification.

According to the PDF specification (Document management — Portable document format — Part 1: PDF 1.7), if I understand correctly, an interactive form field does not always have a widget annotation.

source : 12.7. Interactive Forms, in 12.7.1 General

A field’s children in the hierarchy may also include widget annotations (see 12.5.6.19, “Widget Annotations”) that define its appearance on the page

Maybe I should not rely on annotation to find the page number of a field?

  1. What is a clean way to add this feature?

The PDF Format page explains that each object has a numerical ID, which certainly also includes field objects, so maybe we could add a function that returns the PageObject of a PdfObject?

def get_page_object(PdfObject) -> Optional[PageObject]

sbourlon avatar Feb 12 '24 19:02 sbourlon

Things are a little more complex in PDFs. There are the fields, which are objects organized hierarchically under /AcroForm /Fields and are page-independent, and the widget annotations, which are the actual visual objects within the pages. The widgets are listed in the page annotations and therefore belong to a page. A field and its widget can be linked in two ways:

a) the two objects are merged, with all widget and field properties mixed together in one dictionary
b) the two objects are linked, with the "/Parent" property in the widget pointing (as an indirect object) to the field object, and the widget referenced from the "/Kids" array inside the field object

This last case is standard for radio buttons, where one field only stores the chosen value and multiple visual objects (radio toggles) are attached in the Kids property. I see no objection to having the radio toggles on different pages (for example, in a form where the different pages correspond to mutually exclusive subforms). Fields repeated on multiple pages also use a common parent field.

Also, I see no reason (but maybe I'm wrong) to have a field without any widget: these would be purely page-independent.

Therefore, if we implement a get_page function, from my point of view it should return a tuple of pages:

a) () => no page linked
b) (page_obj0,) => 1 page
c) (page_obj0, page_obj1, page_obj2) => multiple widgets
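
A rough sketch of such a lookup, covering both the merged case and the /Parent case, could look like the following. The name pages_showing_field is invented for this illustration, the field is identified by its object number only (generation is ignored), and annotations are assumed to be stored as indirect references. The code relies only on reader.pages, page.annotations, the reference's idnum and get_object(), and the dictionary's raw_get().

```python
from typing import Any, List


def _belongs_to_field(annot_ref: Any, field_idnum: int) -> bool:
    """True if the annotation is the field itself or has it in its /Parent chain."""
    ref = annot_ref
    seen = set()  # guards against malformed files with cyclic /Parent links
    while ref is not None and ref.idnum not in seen:
        if ref.idnum == field_idnum:
            return True
        seen.add(ref.idnum)
        obj = ref.get_object()
        # raw_get returns the reference itself instead of the resolved object
        ref = obj.raw_get("/Parent") if "/Parent" in obj else None
    return False


def pages_showing_field(reader: Any, field_idnum: int) -> List[Any]:
    """Return every page holding at least one widget of the given field.

    Hypothetical helper: an empty list means no page is linked.
    """
    pages = []
    for page in reader.pages:
        # page.annotations is None when the page has no /Annots entry
        for annot_ref in page.annotations or []:
            if _belongs_to_field(annot_ref, field_idnum):
                pages.append(page)
                break  # one hit is enough, avoid listing the page twice
    return pages
```

For the example file discussed earlier, pages_showing_field(reader, 15) would be expected to return a one-element list with the first page; a radio group with toggles on several pages would yield several pages.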

Do we agree ?

Complement: the standard provides an optional property /P in widgets that should point to the page. A lookup to find the page the annotation belongs to should only be needed if it is missing.
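
As a sketch of that order of operations (direct page entry first, scan second), assuming the widget is available as an indirect reference: page_of_widget is a made-up name, and the code uses only the page's indirect_reference attribute, page.annotations, and the dictionary's raw_get(). The key is written here as /P, which is the name the annotation dictionary uses for its page reference.

```python
from typing import Any, Optional


def page_of_widget(reader: Any, widget_ref: Any) -> Optional[Any]:
    """Return the page holding a widget annotation, or None if not found.

    Hypothetical helper; compares object numbers only and ignores generations.
    """
    widget = widget_ref.get_object()
    if "/P" in widget:
        # Fast path: compare the page reference with each page's own reference.
        target = widget.raw_get("/P").idnum
        for page in reader.pages:
            ref = page.indirect_reference
            if ref is not None and ref.idnum == target:
                return page
    # Fallback: the entry is optional (or may be stale), so scan all /Annots.
    for page in reader.pages:
        for annot_ref in page.annotations or []:
            if annot_ref.idnum == widget_ref.idnum:
                return page
    return None
```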

edit: this PDF https://github.com/py-pdf/pypdf/files/14031491/Form_Structure_v50.pdf from #2425 shows some good examples

pubpub-zz avatar Feb 25 '24 10:02 pubpub-zz

@MartinThoma, @stefan6419846 @sbourlon Do you agree with this proposal ?

pubpub-zz avatar Feb 28 '24 20:02 pubpub-zz

If I understand it correctly, the proposed implementation is to enrich the form fields class with a method get_page_indices? Or do we want to implement this on the PdfObject base class?

I am against using get_pages if we indeed just return (zero-based) page indices and thus propose def get_page_indices(self: PdfObject) -> List[int] instead.

stefan6419846 avatar Feb 29 '24 07:02 stefan6419846

I am against using get_pages if we indeed just return (zero-based) page indices and thus propose def get_page_indices(self: PdfObject) -> List[int] instead.

Can you clarify what you dislike about get_pages returning a list of page objects? The function should be in PdfReader: def get_page_using_field(self, fld: PdfObject) -> List[PageObject]:

pubpub-zz avatar Feb 29 '24 11:02 pubpub-zz

Can you clarify what you dislike about get_pages returning a list of page objects?

I have misread your message and thought we would just return the page indices - this would have been a misleading name, while it is completely fine for PageObjects.

def get_page_using_field(self, field: PdfObject) -> List[PageObject] would indeed work, although I am not sure whether the using field here for naming purposes does make sense if we can pass all(?) types of PdfObjects.

stefan6419846 avatar Feb 29 '24 11:02 stefan6419846

def get_page_using_field(self, field: PdfObject) -> List[PageObject] would indeed work, although I am not sure whether the using field here for naming purposes does make sense if we can pass all(?) types of PdfObjects.

we can include some code to confirm it is a field (checking for expected properties)
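
One possible shape for such a check, as a heuristic only: it assumes that a field dictionary carries /T (partial name) or /FT (field type), which will miss kids that inherit /FT and have no /T of their own. looks_like_field is a made-up name, not a pypdf function.

```python
from typing import Any


def looks_like_field(obj: Any) -> bool:
    """Heuristic check: a dictionary with /T or /FT is treated as a form field."""
    return isinstance(obj, dict) and ("/T" in obj or "/FT" in obj)


print(looks_like_field({"/T": "Name", "/FT": "/Tx"}))  # → True
print(looks_like_field({"/Type": "/Page"}))            # → False
```

pypdf's DictionaryObject is a dict subclass, so objects returned by get_object() would pass the isinstance test.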

pubpub-zz avatar Feb 29 '24 11:02 pubpub-zz