pypdf icon indicating copy to clipboard operation
pypdf copied to clipboard

New function: get_form_fields_in_list()

Open pascal-brand38 opened this issue 1 year ago • 9 comments

Get form fields, using a list of labels, such as: get_form_fields_in_list([ '/Tx', '/Btn' ]) to get text and button (checkbox,...) values

Signed-off-by: Pascal Brand [email protected]

pascal-brand38 avatar Aug 10 '22 17:08 pascal-brand38

Codecov Report

Merging #1220 (3c58146) into main (2df8a4c) will increase coverage by 0.00%. The diff coverage is 100.00%.

@@           Coverage Diff           @@
##             main    #1220   +/-   ##
=======================================
  Coverage   92.43%   92.43%           
=======================================
  Files          24       24           
  Lines        4941     4943    +2     
  Branches     1025     1025           
=======================================
+ Hits         4567     4569    +2     
  Misses        234      234           
  Partials      140      140           
Impacted Files Coverage Δ
PyPDF2/_reader.py 90.69% <100.00%> (+0.02%) :arrow_up:

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

codecov[bot] avatar Aug 10 '22 17:08 codecov[bot]

Thank you for your contribution 🙏

MartinThoma avatar Aug 10 '22 21:08 MartinThoma

I want to avoid adding more and more methods. Although this adds flexibility, it does confuse. Instead, we should make a decision in which context those functions will be used and make them fit best for that.

In it's current form I don't think I will mehr it, but I appreciate that your pr does not make a breaking change and the discussion it starts. Let's use it to have that discussion :-)

MartinThoma avatar Aug 10 '22 21:08 MartinThoma

Do you happen to need that function? What do you need it for?

MartinThoma avatar Aug 10 '22 21:08 MartinThoma

Hi Martin,

I have to retrieve pdf form information, not only text ones, but also checkbox status (on/off).

I had a look at stackoverflow, such as https://stackoverflow.com/questions/51185296/reading-checkbox-pdf-with-python-on-a-filled-out-form or https://stackoverflow.com/questions/3984003/how-to-extract-pdf-fields-from-a-filled-out-form-in-python Unfortunately, the answers were not as easy as the simple get_form_text_fields() api function, for which you do not have to understand the pdf structure, and dig too much in the code.

This is why I added this function.

I use it with get_form_fields_in_list(['/Tx', '/Btn']), so that I have text and checkbox values in the result.

Of course, I could have added a new function such as get_form_checkbox_fields(), but I did not want to add too many methods ('cos then someone would like to have get_form_text_and_checkbox_fields() too, plus others I am not aware of).

pascal-brand38 avatar Aug 11 '22 15:08 pascal-brand38

I would actually rather remove get_form_text_fields from PyPDF2 completely as I think this can easily done by PyPDF2 users.

Instead, we might want to extend the help of get_fields educating the user ... or return something more meaningful than just a dictionary.

I'm uncertain about this.

MartinThoma avatar Aug 12 '22 18:08 MartinThoma

I would neither remove get_form_text_fields() nor change its returned format.

Following https://github.com/py-pdf/PyPDF2/discussions/850 discussion, I am on the same line than @MasterOdin and @sekrause, that is legacy is important and API break is a pain for people using the library and people reading example code (for example from stackoverflow - old correct answers would not work anymore).

I guess that such consideration should be discussed with a broader range of people, both people updating pyPDF2 and people using it. I would be happy to participate.

Anyway, it is bit beyond this PR I suggest. Feel free to close this PR without merging it.

pascal-brand38 avatar Aug 13 '22 16:08 pascal-brand38

I agree that we should hear more opinions :+1: I will leave this PR open for discussions.

@MasterOdin / @pubpub-zz : As always, I'm very interested in your opinion :-)

MartinThoma avatar Aug 13 '22 17:08 MartinThoma

Agree with @pascal-brand38 About keeping legacy interfaces

pubpub-zz avatar Aug 13 '22 17:08 pubpub-zz

@pascal-brand38 Thank you for working on improving PyPDF2, but I've finally decided to close this PR.

The PR added get_form_fields_in_list which can be easily implemented by the users:

def get_form_fields_in_list(reader):
    formfields = reader.get_fields()
    return {
            formfields[field]["/T"]: formfields[field].get("/V")
            for field in formfields
            if formfields[field].get("/FT") in ft_list
        }

The user still needs to have knowledge about the types of fields, meaning that this function doesn't make the interaction with PDF files simpler. Those two arguments combined lead me to the conclusion that it's better to not extend the public interface of PyPDF2 with this method (for the moment at least).

MartinThoma avatar Dec 10 '22 08:12 MartinThoma