getFields returning '/V': {'/Filter': ['/FlateDecode']} instead of form field contents on long field

Open Fjacquette opened this issue 7 years ago • 2 comments

Greetings,

I have a PDF form with a text field that sometimes contains a large amount of text (> 100K). When the amount of text is small, getFields in PdfFileReader returns the form field name and contents as expected. However, when the text gets very large, the field comes back with '/V': {'/Filter': ['/FlateDecode']} instead. I have confirmed that the large chunk of text is indeed in the PDF file as expected, but I can't figure out why getFields is returning that dictionary instead of the value.

I'm attaching a very simple PDF file that contains enough text in a single form to demonstrate the issue.

Simple form.pdf

This code snippet demonstrates how I'm trying to use the data from that field:

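    from PyPDF2 import PdfFileReader  # legacy PyPDF2 1.x API

    # local_filename and status() are defined elsewhere in my script

    # Extract the fields from the PDF
    input_file = PdfFileReader(open(local_filename, "rb"))

    # Print how many pages the input has
    status(local_filename + " has " + str(input_file.getNumPages()) + " pages.\n")

    # Get the form fields from the input file; returns a dict type
    my_fields = None
    try:
        my_fields = input_file.getFields()
    except Exception as myerr:
        status("[ERROR]: " + str(myerr) + "\n")

    # Skip the lookup if reading the fields failed; convert the field to str before concatenating
    if my_fields and "TimeStampData" in my_fields:
        my_time_stamps = my_fields["TimeStampData"]
        print("my_time_stamps  = " + str(my_time_stamps) + "\n")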

The output I get from that snippet is:

my_time_stamps = "/FT": "/Tx", "/T": "TimeStampData", "/Ff": 4096, "/V": {"/Filter": ["/FlateDecode"]}

Fjacquette, Apr 25 '18 18:04

Further investigation has taken me down into the depths of the stream reading code. getFields calls _buildField, which calls Field's initializer, which calls getObject, which calls readObject, which calls readFromStream... it seems like either readObject or readFromStream should recognize that the next thing being read is deflated, inflate it, and pass it back up the chain.

My head is spinning.

Fjacquette, Apr 25 '18 20:04

(Continued notes to self and anyone who takes a look at it.)

The very long form field text is contained in an indirect object. The call stack looks like this:

PdfFileReader.getFields -> PdfFileReader._buildField -> Field.__init__ -> DictionaryObject.__getitem__ -> IndirectObject.getObject

getObject is correctly reading the data in from the stream and sticking the very long text string into a _data property of the dictionary that's getting passed back up the stack. However, there's no point at which it checks for the /Filter entry or decodes the very long text string back into readable text. I'm sticking some code into Field.__init__ to do this and see if it works, but that seems like the wrong place to do this.
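For anyone following along, the missing decode step can be sketched in isolation. This is only an illustration, not the library's code: stream_dict, raw_data and decode_stream are hypothetical stand-ins for the dictionary and _data property described above, and UTF-8 is an assumed text encoding (a real PDF may use something else). The only real API used is the standard zlib module, which implements the compression behind /FlateDecode.

```python
import zlib

# Hypothetical stand-ins: a dictionary that declares its filter, plus the raw
# (still compressed) bytes that would sit in the _data property.
long_text = "2018-04-25 18:04:00 sample entry\n" * 5000
stream_dict = {"/Filter": ["/FlateDecode"]}
raw_data = zlib.compress(long_text.encode("utf-8"))


def decode_stream(stream_dict, raw_data):
    """Apply each declared filter in order; pass the data through if none."""
    filters = stream_dict.get("/Filter", [])
    if isinstance(filters, str):  # /Filter may be a single name or a list
        filters = [filters]
    data = raw_data
    for name in filters:
        if name == "/FlateDecode":
            data = zlib.decompress(data)
        else:
            raise NotImplementedError("unsupported filter: " + name)
    return data


# Assumption: the field text is UTF-8 once it has been inflated.
decoded = decode_stream(stream_dict, raw_data).decode("utf-8")
print(len(raw_data), "compressed bytes ->", len(decoded), "characters")
```

Where a check like this belongs in the call stack is a separate question; the sketch only shows what the check and the decode would have to do.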

Fjacquette, Apr 26 '18 20:04

This is interesting because I'm running into a very similar issue in 2023. I was hoping this thread would contain a conclusion. Don't suppose you ever found a solution?

awxk, Jan 10 '23 16:01

My apologies, but I did not. I haven't worked with that code since 2018, so at this point it's all fled my brain. Best wishes figuring it out; from the tags added to this thread it looks like it's an unaddressed bug.

Fjacquette, Jan 10 '23 17:01

> This is interesting because I'm running into a very similar issue in 2023. I was hoping this thread would contain a conclusion. Don't suppose you ever found a solution?

The "/Filter" key indicates a content stream (you should confirm that using the type function). You will be able to get the data from the compressed stream using get_data(): this will return bytes that you can convert to a string with decode().
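A minimal sketch of that suggestion, under a few assumptions: field_value_as_text is a hypothetical helper name, UTF-8 is an assumed encoding for the field text, and the commented usage assumes pypdf is installed and that the attached file and the "TimeStampData" field name match the original report.

```python
def field_value_as_text(value, encoding="utf-8"):
    """Return a form field's /V entry as a str, decoding stream objects."""
    if hasattr(value, "get_data"):
        # Stream object (the case that shows up as {'/Filter': [...]})
        return value.get_data().decode(encoding, errors="replace")
    if isinstance(value, bytes):
        return value.decode(encoding, errors="replace")
    # Ordinary short values are already string-like
    return str(value)


# Usage with pypdf (assumption: file and field name as in the original report):
# from pypdf import PdfReader
# reader = PdfReader("Simple form.pdf")
# field = reader.get_fields()["TimeStampData"]
# print(type(field["/V"]))                       # confirm it is a stream object
# print(field_value_as_text(field["/V"])[:200])  # first 200 characters
```

The helper checks for get_data by duck typing rather than importing a pypdf class, so short fields that come back as plain strings go through the same call.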

pubpub-zz, Jan 10 '23 17:01

The fix was just merged and will be part of pypdf>3.4.1

MartinThoma, Feb 25 '23 05:02