getFields returning '/V': {'/Filter': ['/FlateDecode']} instead of form field contents on long field
Greetings,
I have a PDF form with a text field that sometimes contains a large amount of text (> 100K). When the amount of text is small, getFields in PdfFileReader returns the form field name and contents as expected. However, when the text gets very large, it comes back with '/V': {'/Filter': ['/FlateDecode']} instead. I have confirmed that the large chunk of text is indeed in the PDF file as expected, but I can't figure out why it's pulling back that dictionary instead of the value.
I'm attaching a very simple PDF file that contains enough text in a single form to demonstrate the issue.
This code snippet demonstrates how I'm trying to use the data from that field:
from PyPDF2 import PdfFileReader

# Extract the fields from the PDF
input_file = PdfFileReader(open(local_filename, "rb"))
# Print how many pages the input has
status(local_filename + " has " + str(input_file.getNumPages()) + " pages.\n")
# Get the form fields from the input file; returns a dict (or None if there are none)
my_fields = None
try:
    my_fields = input_file.getFields()
except Exception as myerr:
    status("[ERROR]: " + str(myerr) + "\n")
if my_fields and "TimeStampData" in my_fields:
    my_time_stamps = my_fields["TimeStampData"]
    # The field is a dict-like object, so convert it to str before concatenating
    print("my_time_stamps = " + str(my_time_stamps) + "\n")
The output I get from that snippet is:
my_time_stamps = "/FT": "/Tx", "/T": "TimeStampData", "/Ff": 4096, "/V": {"/Filter": ["/FlateDecode"]}
Further investigation has taken me down into the depths of the stream reading code. getFields calls _buildField, which calls Field's initializer, which calls getObject, which calls readObject, which calls readFromStream... it seems like either readObject or readFromStream should recognize that the next thing being read is deflated, inflate it, and pass it back up the chain.
My head is spinning.
(Continued notes to self and anyone who takes a look at it.)
The very long form field text is contained in an indirect object. The call stack looks like this:
PdfFileReader.getFields
PdfFileReader._buildField
Field.__init__
DictionaryObject.__getitem__
IndirectObject.getObject
getObject is correctly reading the data in from the stream and sticking the very long text string into a _data property of the dictionary that's getting passed back up the stack. However, there's no point at which it checks for the /Filter entry or decodes the very long text string back into readable text. I'm sticking some code into Field.__init__ to do this and see if it works, but that seems like the wrong place to do it.
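For anyone following along, the decoding step that seems to be missing can be sketched with the standard library alone. This assumes the raw stream bytes are what ends up in _data and that /FlateDecode is the only filter applied. The helper name decode_flate_text and the latin-1 fallback (a rough stand-in for PDFDocEncoding) are illustrative assumptions, not PyPDF2 code.

```python
import zlib


def decode_flate_text(raw: bytes) -> str:
    """Inflate a /FlateDecode stream body and convert it to text.

    Hypothetical helper for illustration; not part of PyPDF2.
    """
    # /FlateDecode streams are zlib-compressed
    data = zlib.decompress(raw)
    # PDF text strings starting with the FE FF byte order mark are UTF-16BE
    if data[:2] == b"\xfe\xff":
        return data.decode("utf-16")
    # Assumption: treat everything else as latin-1, which only
    # approximates PDFDocEncoding
    return data.decode("latin-1")


# Simulate a compressed field value and decode it again
compressed = zlib.compress(b"12:00:01 start\n" * 3)
print(decode_flate_text(compressed))
```

A stream with several filters, or with a filter other than /FlateDecode, would need more than this.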
This is interesting because I'm running into a very similar issue in 2023. I was hoping this thread would contain a conclusion. Don't suppose you ever found a solution?
My apologies, but I did not. I haven't worked with that code since 2018, so at this point it's all fled my brain. Best wishes figuring it out; from the tags added to this thread it looks like it's an unaddressed bug.
The "/Filter" entry indicates that the value is a content stream (you can confirm this by calling type() on it).
You can retrieve its data with get_data(), which decodes the stream and returns bytes that you can convert to a string with decode().
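A minimal sketch of that workaround, assuming the stream object exposes get_data() as described above. The helper field_value_as_text, the FakeStream stand-in, and the utf-8 default are assumptions made for illustration; the right encoding depends on how the form was written.

```python
def field_value_as_text(value, encoding="utf-8"):
    """Return a field's /V entry as str, reading stream data when needed.

    Hypothetical helper; duck-typed so it does not import any PDF library.
    """
    if hasattr(value, "get_data"):
        # Stream objects expose get_data(), which returns the stream contents
        data = value.get_data()
        if isinstance(data, bytes):
            return data.decode(encoding, errors="replace")
        return str(data)
    # Short values are ordinary string objects already
    return str(value)


class FakeStream:
    """Stand-in for a stream object, used only to exercise the helper."""

    def get_data(self):
        return b"a very long field value"


print(field_value_as_text(FakeStream()))
print(field_value_as_text("short value"))
```

With a real file, the value passed in would be the "/V" entry of the field returned for "TimeStampData".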
The fix was just merged and will be part of pypdf>3.4.1