Easiest way to get form fields from a PDF
I'm trying to parse standard documents like W-9 forms (https://www.irs.gov/pub/irs-pdf/fw9.pdf). I want to extract the name, which is the first form field that someone fills in. What's the easiest way to do that?
I've tried doing:
reader = PDF::Reader.new("W9.pdf")
objects = reader.objects
result = objects.deref!(reader.pages[0].attributes[:Annots])
When I look at result for a bunch of different W-9s that have been filled in, there doesn't seem to be a single structure in the result variable that I can use to find the name. I know the name is always going to be the first form field. Is there an easy way to search for that?
I'm confident that pdf-reader can deserialize the data you're after, but unfortunately I'm not personally very familiar with PDF forms or how the fields or data are stored.
The PDF spec says the optional Annots page attribute is an array of dictionaries, each dictionary is a single annotation and can have different properties depending on the type (link, line, square, circle, underline, file, sound, movie, 3d, etc). It sounds like forms use the :Widget annotation type:
Interactive forms (see 12.7, "Forms") use widget annotations (PDF 1.2) to represent the appearance of fields and to manage user interactions. As a convenience, when a field has only a single associated widget annotation, the contents of the field dictionary (12.7.4, "Field dictionaries") and the annotation dictionary may be merged into a single dictionary containing entries that pertain to both a field and an annotation.
Maybe filtering the :Annots array down to just :Widget annotations will yield some useful results?
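A minimal sketch of that filter, assuming the :Annots array has already been dereferenced into plain Ruby hashes with symbol keys (which is how pdf-reader represents PDF dictionaries and names). The helper name `widget_annotations` and the sample data are made up for illustration:

```ruby
# Keep only widget annotations from a dereferenced :Annots array.
# `annots` may be nil because :Annots is optional on a page.
def widget_annotations(annots)
  (annots || []).select { |a| a.is_a?(Hash) && a[:Subtype] == :Widget }
end

# Hypothetical sample: one link annotation and one form widget.
sample = [
  { Type: :Annot, Subtype: :Link },
  { Type: :Annot, Subtype: :Widget, T: "f1_1[0]", V: "Vendor name " }
]

widget_annotations(sample).each { |w| puts "#{w[:T]}: #{w[:V]}" }
```

With a real file, the argument would be something like `page.objects.deref!(page.attributes[:Annots])` for each page.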
I see there's also an :AcroForm property at the document level that might have some interesting data. Unfortunately it's not currently exposed by pdf-reader, but I'd happily accept a PR that adds it.
Something like this:
diff --git a/lib/pdf/reader.rb b/lib/pdf/reader.rb
index 22aea3d..8c3266b 100644
--- a/lib/pdf/reader.rb
+++ b/lib/pdf/reader.rb
@@ -142,6 +142,12 @@ def metadata
end
end
+ # Return a Hash with interactive form details from this file. Not always present
+ #
+ def acroform
+ @objects.deref_hash(root[:AcroForm])
+ end
Would allow:
PDF::Reader.open("somefile.pdf") do |pdf|
  puts pdf.acroform
end
Gotcha. Filtering down to only the :Widget annotations helps, but even then the value can be nested under the :Parent attribute:
:Parent=>{:FT=>:Tx, :Ff=>8388608, :Kids=>[{...}], :T=>"topmostSubform[0].Page1[0].f1_1[0]", :V=>"Vendor name "}
On one PDF I was able to just read the :V attribute on a :Widget annotation, but in the case above it looks like I need to read it from :Parent instead. I'm unsure when there will be a :Parent I should look at and when there won't.
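require "base64"
require "stringio"
require "pdf-reader"
# params['pdf'] holds the Base64-encoded PDF (e.g. in a Rails controller)
pdf = Base64.decode64(params['pdf'])
reader = PDF::Reader.new(StringIO.new(pdf))
reader.pages.each do |page|
  # :Annots is optional, so deref! returns nil on pages without annotations
  annots = page.objects.deref!(page.attributes[:Annots]) || []
  annots.each do |r|
    puts r[:T]
    puts r[:V]
  end
end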
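On the question of when to look at :Parent: the PDF spec marks field entries such as :V and :FT as inheritable, so one reasonable rule is to read the entry from the widget when it is there and otherwise walk up the :Parent chain. The sketch below works on plain hashes as returned by `deref!`. The helper names `inherited_entry` and `qualified_name` are hypothetical, not part of pdf-reader, and the sample data is modelled on the output quoted above:

```ruby
# Look up an inheritable entry (for example :V or :FT) on a field hash,
# falling back to its ancestors through :Parent.
def inherited_entry(field, key)
  node = field
  while node.is_a?(Hash)
    return node[key] if node.key?(key)
    node = node[:Parent]
  end
  nil
end

# Build the fully qualified field name by joining every partial name (:T)
# from the root ancestor down to the field itself.
def qualified_name(field)
  parts = []
  node = field
  while node.is_a?(Hash)
    parts.unshift(node[:T]) if node[:T]
    node = node[:Parent]
  end
  parts.join(".")
end

# Hypothetical sample: the widget has no :T or :V of its own, they live
# on the parent field.
parent = { FT: :Tx, Ff: 8388608, T: "topmostSubform[0].Page1[0].f1_1[0]", V: "Vendor name " }
widget = { Type: :Annot, Subtype: :Widget, Parent: parent }

puts qualified_name(widget)
puts inherited_entry(widget, :V)
```

The same two helpers also cover the merged case, where the widget and the field share one dictionary and :V sits directly on the widget.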
Get all field names from a file using the low-level API (pluck and compact_blank come from ActiveSupport):
fields_from_pdf_form = PDF::Reader.new(file).pages.map do |page|
  page.objects.deref!(page.attributes[:Annots])&.pluck(:T)
end.flatten.compact_blank
UPDATE: this does not return all fields. It skips radio button groups, even though they are present in the Annots. I still need to find a way to collect those.
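One possible reason for the missing radio groups, stated here as an assumption rather than something verified against these files: the individual radio button widgets usually carry no :T of their own, and the group name sits on their :Parent. A rough sketch that falls back to the parent's name, using only plain Ruby so it does not need ActiveSupport (the helper name `field_names` and the sample data are hypothetical):

```ruby
# Collect field names from a dereferenced :Annots array, using the parent's
# :T when a widget has none of its own (as with radio button kids).
def field_names(annots)
  (annots || [])
    .select { |a| a.is_a?(Hash) && a[:Subtype] == :Widget }
    .map { |a| a[:T] || (a[:Parent].is_a?(Hash) ? a[:Parent][:T] : nil) }
    .compact
    .uniq
end

# Hypothetical sample: one text field plus a two-button radio group.
group = { FT: :Btn, T: "filing_status", V: :single }
annots = [
  { Subtype: :Widget, FT: :Tx, T: "name", V: "Vendor name " },
  { Subtype: :Widget, AS: :single, Parent: group },
  { Subtype: :Widget, AS: :Off, Parent: group }
]

p field_names(annots)
```

The `uniq` call is there because every button in a group resolves to the same parent name.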
Hello everybody, I came up with this script to extract acrofields:
require 'pdf-reader'
filename = ARGV[0]
# Check if the filename is provided
if filename.nil?
  puts "Please provide a PDF file name."
  exit 1
end
reader = PDF::Reader.new(filename)
objects = reader.objects
# Access the catalog (root object) of the PDF. deref follows an indirect
# reference and returns a direct object unchanged.
catalog = objects.deref(objects.trailer[:Root])
acroform = objects.deref(catalog[:AcroForm])
# Exit if AcroForm is not found
if acroform.nil?
  puts "No AcroForm found in the PDF."
  exit
end
# Check that the AcroForm has Fields
fields = objects.deref(acroform[:Fields])
if fields
  fields.each do |field_ref|
    field = objects.deref(field_ref)
    # Check if it's an AcroField with a name
    next unless field && field[:T]
    field_name = field[:T]
    # The position (Rect) might not be directly available in the field object
    field_rect = field[:Rect]
    puts "Field '#{field_name}' at position #{field_rect}"
  end
else
  puts "No AcroFields found."
end
This seems to work. I thought it might be useful for you as well.
Keep up the good work everybody :-)
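One caveat with the script above: it only lists the top-level entries of :Fields, while forms like the W-9 nest their fields under :Kids. A hedged sketch of a recursive walk follows. It assumes the :Fields array has been fully dereferenced into plain hashes, treats kids that have a :T as child fields and kids without one as plain widgets, and uses a hypothetical helper name (`each_terminal_field`) and made-up sample data:

```ruby
# Recursively walk a dereferenced :Fields array and yield each terminal
# field together with its fully qualified name.
def each_terminal_field(fields, prefix = nil, &block)
  (fields || []).each do |field|
    next unless field.is_a?(Hash)
    name = [prefix, field[:T]].compact.join(".")
    child_fields = (field[:Kids] || []).select { |k| k.is_a?(Hash) && k.key?(:T) }
    if child_fields.empty?
      block.call(name, field)
    else
      each_terminal_field(child_fields, name, &block)
    end
  end
end

# Hypothetical sample: a nested hierarchy similar to the W-9 field names.
fields = [
  { T: "topmostSubform[0]", Kids: [
    { T: "Page1[0]", Kids: [
      { T: "f1_1[0]", FT: :Tx, V: "Vendor name " },
      { T: "c1_1[0]", FT: :Btn, V: :"1", Kids: [{ Subtype: :Widget }, { Subtype: :Widget }] }
    ] }
  ] }
]

values = {}
each_terminal_field(fields) { |name, field| values[name] = field[:V] }
values.each { |name, value| puts "#{name} = #{value.inspect}" }
```

With pdf-reader, the input could plausibly be obtained with `reader.objects.deref!(acroform[:Fields])`, since `deref!` resolves nested references as well.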