pdf-reader Easiest way to get form fields from a pdf

I'm trying to parse standard documents like W-9 forms (https://www.irs.gov/pub/irs-pdf/fw9.pdf). I want to extract the name, which is the first form field that someone fills in. What's the easiest way to do that?

I've tried doing:

reader = PDF::Reader.new("W9.pdf")
objects = reader.objects
result = objects.deref!(reader.pages[0].attributes[:Annots])

When I look at result for a bunch of different W-9s that have been filled in, there doesn't seem to be a single structure in the result variable that I can use to find the name. I know the name is always going to be the first form field; is there an easy way to search for that?

Jun 30 '22 00:06 arahman4710

I'm confident that pdf-reader can deserialize the data you're after, but unfortunately I'm not personally very familiar with PDF forms or how the fields or data are stored.

The PDF spec says the optional Annots page attribute is an array of dictionaries, each dictionary is a single annotation and can have different properties depending on the type (link, line, square, circle, underline, file, sound, movie, 3d, etc). It sounds like forms use the :Widget annotation type:

Interactive forms (see 12.7, "Forms") use widget annotations (PDF 1.2) to represent the appearance of fields and to manage user interactions. As a convenience, when a field has only a single associated widget annotation, the contents of the field dictionary (12.7.4, "Field dictionaries") and the annotation dictionary may be merged into a single dictionary containing entries that pertain to both a field and an annotation.

Maybe filtering the :Annots array down to just :Widget annotations will yield some useful results?
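
A minimal sketch of that filtering step, assuming the annotations have already been resolved into plain Ruby hashes (for example with objects.deref!, as in the question). The helper name widget_annotations is hypothetical and not part of pdf-reader, and the sample data is made up:

```ruby
# Hypothetical helper (not part of pdf-reader): keep only the widget
# annotations from an already-dereferenced :Annots array.
def widget_annotations(annots)
  # :Annots is optional, so a page without annotations yields nil
  Array(annots).select do |annot|
    annot.is_a?(Hash) && annot[:Subtype] == :Widget
  end
end

# Made-up sample data standing in for a dereferenced :Annots array
sample_annots = [
  { :Subtype => :Link, :Rect => [0, 0, 10, 10] },
  { :Subtype => :Widget, :T => "f1_1[0]", :V => "Vendor name " },
]

widget_annotations(sample_annots).each do |widget|
  puts "#{widget[:T]}: #{widget[:V]}"
end
```

With a real file, the input would be the result of reader.objects.deref!(page.attributes[:Annots]) for each page.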

I see there's also an :AcroForm property at the document level that might have some interesting data. Unfortunately it's not currently exposed by pdf-reader, but I'd happily accept a PR that adds it.

Something like this:

diff --git a/lib/pdf/reader.rb b/lib/pdf/reader.rb
index 22aea3d..8c3266b 100644
--- a/lib/pdf/reader.rb
+++ b/lib/pdf/reader.rb
@@ -142,6 +142,12 @@ def metadata
       end
     end
 
+    # Return a Hash with interactive form details from this file. Not always present
+    #
+    def acroform
+      @objects.deref_hash(root[:AcroForm])
+    end

Would allow:

PDF::Reader.open("somefile.pdf") do |pdf|
  puts pdf.acroform
end

Jul 02 '22 01:07 yob

Gotcha, filtering down to only the :Widget annotations is helpful, but even when I do that, the value can apparently be nested inside a :Parent attribute instead:

:Parent=>{:FT=>:Tx, :Ff=>8388608, :Kids=>[{...}], :T=>"topmostSubform[0].Page1[0].f1_1[0]", :V=>"Vendor name "}

On one PDF, I was able to just look at the :V attribute on a :Widget annotation, but in the case above it looks like I need to look at the :Parent attribute, and I'm unsure when there'll be a :Parent I should look at and when there won't.
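
One way to handle both cases is to check the widget first and then walk up the :Parent chain until the attribute turns up. This fits the spec passage quoted above: a field with a single widget may be merged into one dictionary, otherwise the widget is a kid of a separate field dictionary. The sketch below assumes dereferenced plain hashes; inherited_field_attr is a hypothetical name, not a pdf-reader method, and the sample data is made up:

```ruby
# Hypothetical helper (not part of pdf-reader): find a field attribute on a
# widget, falling back to its :Parent chain when the widget lacks it.
def inherited_field_attr(dict, key, max_depth = 32)
  current = dict
  max_depth.times do
    return nil unless current.is_a?(Hash)
    return current[key] if current.key?(key)
    current = current[:Parent]
  end
  nil # gave up: the chain is deeper than max_depth (or cyclic)
end

# Made-up data: one merged field/widget, one widget whose value is on :Parent
merged = { :Subtype => :Widget, :T => "f1_2[0]", :V => "Business name" }
parent = { :FT => :Tx, :T => "topmostSubform[0].Page1[0].f1_1[0]", :V => "Vendor name " }
child  = { :Subtype => :Widget, :Parent => parent }

puts inherited_field_attr(merged, :V)
puts inherited_field_attr(child, :V)
```

The depth limit is only a safety net, since dereferenced :Parent and :Kids entries can point at each other.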

Jul 02 '22 23:07 arahman4710

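require 'base64'
require 'stringio'
require 'pdf-reader'

# params is assumed to come from the surrounding web request
pdf = Base64.decode64(params['pdf'])

reader = PDF::Reader.new(StringIO.new(pdf))
reader.pages.each do |page|
  # :Annots is optional, so fall back to an empty list
  annots = page.objects.deref!(page.attributes[:Annots]) || []
  annots.each do |annot|
    puts annot[:T] # field name
    puts annot[:V] # field value
  end
end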

Sep 08 '22 01:09 Michael1969

Get all fields from a file using the low level API:

# pluck and compact_blank come from ActiveSupport (Rails), not core Ruby
fields_from_pdf_form = PDF::Reader.new(file).pages.map do |page|
  page.objects.deref!(page.attributes[:Annots])&.pluck(:T)
end.flatten.compact_blank

UPDATE: this does not return all fields; it skips radio button groups. They are present in the Annots, though, so I still need to find a way to collect them.
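
A likely cause (an assumption, not verified against these files) is that the individual radio button widgets carry no :T of their own and the name sits on their :Parent. Under that assumption, a sketch that builds the name from the widget and its ancestors would pick the groups up. qualified_field_name is a hypothetical helper, not part of pdf-reader, and the data is made up:

```ruby
# Hypothetical helper (not part of pdf-reader): build a field name for a
# widget by joining every :T found on the widget and its :Parent chain.
def qualified_field_name(widget, max_depth = 32)
  parts = []
  current = widget
  max_depth.times do
    break unless current.is_a?(Hash)
    parts.unshift(current[:T]) if current[:T]
    current = current[:Parent]
  end
  parts.empty? ? nil : parts.join(".")
end

# Made-up data: a radio group whose two button widgets have no :T of their own
group   = { :FT => :Btn, :T => "c1_1", :V => :"2" }
buttons = [
  { :Subtype => :Widget, :Parent => group, :AS => :Off },
  { :Subtype => :Widget, :Parent => group, :AS => :"2" },
]
text = { :Subtype => :Widget, :T => "f1_1" }

names = (buttons + [text]).map { |w| qualified_field_name(w) }.compact.uniq
puts names.inspect
```

The uniq call collapses the buttons of one group into a single name.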

Feb 02 '23 17:02 ruinunes

Hello everybody, I came up with this script to extract acrofields:

require 'pdf-reader'

filename = ARGV[0]

# Check if the filename is provided
if filename.nil?
  puts "Please provide a PDF file name."
  exit 1
end

reader = PDF::Reader.new(filename)

# Look up the catalog (root object) of the PDF via the trailer's indirect reference
catalog = reader.objects.deref(reader.objects.trailer[:Root])
acroform_ref = catalog[:AcroForm]

# Exit if AcroForm is not found
if acroform_ref.nil?
  puts "No AcroForm found in the PDF."
  exit
end

# deref returns direct objects unchanged, so this works whether AcroForm is
# stored inline or as an indirect reference
acroform = reader.objects.deref(acroform_ref)

# Check if AcroForm is present and has Fields
if acroform && acroform[:Fields]
  reader.objects.deref(acroform[:Fields]).each do |field_ref|
    field = reader.objects.deref(field_ref)

    # Check if it's an AcroField with a name
    next unless field && field[:T]

    field_name = field[:T]
    # The position (Rect) might not be directly available in the field object
    field_rect = field[:Rect]

    puts "Field '#{field_name}' at position #{field_rect}"
  end
else
  puts "No AcroFields found."
end

This seems to work. I thought it might be useful for you as well.
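
One caveat: the script above only visits the top-level entries of :Fields. The :Kids and :Parent entries shown earlier in this thread suggest that some forms nest their fields, in which case a recursive walk is needed. The sketch below assumes the field tree has already been dereferenced into plain hashes (for example with reader.objects.deref!(acroform[:Fields])); each_terminal_field is a hypothetical name, not a pdf-reader method, and the sample tree is made up:

```ruby
# Hypothetical helper (not part of pdf-reader): walk a dereferenced field tree
# and yield [qualified_name, value] for every field that has no named kids.
def each_terminal_field(fields, prefix = nil, inherited_value = nil, depth = 0, &block)
  return if depth > 32 # guard against malformed, cyclic :Kids
  Array(fields).each do |field|
    next unless field.is_a?(Hash)
    name  = [prefix, field[:T]].compact.join(".")
    value = field.key?(:V) ? field[:V] : inherited_value
    # Kids without :T are treated as plain widgets, not as sub-fields
    named_kids = Array(field[:Kids]).select { |kid| kid.is_a?(Hash) && kid[:T] }
    if named_kids.empty?
      block.call(name, value) unless name.empty?
    else
      each_terminal_field(named_kids, name, value, depth + 1, &block)
    end
  end
end

# Made-up tree: a text field and a radio group nested two levels deep
fields = [
  { :T => "topmostSubform[0]", :Kids => [
    { :T => "Page1[0]", :Kids => [
      { :T => "f1_1[0]", :FT => :Tx, :V => "Vendor name " },
      { :T => "c1_1", :FT => :Btn, :V => :"2", :Kids => [
        { :Subtype => :Widget, :AS => :Off },
        { :Subtype => :Widget, :AS => :"2" },
      ] },
    ] },
  ] },
]

results = []
each_terminal_field(fields) { |name, value| results << [name, value] }
results.each { |name, value| puts "#{name} = #{value.inspect}" }
```

Because the walk follows :Kids in order, the first yielded pair would be the first field in the tree, which may or may not match the visual order on the page.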

Keep up the good work everybody :-)

Dec 07 '23 14:12 cokron