Form-Filled PDF extractions
Question
How can I ensure that form-filled data is present in the images of the PDF pages?
Hi there,
I am trying to use Docling as part of an attribute extraction framework, and I need to handle attributes entered in form-filled PDFs. I have seen that the form-filled data can be extracted when exporting to Markdown, using the following pipeline parameters in a Python implementation:
# Set up pipeline options with the given resolution
pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.do_table_structure = True
pipeline_options.table_structure_options.do_cell_matching = True
pipeline_options.images_scale = resolution
pipeline_options.generate_page_images = True
pipeline_options.generate_picture_images = True
pipeline_options.table_structure_options.mode = TableFormerMode.ACCURATE
pipeline_options.ocr_options = RapidOcrOptions()

# Initialize document converter
doc_converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
)

# Convert the input file
conversion_result = doc_converter.convert(input_file)

# Save the JSON representation of the document
docling_doc = conversion_result.document
json_output_path = os.path.join(docling_folder, "doc.json")
with open(json_output_path, "w") as fp:
    fp.write(json.dumps(docling_doc.export_to_dict()))

# Save the Markdown file
markdown_content = conversion_result.document.export_to_markdown(image_mode='EMBEDDED')
markdown_output_path = os.path.join(markdown_folder, f"{pdf_name}.md")
with open(markdown_output_path, "w") as fp:
    fp.write(markdown_content)

# Save images for each page
for page_no, page in conversion_result.document.pages.items():
    page_image_filename = os.path.join(image_folder, f"{pdf_name}-page-{page_no}.png")
    with open(page_image_filename, "wb") as fp:
        page.image.pil_image.save(fp, format="PNG")
I have found that:
pipeline_options.table_structure_options.do_cell_matching = True
means the form-filled data is present in the Markdown (even though the form-filled part of this PDF is not a table).
However, when I extract images of the pages of the PDF, this form filled data is missing, and I am missing all the attributes I am looking to extract.
Is there a way that I can ensure that the form filled data will be present in the images of the pdf pages? Are there parameters in the pipeline that would enable this?
Thanks
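One possible workaround, sketched under assumptions and not confirmed by the Docling maintainers in this thread: render the page images directly with pypdfium2 after initializing its form environment, instead of relying on the page images Docling generates. The helper names `page_image_name` and `render_pages_with_forms` are hypothetical, the sketch assumes pypdfium2 and Pillow are installed, and whether the filled values show up still depends on how the PDF stores its form widgets.

```python
from pathlib import Path


def page_image_name(pdf_name: str, page_no: int) -> str:
    """Mirror the file naming used in the question: <pdf_name>-page-<n>.png."""
    return f"{pdf_name}-page-{page_no}.png"


def render_pages_with_forms(input_file: str, image_folder: str, scale: float = 1.0) -> list:
    """Render every page to PNG with form widgets drawn (hypothetical helper)."""
    # Imported lazily so page_image_name() works without pypdfium2 installed.
    import pypdfium2 as pdfium

    out_dir = Path(image_folder)
    out_dir.mkdir(parents=True, exist_ok=True)
    pdf_name = Path(input_file).stem

    pdf = pdfium.PdfDocument(input_file)
    # The form environment must be initialized before any page is loaded;
    # without it, pdfium does not draw form widgets.
    pdf.init_forms()

    written = []
    for index in range(len(pdf)):
        page = pdf[index]
        # may_draw_forms only has an effect once init_forms() was called.
        bitmap = page.render(scale=scale, may_draw_forms=True)
        # Page numbers are written 1-based, assumed to match the keys
        # used in the question's page loop.
        target = out_dir / page_image_name(pdf_name, index + 1)
        bitmap.to_pil().save(target, format="PNG")
        written.append(target)
    return written
```

Calling `render_pages_with_forms('sample_pdf.pdf', 'outputs/github_test/images', scale=1.0)` would write one PNG per page. This bypasses Docling entirely for the images, so the `scale` value here is only assumed to be roughly comparable to `images_scale`.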
@jackdorney1999 Hi, can you please attach an example document and the minimal code to reproduce your issue? Thanks.
@cau-git here is the code that I am using for this:
# Standard library imports
import json
import logging
import os
import time
from pathlib import Path

# Docling imports
from docling.datamodel.pipeline_options import PdfPipelineOptions, RapidOcrOptions, TableFormerMode
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.base_models import InputFormat
from docling_core.types.doc import ImageRefMode, PictureItem, TableItem, DoclingDocument


def extraction_pipeline_rapid_ocr(input_file, output_dir, resolution):
    start_time = time.time()

    # Ensure output directory exists
    os.makedirs(output_dir, exist_ok=True)

    # Get the name of the PDF (without extension) for folder structure
    pdf_name = Path(input_file).stem
    pdf_output_dir = os.path.join(output_dir, pdf_name)

    # Create nested folders for the specific PDF with resolution
    docling_folder = os.path.join(pdf_output_dir, f'DoclingDocument_{resolution}')
    markdown_folder = os.path.join(pdf_output_dir, f'Markdown_{resolution}')
    image_folder = os.path.join(pdf_output_dir, f'Images_{resolution}')
    os.makedirs(docling_folder, exist_ok=True)
    os.makedirs(markdown_folder, exist_ok=True)
    os.makedirs(image_folder, exist_ok=True)

    # Set up pipeline options with the given resolution
    pipeline_options = PdfPipelineOptions()
    pipeline_options.do_ocr = True
    pipeline_options.do_table_structure = True
    pipeline_options.table_structure_options.do_cell_matching = True
    pipeline_options.images_scale = resolution
    pipeline_options.generate_page_images = True
    pipeline_options.generate_picture_images = True
    pipeline_options.table_structure_options.mode = TableFormerMode.ACCURATE
    pipeline_options.ocr_options = RapidOcrOptions()

    # Initialize document converter
    doc_converter = DocumentConverter(
        format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
    )

    # Convert the input file
    conversion_result = doc_converter.convert(input_file)

    # Save the JSON representation of the document
    docling_doc = conversion_result.document
    json_output_path = os.path.join(docling_folder, "doc.json")
    with open(json_output_path, "w") as fp:
        fp.write(json.dumps(docling_doc.export_to_dict()))

    # Save the Markdown file
    markdown_content = conversion_result.document.export_to_markdown(image_mode='EMBEDDED')
    markdown_output_path = os.path.join(markdown_folder, f"{pdf_name}.md")
    with open(markdown_output_path, "w") as fp:
        fp.write(markdown_content)

    # Save images for each page
    for page_no, page in conversion_result.document.pages.items():
        page_image_filename = os.path.join(image_folder, f"{pdf_name}-page-{page_no}.png")
        with open(page_image_filename, "wb") as fp:
            page.image.pil_image.save(fp, format="PNG")

    end_time = time.time() - start_time
    logging.info(f"Document converted and files exported in {end_time:.2f} seconds.")


extraction_pipeline_rapid_ocr(
    input_file='sample_pdf.pdf',
    output_dir='outputs/github_test',
    resolution=1.0
)
This also gave inconsistent markdown:
## Sample Fillable PDF Form
Fillable PDF forms can be customised to your needs. They allow form recipients to fill out information on screen like a web page form, then print, save or email the results.
Name
Date
Address
## Fillable Fields
What are your favourite activities? Reading Walking Music Other: /Yes /Yes
## Tick Boxes (multiple options can be selected)
What is your favourite activity? Reading Walking Music Other:
## Radio Buttons (only one option can be selected)
These buttons can be printable or visible only when onscreen.
## Buttons (to prompt certain actions)
Test 123
Jan
1 2012
1, springfield road, uk
<!-- image -->
Please find the example pdf, and the extracted image attached
sample_pdf.pdf
@jackdorney1999 The pdf parser should be able to extract text from the filled field as well as know if it comes from a filled out field. I will sync with @cau-git how we can propagate it through the docling pipeline.
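Until that is propagated through the pipeline, a cheap pre-check can at least flag which input files declare an interactive form, so they can be routed to special handling. The following is a rough heuristic using only the standard library, not part of Docling; the function names are hypothetical. It looks for the `/AcroForm` key in the raw bytes, so a `False` result is not conclusive: PDFs that store their catalog in a compressed object stream hide the key from a plain byte scan.

```python
from pathlib import Path


def declares_acroform(pdf_bytes: bytes) -> bool:
    """Heuristic: True if the raw PDF bytes contain an /AcroForm key.

    A True result suggests the document declares an interactive form.
    A False result is inconclusive, because the key may sit inside a
    compressed object stream that a byte scan cannot see.
    """
    return b"/AcroForm" in pdf_bytes


def file_declares_acroform(path: str) -> bool:
    """Apply the same heuristic to a file on disk."""
    return declares_acroform(Path(path).read_bytes())


# Tiny hand-written fragments standing in for real PDF content.
with_form = b"%PDF-1.7\n1 0 obj\n<< /Type /Catalog /AcroForm 5 0 R >>\nendobj\n"
without_form = b"%PDF-1.7\n1 0 obj\n<< /Type /Catalog >>\nendobj\n"

print(declares_acroform(with_form))
print(declares_acroform(without_form))
```

For a definitive answer, including the field names and their values, a full PDF parser is needed; this check only helps decide which files deserve a closer look.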
@PeterStaar-IBM Thanks for reaching out, please let me know if you require any further information from me