docling Form-Filled PDF extractions

Question

How can I ensure that form-filled data is present in the images of the PDF pages?

Hi there,

I am trying to use Docling as part of an attribute extraction framework, and I need to handle attributes that have been entered into form-filled PDFs. I have seen that the form-filled data can be extracted when exporting to Markdown, using the following pipeline parameters in a Python implementation:

# Set up pipeline options with the given resolution
pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.do_table_structure = True
pipeline_options.table_structure_options.do_cell_matching = True
pipeline_options.images_scale = resolution
pipeline_options.generate_page_images = True
pipeline_options.generate_picture_images = True
pipeline_options.table_structure_options.mode = TableFormerMode.ACCURATE
pipeline_options.ocr_options = RapidOcrOptions()

# Initialize document converter
doc_converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
)

# Convert the input file
conversion_result = doc_converter.convert(input_file)

# Save the JSON representation of the document
docling_doc = conversion_result.document
json_output_path = os.path.join(docling_folder, "doc.json")
with open(json_output_path, "w") as fp:
    fp.write(json.dumps(docling_doc.export_to_dict()))

# Save the Markdown file
markdown_content = conversion_result.document.export_to_markdown(image_mode='EMBEDDED')
markdown_output_path = os.path.join(markdown_folder, f"{pdf_name}.md")
with open(markdown_output_path, "w") as fp:
    fp.write(markdown_content)

# Save images for each page
for page_no, page in conversion_result.document.pages.items():
    page_image_filename = os.path.join(image_folder, f"{pdf_name}-page-{page_no}.png")
    with open(page_image_filename, "wb") as fp:
        page.image.pil_image.save(fp, format="PNG")

I have found that:

pipeline_options.table_structure_options.do_cell_matching = True

means the form-filled data is present in the Markdown (even though the form-filled part of this PDF is not a table).

However, when I extract images of the PDF pages, the form-filled data is missing from them, so none of the attributes I am looking to extract are available.

Is there a way to ensure that the form-filled data is present in the images of the PDF pages? Are there pipeline parameters that would enable this?

Thanks

Jan 03 '25 17:01 jackdorney1999

@jackdorney1999 Hi, can you please attach an example document and the minimal code to reproduce your issue? Thanks.

Jan 06 '25 11:01 cau-git

@cau-git here is the code that I am using for this:

# Standard library imports used by the function below
import json
import logging
import os
import time
from pathlib import Path

# Docling imports
from docling.datamodel.pipeline_options import PdfPipelineOptions, RapidOcrOptions, TableFormerMode
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.base_models import InputFormat
from docling_core.types.doc import ImageRefMode, PictureItem, TableItem, DoclingDocument

def extraction_pipeline_rapid_ocr(input_file, output_dir, resolution):
    start_time = time.time()

    # Ensure output directory exists
    os.makedirs(output_dir, exist_ok=True)

    # Get the name of the PDF (without extension) for folder structure
    pdf_name = Path(input_file).stem
    pdf_output_dir = os.path.join(output_dir, pdf_name)

    # Create nested folders for the specific PDF with resolution
    docling_folder = os.path.join(pdf_output_dir, f'DoclingDocument_{resolution}')
    markdown_folder = os.path.join(pdf_output_dir, f'Markdown_{resolution}')
    image_folder = os.path.join(pdf_output_dir, f'Images_{resolution}')

    os.makedirs(docling_folder, exist_ok=True)
    os.makedirs(markdown_folder, exist_ok=True)
    os.makedirs(image_folder, exist_ok=True)

    # Set up pipeline options with the given resolution
    pipeline_options = PdfPipelineOptions()
    pipeline_options.do_ocr = True
    pipeline_options.do_table_structure = True
    pipeline_options.table_structure_options.do_cell_matching = True
    pipeline_options.images_scale = resolution
    pipeline_options.generate_page_images = True
    pipeline_options.generate_picture_images = True
    pipeline_options.table_structure_options.mode = TableFormerMode.ACCURATE
    pipeline_options.ocr_options = RapidOcrOptions()

    # Initialize document converter
    doc_converter = DocumentConverter(
        format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
    )

    # Convert the input file
    conversion_result = doc_converter.convert(input_file)

    # Save the JSON representation of the document
    docling_doc = conversion_result.document
    json_output_path = os.path.join(docling_folder, "doc.json")
    with open(json_output_path, "w") as fp:
        fp.write(json.dumps(docling_doc.export_to_dict()))

    # Save the Markdown file
    markdown_content = conversion_result.document.export_to_markdown(image_mode='EMBEDDED')
    markdown_output_path = os.path.join(markdown_folder, f"{pdf_name}.md")
    with open(markdown_output_path, "w") as fp:
        fp.write(markdown_content)

    # Save images for each page
    for page_no, page in conversion_result.document.pages.items():
        page_image_filename = os.path.join(image_folder, f"{pdf_name}-page-{page_no}.png")
        with open(page_image_filename, "wb") as fp:
            page.image.pil_image.save(fp, format="PNG")


    end_time = time.time() - start_time
    logging.info(f"Document converted and files exported in {end_time:.2f} seconds.")
    
extraction_pipeline_rapid_ocr(
    input_file='sample_pdf.pdf',
    output_dir='outputs/github_test',
    resolution=1.0
)

This also produced inconsistent Markdown:

## Sample Fillable PDF Form

Fillable  PDF  forms  can  be  customised  to  your  needs.  They  allow  form  recipients  to  fill  out information on screen like a web page form, then print, save or email the results.

Name

Date

Address

## Fillable Fields

What are your favourite activities? Reading Walking Music Other: /Yes /Yes

## Tick Boxes (multiple options can be selected)

What is your favourite activity? Reading Walking Music Other:

## Radio Buttons (only one option can be selected)

These buttons can be printable or visible only when onscreen.

## Buttons (to prompt certain actions)

Test 123

Jan

1 2012

1, springfield road, uk

<!-- image -->

Please find the example PDF and the extracted page image attached: sample_pdf.pdf, sample_pdf-page-1
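To make the inconsistency in the Markdown above easier to measure, a small check along the following lines may help. It is only a sketch: `locate_values` and `SAMPLE_MD` are hypothetical names, the sample text is a shortened copy of the export above, and the pairing of "Test 123" with Name and "1, springfield road, uk" with Address is an assumption about the sample form.

```python
from __future__ import annotations

# Shortened copy of the Markdown export shown above (assumed representative).
SAMPLE_MD = """Name

Date

Address

## Fillable Fields

Test 123

1, springfield road, uk
"""


def locate_values(markdown: str, expected: dict[str, str]) -> dict[str, dict]:
    """For each label/value pair, report whether the value occurs in the
    Markdown and how many non-blank lines separate it from its label."""
    lines = [ln.strip() for ln in markdown.splitlines() if ln.strip()]
    report = {}
    for label, value in expected.items():
        # First line that is exactly the label, and first line containing the value.
        label_idx = next((i for i, ln in enumerate(lines) if ln == label), None)
        value_idx = next((i for i, ln in enumerate(lines) if value in ln), None)
        distance = None
        if label_idx is not None and value_idx is not None:
            distance = value_idx - label_idx
        report[label] = {"found": value_idx is not None, "distance": distance}
    return report


# Assumed label -> value pairing for the sample form.
report = locate_values(
    SAMPLE_MD,
    {"Name": "Test 123", "Address": "1, springfield road, uk"},
)
for label, info in report.items():
    print(label, info)
```

A distance of 1 would mean the value directly follows its label. Larger distances indicate that the filled-in text was emitted elsewhere in the document, as in the export above.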

Jan 07 '25 16:01 jackdorney1999

@jackdorney1999 The PDF parser should be able to extract the text from a filled field and also record that it comes from a filled-out field. I will sync with @cau-git on how we can propagate this through the Docling pipeline.
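Until this is propagated through the pipeline, one possible workaround for the page images is to render them separately with pypdfium2 and its form environment enabled. This is a sketch under assumptions, not Docling's own method: it assumes pypdfium2 4.x is installed and that the missing values are AcroForm widgets that are only drawn once `init_forms()` has been called. The helper names `page_image_path` and `render_pages_with_forms` are hypothetical.

```python
from __future__ import annotations

import os


def page_image_path(image_folder: str, pdf_name: str, page_no: int) -> str:
    """Build the same '<name>-page-<n>.png' path used in the script above."""
    return os.path.join(image_folder, f"{pdf_name}-page-{page_no}.png")


def render_pages_with_forms(input_file: str, image_folder: str, scale: float = 1.0) -> list[str]:
    """Render every page with form widgets drawn and save each one as a PNG."""
    # Imported lazily so this module still loads where pypdfium2 is missing.
    import pypdfium2 as pdfium

    os.makedirs(image_folder, exist_ok=True)
    pdf_name = os.path.splitext(os.path.basename(input_file))[0]
    saved = []

    pdf = pdfium.PdfDocument(input_file)
    try:
        # The form environment must be initialised before any page is loaded;
        # otherwise filled-in widget values are not drawn.
        pdf.init_forms()
        for index in range(len(pdf)):
            page = pdf[index]
            # scale=1.0 corresponds to 72 DPI in pypdfium2.
            bitmap = page.render(scale=scale, may_draw_forms=True)
            out_path = page_image_path(image_folder, pdf_name, index + 1)
            bitmap.to_pil().save(out_path, format="PNG")
            saved.append(out_path)
            page.close()
    finally:
        pdf.close()
    return saved
```

Calling `render_pages_with_forms('sample_pdf.pdf', 'outputs/github_test/forms')` would then write one PNG per page, numbered from 1 like the pages in the script above. Whether the filled values actually appear depends on how the form was authored.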

Jan 11 '25 13:01 PeterStaar-IBM

@PeterStaar-IBM Thanks for reaching out. Please let me know if you require any further information from me.

Jan 16 '25 09:01 jackdorney1999