Inconsistent Markdown Output with generate_multimodal_pages Method in Docling

Bug

I’ve encountered a recurring issue with Docling: the Markdown generated from PDF documents is inconsistent. Specifically, when using the generate_multimodal_pages function, content is frequently missing, seemingly at random, from the content_md output (stored in the contents_md column in the code below). This has happened with multiple PDF files and leaves the Markdown incomplete and poorly structured, especially for tables and complex layouts. The missing content usually turns up in the ocr_content column instead.

Steps to reproduce

Use the following code to convert a PDF document directly to Markdown:

from docling.document_converter import DocumentConverter

source = "https://arxiv.org/pdf/2408.09869"  # document per local path or URL
converter = DocumentConverter()
result = converter.convert(source)
print(result.document.export_to_markdown())

This method produces a complete and well-structured Markdown representation of the PDF.

Use the following code to generate Markdown page by page:

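import datetime
import logging
from pathlib import Path

import pandas as pd

from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.utils.export import generate_multimodal_pages
from docling.utils.utils import create_hash

# Not defined in the original snippet; 2.0 is an assumed value
IMAGE_RESOLUTION_SCALE = 2.0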

def main():
    logging.basicConfig(level=logging.INFO)

    input_doc_path = Path("./tests/data/pdf/2206.01062.pdf")
    output_dir = Path("scratch")

    pipeline_options = PdfPipelineOptions()
    pipeline_options.images_scale = IMAGE_RESOLUTION_SCALE
    pipeline_options.generate_page_images = True

    doc_converter = DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
        }
    )

    conv_res = doc_converter.convert(input_doc_path)

    output_dir.mkdir(parents=True, exist_ok=True)

    rows = []
    for (
        content_text,
        content_md,
        content_dt,
        page_cells,
        page_segments,
        page,
    ) in generate_multimodal_pages(conv_res):
        dpi = page._default_image_scale * 72

        rows.append(
            {
                "document": conv_res.input.file.name,
                "hash": conv_res.input.document_hash,
                "page_hash": create_hash(
                    conv_res.input.document_hash + ":" + str(page.page_no - 1)
                ),
                "image": {
                    "width": page.image.width,
                    "height": page.image.height,
                    "bytes": page.image.tobytes(),
                },
                "cells": page_cells,
                "contents": content_text,
                "contents_md": content_md,
                "contents_dt": content_dt,
                "segments": page_segments,
                "extra": {
                    "page_num": page.page_no + 1,
                    "width_in_points": page.size.width,
                    "height_in_points": page.size.height,
                    "dpi": dpi,
                },
            }
        )

    # Generate one Excel file from all documents
    df_result = pd.json_normalize(rows)
    now = datetime.datetime.now()
    output_filename = output_dir / f"multimodal_{now:%Y-%m-%d_%H%M%S}.xlsx"
    df_result.to_excel(output_filename, index=False)

if __name__ == "__main__":
    main()

Compare the contents_md column with the Markdown produced by export_to_markdown() in the first snippet; you should see that they don't match.
The issue is more prevalent with PDFs that contain many tables.
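
To make the mismatch concrete, here is a minimal sketch of the comparison. It assumes the full Markdown and the per-page contents_md values are already available as plain strings; the helper name missing_lines and the sample strings are hypothetical and only for illustration, not part of Docling.

```python
def missing_lines(full_md: str, page_mds: list[str]) -> list[str]:
    """Return non-empty lines of full_md that appear in none of the per-page strings."""
    # Collect every non-empty, stripped line seen in any per-page Markdown
    seen = {
        line.strip()
        for md in page_mds
        for line in md.splitlines()
        if line.strip()
    }
    # Keep the lines of the full export that never show up page by page
    return [
        line.strip()
        for line in full_md.splitlines()
        if line.strip() and line.strip() not in seen
    ]


# Hypothetical sample data standing in for the real exports
full_md = "# Title\n\nIntro text.\n\n| a | b |\n|---|---|\n| 1 | 2 |\n"
page_mds = ["# Title\n\nIntro text.\n"]

for line in missing_lines(full_md, page_mds):
    print("missing:", line)
```

A line-level comparison like this is crude (reflowed or re-wrapped text will also be reported), but it is usually enough to show which tables were dropped.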

Python version

3.12

May 13 '25 08:05 Yash8745

@Yash8745 I cannot verify this right now, but I know that the generate_multimodal_pages utility is terribly outdated and works on a legacy representation of the Docling output. It clearly needs an update to support the current export methods. Would you volunteer to look into this?

May 21 '25 12:05 cau-git

I stumbled upon this myself too. I ended up doing this, which is kinda dodgy but works well enough:

pages = result.document.export_to_markdown(
    page_break_placeholder="<PAGEBREAK/>"
).split("<PAGEBREAK/>")
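
For anyone wanting to reuse this, a rough sketch of the splitting step as a small helper, shown on a hard-coded string so it runs without Docling. The name split_pages and the sample text are hypothetical; the real input would be the string returned by export_to_markdown with a page_break_placeholder set.

```python
def split_pages(markdown: str, placeholder: str = "<PAGEBREAK/>") -> list[str]:
    """Split a Markdown export into per-page chunks at the placeholder."""
    # Empty chunks are kept on purpose so list index + 1 still equals the page number
    return [chunk.strip() for chunk in markdown.split(placeholder)]


# Hypothetical sample standing in for a real export
sample = "# Page one\n\n<PAGEBREAK/>\n\nPage two text\n\n<PAGEBREAK/>\n\n"
pages = split_pages(sample)

for page_no, page_md in enumerate(pages, start=1):
    print(page_no, repr(page_md))
```

The placeholder should be a string that cannot occur in the document text itself, otherwise pages get split in the wrong place.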

Jun 10 '25 02:06 tomtomau
