
Inconsistent Markdown Output with generate_multimodal_pages Method in Docling

Open Yash8745 opened this issue 7 months ago • 1 comments

Bug

I’ve run into a recurring issue with Docling: the Markdown generated from PDF documents is inconsistent. When I use generate_multimodal_pages, content is frequently missing from the content_md output (the contents_md column in the code below), seemingly at random. This has happened with multiple PDF files and leaves the Markdown incomplete and poorly structured, especially for tables and complex layouts. The missing content usually shows up in the ocr_content column instead.

Steps to reproduce

  1. Use the following code to convert a PDF document directly to Markdown:

    from docling.document_converter import DocumentConverter

    source = "https://arxiv.org/pdf/2408.09869"  # document per local path or URL
    converter = DocumentConverter()
    result = converter.convert(source)
    print(result.document.export_to_markdown())
    

    This method produces a complete and well-structured Markdown representation of the PDF.

  2. Use the following code to generate Markdown page by page:

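    import datetime
    import logging
    from pathlib import Path

    import pandas as pd

    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import PdfPipelineOptions
    from docling.document_converter import DocumentConverter, PdfFormatOption
    from docling.utils.export import generate_multimodal_pages
    from docling.utils.utils import create_hash

    # Not defined in the snippet as posted; 2.0 is assumed here
    # (a scale of 1.0 corresponds to 72 DPI, see the dpi computation below).
    IMAGE_RESOLUTION_SCALE = 2.0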
    
    def main():
        logging.basicConfig(level=logging.INFO)
    
        input_doc_path = Path("./tests/data/pdf/2206.01062.pdf")
        output_dir = Path("scratch")
    
        pipeline_options = PdfPipelineOptions()
        pipeline_options.images_scale = IMAGE_RESOLUTION_SCALE
        pipeline_options.generate_page_images = True
    
        doc_converter = DocumentConverter(
            format_options={
                InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
            }
        )
    
        conv_res = doc_converter.convert(input_doc_path)
    
        output_dir.mkdir(parents=True, exist_ok=True)
    
        rows = []
        for (
            content_text,
            content_md,
            content_dt,
            page_cells,
            page_segments,
            page,
        ) in generate_multimodal_pages(conv_res):
            dpi = page._default_image_scale * 72
    
            rows.append(
                {
                    "document": conv_res.input.file.name,
                    "hash": conv_res.input.document_hash,
                    "page_hash": create_hash(
                        conv_res.input.document_hash + ":" + str(page.page_no - 1)
                    ),
                    "image": {
                        "width": page.image.width,
                        "height": page.image.height,
                        "bytes": page.image.tobytes(),
                    },
                    "cells": page_cells,
                    "contents": content_text,
                    "contents_md": content_md,
                    "contents_dt": content_dt,
                    "segments": page_segments,
                    "extra": {
                        "page_num": page.page_no + 1,
                        "width_in_points": page.size.width,
                        "height_in_points": page.size.height,
                        "dpi": dpi,
                    },
                }
            )
    
        # Generate one Excel file from all documents
        df_result = pd.json_normalize(rows)
        now = datetime.datetime.now()
        output_filename = output_dir / f"multimodal_{now:%Y-%m-%d_%H%M%S}.xlsx"
        df_result.to_excel(output_filename, index=False)  # needs an Excel writer such as openpyxl

    if __name__ == "__main__":
        main()
    
  3. Compare the contents_md column with the Markdown that export_to_markdown() produces in step 1 for the same document; the two don't match.

  4. The issue is more prevalent with PDFs that contain many tables.
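To make the comparison in step 3 less manual, something like the following could list which lines of the full Markdown are absent from the per-page output. This is only a rough sketch: find_missing_lines and normalize are hypothetical helper names, not part of Docling, and the check is a plain line-by-line match after collapsing whitespace, so a line that is merely re-wrapped or re-formatted would also be reported as missing.

```python
def normalize(line: str) -> str:
    """Collapse runs of whitespace so pure layout differences are ignored."""
    return " ".join(line.split())


def find_missing_lines(full_md: str, page_mds: list[str]) -> list[str]:
    """Return the non-empty lines of full_md that occur in none of page_mds.

    full_md  -- Markdown of the whole document (e.g. from export_to_markdown())
    page_mds -- per-page Markdown strings (e.g. the contents_md values)
    """
    # Collect every normalized line that appears on any page.
    seen = set()
    for page_md in page_mds:
        for line in page_md.splitlines():
            seen.add(normalize(line))

    # Keep the original (un-normalized) text of lines that were never seen.
    missing = []
    for line in full_md.splitlines():
        key = normalize(line)
        if key and key not in seen:
            missing.append(line)
    return missing


# Toy data standing in for real converter output: the table is lost per page.
full = "# Title\n\nFirst paragraph.\n\n| a | b |\n|---|---|\n| 1 | 2 |\n"
pages = ["# Title\n\nFirst paragraph.\n"]
for lost in find_missing_lines(full, pages):
    print("missing:", lost)
```

With real data, full_md would be the string from step 1 and page_mds the contents_md values collected in step 2.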

Python version

3.12

Yash8745 avatar May 13 '25 08:05 Yash8745

@Yash8745 I cannot verify this right now, but I know that the generate_multimodal_pages utility is terribly outdated and works on a legacy representation of the Docling output. It clearly needs an update to support the current export methods. Would you volunteer to look into this?

cau-git avatar May 21 '25 12:05 cau-git

I stumbled upon this myself too. I ended up doing the following, which is kinda dodgy but works well enough:

    pages = result.document.export_to_markdown(
        page_break_placeholder="<PAGEBREAK/>"
    ).split("<PAGEBREAK/>")

tomtomau avatar Jun 10 '25 02:06 tomtomau
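
For anyone who wants page numbers attached to the chunks from that workaround, a small wrapper might look like the sketch below. It works on a plain string so it can be tried without Docling. split_pages is a hypothetical helper name, and it assumes the placeholder is emitted exactly once between consecutive pages and never occurs in the document text itself.

```python
PAGE_BREAK = "<PAGEBREAK/>"  # same placeholder string as in the workaround above


def split_pages(markdown: str, placeholder: str = PAGE_BREAK) -> dict[int, str]:
    """Map 1-based page numbers to the Markdown between placeholders.

    Assumes the placeholder appears once between consecutive pages and
    nowhere inside the document text.
    """
    return {
        page_no: chunk.strip()  # drop the blank lines around each page break
        for page_no, chunk in enumerate(markdown.split(placeholder), start=1)
    }


# Toy string standing in for export_to_markdown(page_break_placeholder=PAGE_BREAK)
sample = "# Page one\n\nSome text.\n\n<PAGEBREAK/>\n\n| a | b |\n|---|---|\n| 1 | 2 |"
for page_no, page_md in split_pages(sample).items():
    print(page_no, repr(page_md))
```

A document without any page break comes back as a single entry under page 1.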