Inconsistent Markdown Output with generate_multimodal_pages Method in Docling
Bug
I’ve encountered a recurring issue with the Docling library where the Markdown output generated from PDF documents is inconsistent. Specifically, when using the generate_multimodal_pages method, I frequently find that content is randomly lost in the content_md column. This has happened with multiple PDF files, leading to incomplete and poorly structured Markdown outputs, especially for tables and complex layouts. The lost content is usually found in the ocr_content column.
Steps to reproduce
- Use the following code to convert a PDF document directly to Markdown:

      from docling.document_converter import DocumentConverter

      source = "https://arxiv.org/pdf/2408.09869"  # document per local path or URL
      converter = DocumentConverter()
      result = converter.convert(source)
      print(result.document.export_to_markdown())

  This method produces a complete and well-structured Markdown representation of the PDF.
- Use the following code to generate Markdown page by page:

      import datetime
      import logging
      from pathlib import Path

      import pandas as pd

      from docling.datamodel.base_models import InputFormat
      from docling.datamodel.pipeline_options import PdfPipelineOptions
      from docling.document_converter import DocumentConverter, PdfFormatOption
      from docling.utils.export import generate_multimodal_pages
      from docling.utils.utils import create_hash

      # Assumed value; the original snippet does not show where this is defined.
      IMAGE_RESOLUTION_SCALE = 2.0


      def main():
          logging.basicConfig(level=logging.INFO)

          input_doc_path = Path("./tests/data/pdf/2206.01062.pdf")
          output_dir = Path("scratch")

          # Keep page images so they can be stored alongside the text.
          pipeline_options = PdfPipelineOptions()
          pipeline_options.images_scale = IMAGE_RESOLUTION_SCALE
          pipeline_options.generate_page_images = True

          doc_converter = DocumentConverter(
              format_options={
                  InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
              }
          )

          conv_res = doc_converter.convert(input_doc_path)

          output_dir.mkdir(parents=True, exist_ok=True)

          # One row per page, as yielded by generate_multimodal_pages.
          rows = []
          for (
              content_text,
              content_md,
              content_dt,
              page_cells,
              page_segments,
              page,
          ) in generate_multimodal_pages(conv_res):
              dpi = page._default_image_scale * 72

              rows.append(
                  {
                      "document": conv_res.input.file.name,
                      "hash": conv_res.input.document_hash,
                      "page_hash": create_hash(
                          conv_res.input.document_hash + ":" + str(page.page_no - 1)
                      ),
                      "image": {
                          "width": page.image.width,
                          "height": page.image.height,
                          "bytes": page.image.tobytes(),
                      },
                      "cells": page_cells,
                      "contents": content_text,
                      "contents_md": content_md,
                      "contents_dt": content_dt,
                      "segments": page_segments,
                      "extra": {
                          "page_num": page.page_no + 1,
                          "width_in_points": page.size.width,
                          "height_in_points": page.size.height,
                          "dpi": dpi,
                      },
                  }
              )

          # Generate one Excel file from all documents
          df_result = pd.json_normalize(rows)
          now = datetime.datetime.now()
          output_filename = output_dir / f"multimodal_{now:%Y-%m-%d_%H%M%S}.xlsx"
          df_result.to_excel(output_filename, index=False)


      if __name__ == "__main__":
          main()

- Compare the content_md column (written as contents_md in the code above) with the Markdown output from the first step; you should see that they don't match.
- The issue is more prevalent with PDFs that contain many tables.
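To make the comparison step less manual, a small helper along these lines can list which lines of the full export never show up in any per-page content_md. This is only a sketch: missing_lines is a hypothetical helper of my own, not part of Docling, and because it matches whole lines after whitespace normalization it will over-report wherever the two code paths format the same content differently.

```python
def _normalize(line: str) -> str:
    """Collapse runs of whitespace so cosmetic spacing differences are ignored."""
    return " ".join(line.split())


def missing_lines(full_md: str, page_mds: list[str]) -> list[str]:
    """Return non-empty lines of full_md that appear in none of page_mds.

    Hypothetical helper (not part of Docling). Lines are compared after
    whitespace normalization and returned in the order they occur in full_md.
    """
    seen = {
        _normalize(line)
        for page_md in page_mds
        for line in page_md.splitlines()
    }
    missing = []
    for line in full_md.splitlines():
        norm = _normalize(line)
        if norm and norm not in seen:
            missing.append(norm)
    return missing


if __name__ == "__main__":
    # Toy data standing in for the two exports: the table is lost per page.
    full = "# Title\n\nIntro text.\n\n| a | b |\n|---|---|\n| 1 | 2 |\n"
    pages = ["# Title\n\nIntro   text.\n"]
    for line in missing_lines(full, pages):
        print(line)
```

With real data, full_md would be the string from result.document.export_to_markdown() and page_mds the content_md values collected in the loop over generate_multimodal_pages.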
Python version
3.12
@Yash8745 I cannot verify this right now, but I know that the generate_multimodal_pages utility is terribly outdated and still works on a legacy representation of the Docling output. It clearly needs an update to support the current export methods. Would you volunteer to look into this?
I stumbled upon this myself too. I ended up doing the following, which is kinda dodgy but works well enough:
pages = result.document.export_to_markdown(
    page_break_placeholder="<PAGEBREAK/>"
).split("<PAGEBREAK/>")
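A slightly more defensive take on that workaround, as a sketch only. It assumes the placeholder string never occurs in the document text itself and that one placeholder is emitted between each pair of consecutive pages; I have not checked how blank pages behave. split_pages and PAGE_BREAK are hypothetical names, and the helper works on a plain string so it can be tried without Docling.

```python
PAGE_BREAK = "<PAGEBREAK/>"  # assumption: never appears in the document text


def split_pages(markdown: str, placeholder: str = PAGE_BREAK) -> dict[int, str]:
    """Split a placeholder-delimited Markdown export into {page_no: markdown}.

    Hypothetical helper: page numbers start at 1 and follow chunk order.
    Surrounding whitespace is stripped from each page, and empty chunks are
    kept so the numbering stays aligned with the placeholders.
    """
    return {
        page_no: chunk.strip()
        for page_no, chunk in enumerate(markdown.split(placeholder), start=1)
    }


if __name__ == "__main__":
    # Stand-in for result.document.export_to_markdown(page_break_placeholder=PAGE_BREAK)
    exported = "# Page one\n\n<PAGEBREAK/>\n\nSecond page\n"
    for page_no, page_md in split_pages(exported).items():
        print(page_no, repr(page_md))
```

Keying by page number rather than returning a bare list makes it easier to line the pages up against the rows produced by generate_multimodal_pages when checking which content went missing.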