
export_to_markdown with page_no - parameter adds table at the end of the markdown

Open LukasBreit opened this issue 8 months ago • 2 comments

Bug

I generated a ConversionResult for a large PDF file using the SmolDocling VLM options (smoldocling_vlm_mlx_conversion_options). When I call export_to_markdown with the page_no parameter, a table is appended to the end of each generated markdown text. When I omit the parameter and export the whole conversion result at once, the table is not added.

When I generate the ConversionResult with a smaller page_range, for example (1, 50), export_to_markdown(page_no=..) works without any issues. So the problem appears only when the whole document is converted and the page_no parameter is used.
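Since a conversion limited to (1, 50) exports cleanly, one possible stopgap is to convert the document in chunks of that size. The sketch below only computes the page ranges; the helper name chunked_page_ranges and the chunk size are illustrative assumptions, and I have not verified that chunking avoids the problem for every range.

```python
def chunked_page_ranges(total_pages: int, chunk_size: int = 50) -> list[tuple[int, int]]:
    """Return 1-based, inclusive (start, end) page ranges that cover the document."""
    if total_pages < 0 or chunk_size < 1:
        raise ValueError("total_pages must be >= 0 and chunk_size must be >= 1")
    return [
        (start, min(start + chunk_size - 1, total_pages))
        for start in range(1, total_pages + 1, chunk_size)
    ]


# Hypothetical usage with the converter from this report (total_pages must be known):
# for page_range in chunked_page_ranges(total_pages):
#     result = docling_converter.convert(file_path, page_range=page_range)
```

Each range can then be passed as page_range to convert, at the cost of one conversion run per chunk.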

Steps to reproduce

The PDF file is attached: IBM_V7R4_RPG_Programmers Guide.pdf

I am using the following code to generate the ConversionResult. The conversion takes quite long because of the document's size, so I saved the result to JSON with docling_result.document.save_to_json(). Unfortunately, the file is too big to attach to the issue.

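# Imports as laid out in Docling 2.30.0
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import (
    VlmPipelineOptions,
    smoldocling_vlm_mlx_conversion_options,
)
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.pipeline.vlm_pipeline import VlmPipeline

file_path = "IBM_V7R4_RPG_Programmers Guide.pdf"

# Use the SmolDocling VLM (MLX variant) instead of the default PDF pipeline
pipeline_options = VlmPipelineOptions(
    force_backend_text=False,
    vlm_options=smoldocling_vlm_mlx_conversion_options,
)
docling_converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_cls=VlmPipeline,
            pipeline_options=pipeline_options,
        )
    }
)

# Convert the PDF with Docling
docling_result = docling_converter.convert(file_path)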

When I run the export with print(docling_result.document.export_to_markdown(page_no=1)), I get the following output. It ends with a table that is not part of the document/result.

IBM i 7.4

Programming IBM Rational Development Studio for i ILE RPG Programmer's Guide

Line Number Source Specifications Comments -> Scr Seg Number
33 1 2 26 35
34 1 2 26 35
35 1 2 26 35
36 1 2 26 35
37 1 2 26 35
38 1 2 26 35
39 1 2 26 35
40 1 2 26 35
41 1 2 26 35
42 1 2 26 35
43 1 2 26 35
44 1 2 26 35
45 1 2 26 35
46 1 2 26 35
47 1 2 26 35
48 1 2 26 35
49 1 2 26 35
50 1 2 26 35
51 1 2 26 35
52 1 2 26 35
53 1 2 26 35
54 1 2 26 35
55 1 2 26 35
56 1 2 26 35
57 1 2 26 35
58 1 2 26 35

This table is added only when the page_no parameter is used and the whole document has been converted. I need the page_no parameter because splitting a full export on page_break_placeholder drops empty pages.
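For context, the per-page export I rely on looks roughly like the sketch below. It assumes the document object exposes a pages mapping keyed by page number and an export_to_markdown(page_no=...) method, as DoclingDocument does. The helper name export_pages is my own, and it does not work around the bug.

```python
def export_pages(doc) -> dict[int, str]:
    """Export every page separately so that empty pages keep their own entry."""
    return {
        page_no: doc.export_to_markdown(page_no=page_no)
        for page_no in sorted(doc.pages)
    }


# Hypothetical usage with the result from this report:
# pages_md = export_pages(docling_result.document)
# print(pages_md[1])
```

An empty page shows up as an empty string under its page number instead of disappearing from the output.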

Docling version

Docling version: 2.30.0
Docling Core version: 2.26.4
Docling IBM Models version: 3.4.1
Docling Parse version: 4.0.0
Python: cpython-311 (3.11.9)
Platform: macOS-15.3.2-arm64-arm-64bit

Python version

Python 3.11.9

Thanks in advance for any help!

LukasBreit, Apr 23 '25 09:04

I'm gonna check

rickymaggio02, May 21 '25 09:05

I'm experiencing a similar issue that appears related to this bug. In my case, I'm working with a large PDF document (hundreds of pages) and processing it using Docling.

The problem I'm facing

When using the export_to_markdown function with the page_no parameter to extract individual pages, each page's output contains page break markers (\n\n---\n\n) for all subsequent pages in the document.

For example, when exporting page 1 of a large document, the output includes hundreds of instances of \n\n---\n\n appended to the actual page content. The number of these page breaks corresponds exactly to the number of subsequent pages in the document.

Example code

from docling_core.types.doc import ImageRefMode

# Convert the PDF (pdf_converter and pdf_path are defined elsewhere)
conversion_result = pdf_converter.convert(pdf_path, raises_on_error=False)

# Export page 1
markdown_content = conversion_result.document.export_to_markdown(
    page_no=1,
    image_mode=ImageRefMode.PLACEHOLDER,
    image_placeholder="![Image]",
    indent=2,
)

Example output

The output starts with the correct content for page 1, but then includes hundreds of page break markers:

![Image]

## Document Title

Some content here...

![Image]

---

---

---

# ...and so on for hundreds more page breaks...

Observations

  1. This only happens with large documents (hundreds of pages)
  2. The number of extra page breaks exactly matches (total_pages - current_page)
  3. It occurs even when not explicitly setting a page_break_placeholder option
  4. When exporting the full document at once, the page breaks appear correctly between pages
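To check observation 2 on other documents, the trailing markers can be counted with a small helper like the one below. This is a minimal sketch: the name count_trailing_page_breaks is my own, and it assumes the marker appears as a bare --- line, as in the output above.

```python
def count_trailing_page_breaks(markdown: str, marker: str = "---") -> int:
    """Count consecutive page-break marker lines at the end of the text, ignoring blank lines."""
    count = 0
    for line in reversed(markdown.splitlines()):
        stripped = line.strip()
        if not stripped:
            continue  # blank lines between markers do not end the run
        if stripped != marker:
            break  # reached real page content
        count += 1
    return count


# Hypothetical check for an export of page `current_page`:
# assert count_trailing_page_breaks(markdown_content) == total_pages - current_page
```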

This appears to be the same underlying issue as reported here, but with page breaks being incorrectly included rather than a table. In both cases, export_to_markdown with page_no is including content from beyond the requested page.

My current workaround is to manually split the content at the first page break occurrence, but a proper fix would be greatly appreciated.
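The workaround is roughly the following sketch. The function name cut_at_first_page_break is my own, and it assumes the page content never contains a bare --- line itself (for example, a horizontal rule), since such a line would be treated as the first page break and truncate the page.

```python
def cut_at_first_page_break(markdown: str, marker: str = "---") -> str:
    """Keep only the text before the first line that consists solely of the marker."""
    kept = []
    for line in markdown.splitlines():
        if line.strip() == marker:
            break  # everything from here on is the unwanted page-break tail
        kept.append(line)
    return "\n".join(kept).rstrip()


# Hypothetical usage with the export shown earlier:
# page_one = cut_at_first_page_break(markdown_content)
```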

Docling version: 2.34.0
Python: 3.10
Platform: Linux

tsensei, May 30 '25 04:05