export_to_markdown with page_no parameter adds a table at the end of the markdown
Bug
I generated a ConversionResult for a large PDF file using the SmolDocling VLM options (smoldocling_vlm_mlx_conversion_options). When I call export_to_markdown with the page_no parameter, a table is appended to the end of each generated markdown text. When I omit the parameter and export the whole conversion result at once, this table is not added to the markdown.
When I generate the ConversionResult with a smaller page_range, for example (1, 50), export_to_markdown(page_no=...) also works without any issues. So the problem appears only when converting the whole document and using the page_no parameter.
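Since smaller page ranges convert cleanly, one possible stopgap is to convert the document in chunks. The sketch below only computes 1-based, inclusive page ranges; `chunk_page_ranges` and the chunk size of 50 are illustrative assumptions, not part of Docling, and whether chunked conversion avoids the bug for every chunk is untested.

```python
from typing import Iterator, Tuple


def chunk_page_ranges(total_pages: int, chunk_size: int = 50) -> Iterator[Tuple[int, int]]:
    """Yield 1-based, inclusive (start, end) page ranges covering the document."""
    if total_pages < 0 or chunk_size < 1:
        raise ValueError("total_pages must be >= 0 and chunk_size must be >= 1")
    for start in range(1, total_pages + 1, chunk_size):
        # Clamp the last chunk to the end of the document.
        yield start, min(start + chunk_size - 1, total_pages)


# Each tuple could then be passed as page_range to the converter, e.g.
#   docling_converter.convert(file_path, page_range=page_range)
for page_range in chunk_page_ranges(120):
    print(page_range)
```

The page count (120 here) is a placeholder; it would have to come from the PDF itself.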
Steps to reproduce
The PDF file is attached: IBM_V7R4_RPG_Programmers Guide.pdf
I use the following code to generate the ConversionResult. The conversion takes quite long because of the document's size, so I saved the result to JSON with docling_result.document.save_to_json(). Unfortunately, the file is too big to attach to the issue.
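from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import (
    VlmPipelineOptions,
    smoldocling_vlm_mlx_conversion_options,
)
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.pipeline.vlm_pipeline import VlmPipeline

file_path = "IBM_V7R4_RPG_Programmers Guide.pdf"
# Use the SmolDocling VLM (MLX variant) instead of the backend text
pipeline_options = VlmPipelineOptions(
    force_backend_text=False,
    vlm_options=smoldocling_vlm_mlx_conversion_options,
)
docling_converter = DocumentConverter(format_options={
    InputFormat.PDF: PdfFormatOption(
        pipeline_cls=VlmPipeline,
        pipeline_options=pipeline_options,
    )
})
# Convert the PDF with Docling
docling_result = docling_converter.convert(file_path)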
When I execute the following line with the export method:
print(docling_result.document.export_to_markdown(page_no=1))
I get the following output. It ends with a table that is not part of the document/result.
IBM i 7.4
Programming IBM Rational Development Studio for i ILE RPG Programmer's Guide
Line Number Source Specifications Comments -> Scr Seg Number 33 1 2 26 35 34 1 2 26 35 35 1 2 26 35 36 1 2 26 35 37 1 2 26 35 38 1 2 26 35 39 1 2 26 35 40 1 2 26 35 41 1 2 26 35 42 1 2 26 35 43 1 2 26 35 44 1 2 26 35 45 1 2 26 35 46 1 2 26 35 47 1 2 26 35 48 1 2 26 35 49 1 2 26 35 50 1 2 26 35 51 1 2 26 35 52 1 2 26 35 53 1 2 26 35 54 1 2 26 35 55 1 2 26 35 56 1 2 26 35 57 1 2 26 35 58 1 2 26 35
This table is only added when using the page_no parameter and when converting the whole document. The page_no parameter is needed because empty pages are removed when exporting with page_break_placeholder.
Docling version
Docling version: 2.30.0
Docling Core version: 2.26.4
Docling IBM Models version: 3.4.1
Docling Parse version: 4.0.0
Python: cpython-311 (3.11.9)
Platform: macOS-15.3.2-arm64-arm-64bit
Python version
Python 3.11.9
Thanks in advance for any help!
I'm gonna check
I'm experiencing a similar issue that appears related to this bug. In my case, I'm working with a large PDF document (hundreds of pages) and processing it using Docling.
The problem I'm facing
When using the export_to_markdown function with the page_no parameter to extract individual pages, each page's output contains page break markers (\n\n---\n\n) for all subsequent pages in the document.
For example, when exporting page 1 of a large document, the output includes hundreds of instances of \n\n---\n\n appended to the actual page content. The number of these page breaks corresponds exactly to the number of subsequent pages in the document.
Example code
# Convert the PDF
conversion_result = pdf_converter.convert(pdf_path, raises_on_error=False)
# Export page 1
markdown_content = conversion_result.document.export_to_markdown(
page_no=1,
image_mode=ImageRefMode.PLACEHOLDER,
image_placeholder="![Image]",
indent=2
)
Example output
The output starts with the correct content for page 1, but then includes hundreds of page break markers:
![Image]
## Document Title
Some content here...
![Image]
---
---
---
# ...and so on for hundreds more page breaks...
Observations
- This only happens with large documents (hundreds of pages)
- The number of extra page breaks exactly matches (total_pages - current_page)
- It occurs even when not explicitly setting a page_break_placeholder option
- When exporting the full document at once, the page breaks appear correctly between pages
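The count in the second observation can be checked with a small helper. This is a minimal sketch, assuming each stray marker is a line containing only `---`; `count_trailing_breaks` is a hypothetical name and not part of Docling, and the sample text is made up for illustration.

```python
def count_trailing_breaks(markdown: str, marker: str = "---") -> int:
    """Count consecutive marker-only lines at the end of the text, ignoring blank lines."""
    count = 0
    for line in reversed(markdown.splitlines()):
        stripped = line.strip()
        if not stripped:
            continue  # skip blank lines between markers
        if stripped != marker:
            break  # reached real page content
        count += 1
    return count


# Hypothetical sample: page 1 of a 4-page document with three stray breaks appended.
sample = "## Document Title\n\nSome content here...\n\n---\n\n---\n\n---\n\n"
total_pages, current_page = 4, 1
print(count_trailing_breaks(sample) == total_pages - current_page)
```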
This appears to be the same underlying issue as reported here, but with page breaks being incorrectly included rather than a table. In both cases, export_to_markdown with page_no is including content from beyond the requested page.
My current workaround is to manually split the content at the first page break occurrence, but a proper fix would be greatly appreciated.
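The workaround might look like the following. It is only a sketch, assuming the marker is exactly `\n\n---\n\n` as seen in the output; `keep_first_page` is a hypothetical helper name. Note that it also cuts off a page that legitimately contains a horizontal rule, which is one reason a proper fix is preferable.

```python
PAGE_BREAK = "\n\n---\n\n"  # marker observed in the exported output (assumed here)


def keep_first_page(markdown: str, marker: str = PAGE_BREAK) -> str:
    """Return everything before the first page break marker."""
    head, _sep, _rest = markdown.partition(marker)
    return head


# Made-up export of a single page followed by stray page breaks.
exported = "![Image]\n\n## Document Title\n\nSome content here...\n\n---\n\n---\n\n"
print(keep_first_page(exported))
```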
Docling version: 2.34.0
Python: 3.10
Platform: Linux