`page_range` parameter stops prematurely at page 32 when starting from page 30+
Bug
The `page_range` parameter in `DocumentConverter.convert()` does not extract the full requested range when the range falls in the 30s-40s region. The conversion stops prematurely at page 32 instead of continuing to the specified end page.
Pattern observed:
- `page_range=(1, 45)` → Works correctly, extracts pages 1-45
- `page_range=(30, 35)` → Stops at page 32, extracts pages 30-32 instead of 30-35
- `page_range=(30, 45)` → Stops at page 32, extracts pages 30-32 instead of 30-45
This suggests the issue occurs when:
- Start page is 30 or higher, AND
- The range is requested to go beyond page 32
- Or it may be related to the total number of pages processed
- I suspected a `sys.maxsize` issue, but ruled that out
The conversion appears to stop at the end of page 32, regardless of the requested end page.
Steps to reproduce
- Use a PDF with at least 50 pages
- Test these page ranges:
```python
from docling.document_converter import DocumentConverter

converter = DocumentConverter()

# Works fine
result1 = converter.convert("test.pdf", page_range=(1, 45)).document.export_to_markdown()
print(result1)
# Manually checking the final lines' content in the PDF file

# Fails - stops at page 32
result2 = converter.convert("test.pdf", page_range=(30, 35)).document.export_to_markdown()
print(result2)
# Manually checking the final lines' content in the PDF file

# Fails - stops at page 32
result3 = converter.convert("test.pdf", page_range=(30, 45)).document.export_to_markdown()
print(result3)
# Manually checking the final lines' content in the PDF file
```
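For a programmatic check instead of eyeballing the markdown, a small sketch like this can report which pages actually came back. It assumes `ConversionResult` exposes a `pages` list whose items carry a `page_no` attribute; the exact attribute names and page-number base (0 or 1) may differ across docling versions, so adjust accordingly.

```python
from docling.document_converter import DocumentConverter

def check_range(path: str, start: int, end: int) -> None:
    """Report which pages actually came back for a requested page_range."""
    result = DocumentConverter().convert(path, page_range=(start, end))
    # Assumption: result.pages holds one entry per converted page with a
    # page_no attribute; numbering may be 0- or 1-based depending on the
    # docling version, so read the printed output accordingly.
    got = sorted(p.page_no for p in result.pages)
    print(f"requested pages {start}-{end}, got: {got}")

check_range("test.pdf", 30, 45)  # with this bug, the list ends at 32
```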
Environment
- Python 3.11.13
- UV for package management
- docling 2.56.1
Don't hesitate to ask for extra information/context :)
Hi @Ouassim-Hamdani! I'm Dosu and I'm helping the docling team.
This looks like a bug in the threaded PDF pipeline: the draining batch size is hardcoded to 32, and if the pipeline or backend closes the output queue early, any pages beyond that batch can be dropped. There’s no explicit logic in the page range handling to stop at 32, so the cutoff is likely a side effect of how batches are processed and how the backend signals completion. This is not a documented or intended limitation.
If you need a workaround, try extracting a range that starts before page 30 (for example, 25–45) and see if you get more pages, or split your extraction into smaller chunks (e.g., 30–32, 33–35, etc.) and combine the results. If you’re able to test with a different backend or pipeline (such as a non-threaded one), that may also avoid the issue.
The root cause is likely in the threaded pipeline’s batch draining logic or the backend’s signaling, not in the page_range parameter itself. If you need more detail on the backend or want to try a code workaround, let me know!
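To make the suspected mechanism concrete, here is a minimal, self-contained sketch of that failure mode. This is not docling's actual code, and all names below are illustrative assumptions: pages flow through an output queue, the consumer drains until it sees a completion sentinel, and if that sentinel arrives after the first 32 pages, everything beyond page 32 is silently dropped.

```python
import queue

BATCH_SIZE = 32  # analogous to the hardcoded drain chunk in the pipeline

def run_pipeline(total_pages: int, closes_early: bool) -> list[int]:
    # Illustrative stand-in for the pipeline's output queue, not docling code.
    q: "queue.Queue[int | None]" = queue.Queue()
    for page in range(1, total_pages + 1):
        q.put(page)
        if closes_early and page == BATCH_SIZE:
            break  # the backend signals completion too soon
    q.put(None)  # completion sentinel ("queue closed")

    drained: list[int] = []
    while (item := q.get()) is not None:
        drained.append(item)
    return drained

pages = run_pipeline(total_pages=45, closes_early=True)
# Applying a requested range of 30-45 on top of this yields only 30-32:
print([p for p in pages if 30 <= p <= 45])  # -> [30, 31, 32]
```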
Hey, I've checked Copilot's suggested solution and implemented it in my environment. Unfortunately, it fails to solve the issue at hand.
Thanks for testing that out. Based on the latest code, the root cause is a hardcoded batch size of 32 in the threaded PDF pipeline's draining logic. When you request a page range starting at 30 or higher, the pipeline only processes up to 32 pages before stopping, regardless of your specified end page. This isn't an intended limitation—it's a side effect of how batches are drained and how the output queue is closed if the pipeline can't keep up, which matches exactly what you're seeing (source).
Could you share what Copilot suggested? That way, I can avoid repeating advice and help you find a workaround or patch that actually addresses the batch size issue. If you have any logs or error messages from your last run, those would also help pinpoint if the queue is closing early or if something else is going on.
Copilot's suggestions from the active merge request:

[docling/experimental/pipeline/threaded_layout_vlm_pipeline.py]

```diff
- batch_size = 32
+ batch_size = total_pages
```

[docling/pipeline/standard_pdf_pipeline.py]

```diff
- batch_size: int = 32  # drain chunk
+ batch_size: int = total_pages  # drain chunk
```
Changing batch_size from 32 to total_pages in the pipeline files doesn't resolve the premature stopping at page 32. The draining logic is designed to process batches from the output queue, but if the queue closes early (due to upstream issues or signaling bugs), pages beyond the first batch can be dropped—regardless of batch size. Setting batch_size to total_pages just means the pipeline tries to drain all pages at once, which can cause memory spikes and latency for large documents, but doesn't fix the underlying queue management problem. The real issue is how and when the output queue closes, not the batch size itself (code reference).
A more robust fix would involve reviewing the queue closing logic and ensuring all requested pages are processed before the queue signals completion. For now, splitting your extraction into smaller page ranges or using a non-threaded/legacy pipeline may help as a workaround. If you want to dig deeper, focus on the output queue's lifecycle and how the pipeline signals early termination.
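For completeness, the chunked workaround could look something like the sketch below. This is only an illustration of the suggestion above, with an arbitrary chunk size; given the cutoff behavior you reported, whether sub-ranges starting past page 32 actually come back is exactly what you'd want to verify.

```python
from docling.document_converter import DocumentConverter

def convert_in_chunks(path: str, start: int, end: int, chunk_size: int = 3) -> str:
    """Convert pages [start, end] in small sub-ranges and join the markdown."""
    converter = DocumentConverter()
    parts = []
    for lo in range(start, end + 1, chunk_size):
        hi = min(lo + chunk_size - 1, end)  # clamp the last chunk to `end`
        result = converter.convert(path, page_range=(lo, hi))
        parts.append(result.document.export_to_markdown())
    return "\n\n".join(parts)

print(convert_in_chunks("test.pdf", 30, 45))
```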
For anyone else going through the same trouble, here is a temporary workaround: create a temporary PDF containing only the desired pages, then parse that file in full.
```python
import tempfile

from pypdf import PdfWriter
from docling.document_converter import DocumentConverter

writer = PdfWriter()
# Assuming start_page and end_page are 1-indexed; pypdf page indices are 0-indexed
writer.append("file.pdf", pages=(start_page - 1, end_page))

with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp_file:
    writer.write(tmp_file)
    tmp_path = tmp_file.name

resp = DocumentConverter().convert(tmp_path).document
# Make sure to add a mechanism to delete the temp file after finishing with it.
```
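One way to handle that cleanup (a sketch layered on the snippet above, not part of the original workaround) is a try/finally around the conversion:

```python
import os

try:
    resp = DocumentConverter().convert(tmp_path).document
finally:
    os.remove(tmp_path)  # always delete the temp PDF, even if conversion fails
```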
@Ouassim-Hamdani thanks for reporting and for checking the Copilot PR. I also don't believe it solved the problem and will look into it myself now.
@Ouassim-Hamdani I think there might be a different problem here. I was checking with a large document (https://api.printnode.com/static/test/pdf/a4_500_pages.pdf) and I get the correct pages out (range 30-35), but in the wrong order:
```
## Page 33 / 500
## Page 34 / 500
## Page 35 / 500
Page 30 / 500
## Page 31 / 500
## Page 32 / 500
```
@Ouassim-Hamdani the actual fix this needs is here: https://github.com/docling-project/docling-ibm-models/pull/141
It is hard to understand how this bug did not surface anywhere earlier...