`page_range` parameter stops prematurely at page 32 when starting from page 30+
Bug
The `page_range` parameter in `DocumentConverter.convert()` does not extract the full requested range when the range falls in the 30s-40s region. The conversion stops prematurely at page 32 instead of continuing to the specified end page.
Pattern observed:
- `page_range=(1, 45)` → Works correctly, extracts pages 1-45
- `page_range=(30, 35)` → Stops at page 32, extracts pages 30-32 instead of 30-35
- `page_range=(30, 45)` → Stops at page 32, extracts pages 30-32 instead of 30-45
This suggests the issue occurs when:
- Start page is 30 or higher, AND
- The range is requested to go beyond page 32
- Or it may be related to the total number of pages processed
- I suspected a `sys.maxsize` issue, but ruled that out
The conversion appears to stop at the end of page 32, regardless of the requested end page.
Steps to reproduce
- Use a PDF with at least 50 pages
- Test these page ranges:
```python
from docling.document_converter import DocumentConverter

converter = DocumentConverter()

# Works fine
result1 = converter.convert("test.pdf", page_range=(1, 45)).document.export_to_markdown()
print(result1)
# Manually checking the final lines' content in the PDF file

# Fails - stops at page 32
result2 = converter.convert("test.pdf", page_range=(30, 35)).document.export_to_markdown()
print(result2)
# Manually checking the final lines' content in the PDF file

# Fails - stops at page 32
result3 = converter.convert("test.pdf", page_range=(30, 45)).document.export_to_markdown()
print(result3)
# Manually checking the final lines' content in the PDF file
```
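For a programmatic check instead of eyeballing the markdown, a small sketch like this can report which pages actually came back. It assumes `ConversionResult` exposes a `pages` list whose items carry a `page_no` attribute; the exact attribute names and page-number base (0 or 1) may differ across docling versions, so adjust accordingly.

```python
from docling.document_converter import DocumentConverter

def check_range(path: str, start: int, end: int) -> None:
    """Report which pages actually came back for a requested page_range."""
    result = DocumentConverter().convert(path, page_range=(start, end))
    # Assumption: result.pages holds one entry per converted page with a
    # page_no attribute; numbering may be 0- or 1-based depending on the
    # docling version, so read the printed output accordingly.
    got = sorted(p.page_no for p in result.pages)
    print(f"requested pages {start}-{end}, got: {got}")

check_range("test.pdf", 30, 45)  # with this bug, the list ends at 32
```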
Environment
- Python 3.11.13
- UV for package management
- docling 2.56.1
Don't hesitate to ask for extra information/context :)
Hi @Ouassim-Hamdani! I'm Dosu and I'm helping the docling team.
This looks like a bug in the threaded PDF pipeline: the draining batch size is hardcoded to 32, and if the pipeline or backend closes the output queue early, any pages beyond that batch can be dropped. There’s no explicit logic in the page range handling to stop at 32, so the cutoff is likely a side effect of how batches are processed and how the backend signals completion. This is not a documented or intended limitation.
If you need a workaround, try extracting a range that starts before page 30 (for example, 25–45) and see if you get more pages, or split your extraction into smaller chunks (e.g., 30–32, 33–35, etc.) and combine the results. If you’re able to test with a different backend or pipeline (such as a non-threaded one), that may also avoid the issue.
The root cause is likely in the threaded pipeline’s batch draining logic or the backend’s signaling, not in the page_range parameter itself. If you need more detail on the backend or want to try a code workaround, let me know!
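To make the suspected mechanism concrete, here is a minimal, self-contained sketch of that failure mode. This is not docling's actual code, and all names below are illustrative assumptions: pages flow through an output queue, the consumer drains until it sees a completion sentinel, and if that sentinel arrives after the first 32 pages, everything beyond page 32 is silently dropped.

```python
import queue

BATCH_SIZE = 32  # analogous to the hardcoded drain chunk in the pipeline

def run_pipeline(total_pages: int, closes_early: bool) -> list[int]:
    # Illustrative stand-in for the pipeline's output queue, not docling code.
    q: "queue.Queue[int | None]" = queue.Queue()
    for page in range(1, total_pages + 1):
        q.put(page)
        if closes_early and page == BATCH_SIZE:
            break  # the backend signals completion too soon
    q.put(None)  # completion sentinel ("queue closed")

    drained: list[int] = []
    while (item := q.get()) is not None:
        drained.append(item)
    return drained

pages = run_pipeline(total_pages=45, closes_early=True)
# Applying a requested range of 30-45 on top of this yields only 30-32:
print([p for p in pages if 30 <= p <= 45])  # -> [30, 31, 32]
```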
Hey, I've checked Copilot's suggested solution and implemented it in my environment. Unfortunately, it fails to solve the issue at hand.
Thanks for testing that out. Based on the latest code, the root cause is a hardcoded batch size of 32 in the threaded PDF pipeline's draining logic. When you request a page range starting at 30 or higher, the pipeline only processes up to 32 pages before stopping, regardless of your specified end page. This isn't an intended limitation—it's a side effect of how batches are drained and how the output queue is closed if the pipeline can't keep up, which matches exactly what you're seeing (source).
Could you share what Copilot suggested? That way, I can avoid repeating advice and help you find a workaround or patch that actually addresses the batch size issue. If you have any logs or error messages from your last run, those would also help pinpoint if the queue is closing early or if something else is going on.
Copilot's suggestions from the active merge request:

[docling/experimental/pipeline/threaded_layout_vlm_pipeline.py]

```diff
- batch_size = 32
+ batch_size = total_pages
```

[docling/pipeline/standard_pdf_pipeline.py]

```diff
- batch_size: int = 32  # drain chunk
+ batch_size: int = total_pages  # drain chunk
```
Changing batch_size from 32 to total_pages in the pipeline files doesn't resolve the premature stopping at page 32. The draining logic is designed to process batches from the output queue, but if the queue closes early (due to upstream issues or signaling bugs), pages beyond the first batch can be dropped—regardless of batch size. Setting batch_size to total_pages just means the pipeline tries to drain all pages at once, which can cause memory spikes and latency for large documents, but doesn't fix the underlying queue management problem. The real issue is how and when the output queue closes, not the batch size itself (code reference).
A more robust fix would involve reviewing the queue closing logic and ensuring all requested pages are processed before the queue signals completion. For now, splitting your extraction into smaller page ranges or using a non-threaded/legacy pipeline may help as a workaround. If you want to dig deeper, focus on the output queue's lifecycle and how the pipeline signals early termination.
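For completeness, the chunked workaround could look something like the sketch below. This is only an illustration of the suggestion above, with an arbitrary chunk size; given the cutoff behavior you reported, whether sub-ranges starting past page 32 actually come back is exactly what you'd want to verify.

```python
from docling.document_converter import DocumentConverter

def convert_in_chunks(path: str, start: int, end: int, chunk_size: int = 3) -> str:
    """Convert pages [start, end] in small sub-ranges and join the markdown."""
    converter = DocumentConverter()
    parts = []
    for lo in range(start, end + 1, chunk_size):
        hi = min(lo + chunk_size - 1, end)  # clamp the last chunk to `end`
        result = converter.convert(path, page_range=(lo, hi))
        parts.append(result.document.export_to_markdown())
    return "\n\n".join(parts)

print(convert_in_chunks("test.pdf", 30, 45))
```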
For anyone else going through the same trouble, here is a temporary workaround: create a temporary PDF containing only the desired pages, then parse that file in full.
```python
import tempfile

from pypdf import PdfWriter
from docling.document_converter import DocumentConverter

writer = PdfWriter()
# Assuming start_page and end_page are 1-indexed; pypdf page indices are 0-indexed
writer.append("file.pdf", pages=(start_page - 1, end_page))

with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp_file:
    writer.write(tmp_file)
    tmp_path = tmp_file.name

resp = DocumentConverter().convert(tmp_path).document
# Make sure to add a mechanism to delete the temp file after finishing with it.
```
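One way to handle that cleanup (a sketch layered on the snippet above, not part of the original workaround) is a try/finally around the conversion:

```python
import os

try:
    resp = DocumentConverter().convert(tmp_path).document
finally:
    os.remove(tmp_path)  # always delete the temp PDF, even if conversion fails
```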
@Ouassim-Hamdani thanks for reporting and for checking the Copilot PR. I also don't believe it solved the problem and will look into it myself now.
@Ouassim-Hamdani I think there might be a different problem here. I was checking with a large document (https://api.printnode.com/static/test/pdf/a4_500_pages.pdf) and I get the correct pages out (range 30-35), but in the wrong order:
```
## Page 33 / 500
## Page 34 / 500
## Page 35 / 500
Page 30 / 500
## Page 31 / 500
## Page 32 / 500
```
@Ouassim-Hamdani the actual fix this needs is here: https://github.com/docling-project/docling-ibm-models/pull/141
It is hard to understand how this bug did not surface anywhere earlier...