RuntimeError: Could not find the page-dimensions
Bug
When I test Docling, it throws this runtime error for one specific PDF but not for the other PDFs I have tested. Unfortunately, I cannot share the PDF since it contains medical and PnC information, ...
Steps to reproduce
Pipeline used:
# Imports inferred from the names used below; module paths follow recent Docling releases
import logging
import os
from pathlib import Path

import torch
from huggingface_hub import snapshot_download

from docling.datamodel.accelerator_options import AcceleratorDevice, AcceleratorOptions
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import (
    PdfPipelineOptions,
    RapidOcrOptions,
    TableFormerMode,
    TableStructureOptions,
)
from docling.document_converter import DocumentConverter, PdfFormatOption

logger = logging.getLogger(__name__)

# 1b. Configure RapidOCR Options
# --- Define models directory ---
# Assumes models folder is at the root level (parent of app)
models_dir = Path(__file__).parent.parent.parent / "models" / "rapidocr"
models_dir.mkdir(parents=True, exist_ok=True)  # Ensure directory exists
logger.info(f"Ensuring RapidOCR models are in: {models_dir}")

# --- Download RapidOCR models to the specified directory ---
# This will only download if models aren't already in models_dir
download_path = snapshot_download(
    repo_id="SWHL/RapidOCR",
    local_dir=models_dir,  # Specify the target directory
    local_dir_use_symlinks=False,  # Recommended False on Windows for simplicity
)
logger.info(f"RapidOCR models location: {download_path}")  # download_path will be models_dir

# Setup RapidOcrOptions using paths relative to the download_path (which is models_dir)
# Ensure these sub-paths (PP-OCRv4, PP-OCRv3) match the actual downloaded folder structure
det_model_path = os.path.join(
    download_path, "PP-OCRv4", "en_PP-OCRv3_det_infer.onnx"
)
rec_model_path = os.path.join(
    download_path, "PP-OCRv3", "en_PP-OCRv3_rec_infer.onnx"
)
cls_model_path = os.path.join(
    download_path, "PP-OCRv3", "ch_ppocr_mobile_v2.0_cls_train.onnx"
)

# Optimized configuration for OCR
rapidocr_options = RapidOcrOptions(
    det_model_path=det_model_path,
    rec_model_path=rec_model_path,
    cls_model_path=cls_model_path,
    lang=['english'],  # Focus on English
    text_score=0.6,  # Confidence threshold for text detection
    force_full_page_ocr=True,  # Process the whole page
    use_det=True,  # Enable text detection
    use_rec=True,  # Enable text recognition
    use_cls=True,  # Enable text orientation classification
    bitmap_area_threshold=0.03,  # Threshold for detecting small text areas
    print_verbose=True,  # Enable debugging output
)

# 2. Configure Table Structure Options
table_options = TableStructureOptions(
    mode=TableFormerMode.ACCURATE,
    do_cell_matching=True,
)

# 3. Configure Accelerator Options
accelerator_options = AcceleratorOptions(
    device=AcceleratorDevice.CUDA if torch.cuda.is_available() else AcceleratorDevice.CPU,
    num_threads=8,
    cuda_use_flash_attention2=torch.cuda.is_available(),
)

# 4. Configure Main Pipeline Options
pipeline_options = PdfPipelineOptions(
    do_ocr=True,
    ocr_options=rapidocr_options,
    do_table_structure=True,
    table_structure_options=table_options,
    accelerator_options=accelerator_options,
    generate_page_images=True,  # Keep True for visualizations
    generate_picture_images=True,
    create_legacy_output=True,
    document_timeout=300,
)

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)

# file_path: path to the PDF that triggers the error
data = converter.convert(str(file_path))
- Use the setup above and run the affected PDF through this pipeline.
- It throws the error shown in the logs below.
...
Docling version
2025-11-04 02:49:17,500 - INFO - Loading plugin 'docling_defaults'
2025-11-04 02:49:17,501 - INFO - Registered ocr engines: ['easyocr', 'ocrmac', 'rapidocr', 'tesserocr', 'tesseract']
Docling version: 2.31.0
Docling Core version: 2.50.0
Docling IBM Models version: 3.10.2
Docling Parse version: 4.7.0
Python: cpython-310 (3.10.17)
Platform: Windows-10-10.0.19045-SP0
...
Python version
Python 3.10.17 ...
Logs:
[2025-11-04 00:45:41] [ERROR] docling.pipeline.standard_pdf_pipeline - standard_pdf_pipeline.py:349 - Stage preprocess failed for run 1: could not find the page-dimensions: {
"/Contents": [
"5 0 R [stream]"
],
"/Parent": "[skipping /Parent]",
"/Resources": {
"/Font": {
"/c": {
"/BaseFont": "/AAAAAA+Arial,Bold",
"/FirstChar": 32,
"/FontDescriptor": {
"/Ascent": 905,
"/AvgWidth": 479,
"/CapHeight": 500,
"/Descent": -212,
"/Flags": 4,
"/FontBBox": [
-628,
-376,
2000,
1056
],
"/FontFile2": "103 0 R [stream]",
"/FontName": "/AAAAAA+Arial,Bold",
"/ItalicAngle": 0,
"/Leading": 0,
"/MaxWidth": 2628,
"/MissingWidth": 479,
"/StemH": 0,
"/StemV": 0,
"/Type": "/FontDescriptor",
"/XHeight": 0
},
"/LastChar": 116,
"/Subtype": "/TrueType",
"/ToUnicode": "100 0 R [stream]",
"/Type": "/Font",
"/Widths": [
278,
0,
0,
0,
0,
0,
722,
0,
333,
333,
0,
0,
278,
333,
278,
278,
556,
556,
556,
556,
556,
556,
556,
556,
556,
556,
333,
0,
0,
0,
0,
0,
975,
722,
722,
722,
722,
667,
611,
778,
722,
278,
0,
722,
611,
833,
722,
778,
667,
778,
722,
667,
611,
722,
667,
944,
0,
667,
611,
0,
0,
0,
0,
0,
0,
556,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
278,
0,
0,
611,
0,
0,
0,
0,
333
]
},
"/d": {
"/BaseFont": "/AAAAAB+Arial",
"/FirstChar": 32,
"/FontDescriptor": {
"/Ascent": 905,
"/AvgWidth": 441,
"/CapHeight": 500,
"/Descent": -212,
"/Flags": 4,
"/FontBBox": [
-665,
-325,
2000,
1040
],
"/FontFile2": "113 0 R [stream]",
"/FontName": "/AAAAAB+Arial",
"/ItalicAngle": 0,
"/Leading": 0,
"/MaxWidth": 2665,
"/MissingWidth": 441,
"/StemH": 0,
"/StemV": 0,
"/Type": "/FontDescriptor",
"/XHeight": 0
},
"/LastChar": 121,
"/Subtype": "/TrueType",
"/ToUnicode": "110 0 R [stream]",
"/Type": "/Font",
"/Widths": [
278,
278,
355,
0,
0,
889,
667,
0,
333,
333,
389,
584,
278,
333,
278,
278,
556,
556,
556,
556,
556,
556,
556,
556,
556,
556,
278,
0,
0,
0,
584,
0,
0,
667,
667,
722,
722,
667,
611,
778,
722,
278,
500,
667,
556,
833,
722,
778,
667,
778,
722,
667,
611,
722,
667,
944,
667,
667,
611,
278,
0,
278,
0,
0,
0,
556,
556,
500,
556,
556,
278,
556,
556,
222,
222,
0,
222,
833,
556,
556,
0,
0,
333,
500,
278,
556,
500,
722,
500,
500
]
}
},
"/ProcSet": [
"/PDF",
"/Text",
"/ImageB",
"/ImageC",
"/ImageI"
],
"/XObject": {
"/img0": "9 0 R [stream]"
}
},
"/Type": "/Page"
}
...
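For anyone triaging similar reports: the page dictionary dumped in the log has no `/MediaBox` or `/CropBox` entry of its own, and `/Parent` is skipped in the dump. A minimal sketch for checking a dumped dictionary for these keys, assuming the dump is valid JSON as shown above (the `sample` string is an abridged, hand-copied version of the log and `own_box_keys` is a hypothetical helper name):

```python
import json

# Page-box keys a parser would look for directly on the page dictionary.
BOX_KEYS = ("/MediaBox", "/CropBox")


def own_box_keys(page_dump: str) -> dict:
    """Report which box keys appear directly in a dumped page dictionary."""
    page = json.loads(page_dump)
    return {key: key in page for key in BOX_KEYS}


# Abridged copy of the dump from the log (font details removed).
sample = (
    '{"/Contents": ["5 0 R [stream]"], '
    '"/Parent": "[skipping /Parent]", '
    '"/Resources": {}, '
    '"/Type": "/Page"}'
)

print(own_box_keys(sample))
```

A result with both values `False` only says that the boxes are not stored on the page itself; it does not tell whether they are defined on an ancestor node.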
Hi @lihomm! I'm Dosu and I’m helping the docling team.
This error happens because Docling expects standard page dimension metadata in the PDF, but some PDFs (often those created with tools like macOS Quartz PDFContext or with custom sizes) omit or misformat this information. When Docling can't extract the dimensions, it raises this runtime error and stops processing. This is a known issue and has also been reported with other custom or landscape PDFs.
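Workarounds that have helped others include switching the PDF backend to pypdfium2, which is more robust with unusual PDFs. In the Python API the backend is set on the PDF format option rather than on PdfPipelineOptions:
from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend
from docling.datamodel.base_models import InputFormat
from docling.document_converter import DocumentConverter, PdfFormatOption
converter = DocumentConverter(format_options={InputFormat.PDF: PdfFormatOption(backend=PyPdfiumDocumentBackend)})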
Note: Using this backend eliminates the error, but the conversion result may omit most of the text from the original document by converting it into an embedded image, so you may need to rely on OCR for text extraction.
If that doesn't work, converting the problematic PDF pages to images and running OCR (by setting do_ocr=True in your pipeline options) can bypass the dimension extraction step.
If you want to dig deeper, inspect the problematic PDF with tools like qpdf --check or pdfinfo to look for anomalies, or try extracting a single page or stripping content to isolate the trigger. This issue is backend-specific and has been fixed or mitigated in some cases by using a different backend or updating to the latest docling-parse version.
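The inspection step can be scripted. The sketch below assumes `qpdf` and/or poppler's `pdfinfo` are installed and on `PATH`; `inspect_pdf` is a hypothetical helper name, and the flags used are `qpdf --check` (structural validation) and `pdfinfo -box` (prints the page boxes):

```python
import shutil
import subprocess
from typing import Optional, Tuple

# Arguments per tool: `qpdf --check` validates the file structure,
# `pdfinfo -box` prints the page boxes (MediaBox, CropBox, ...).
TOOL_ARGS = {"qpdf": ["--check"], "pdfinfo": ["-box"]}


def inspect_pdf(pdf_path: str, tool: str = "qpdf") -> Optional[Tuple[int, str]]:
    """Run `tool` on `pdf_path`.

    Returns (exit code, combined stdout/stderr), or None when the tool
    is unknown to this sketch or not installed.
    """
    if tool not in TOOL_ARGS:
        return None
    exe = shutil.which(tool)
    if exe is None:
        return None
    proc = subprocess.run(
        [exe, *TOOL_ARGS[tool], pdf_path],
        capture_output=True,
        text=True,
    )
    return proc.returncode, proc.stdout + proc.stderr
```

Calling `inspect_pdf("problem.pdf", "pdfinfo")` (with a placeholder path) would show whether the tool can report a MediaBox for the first page; a `None` result only means the tool is missing.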
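Using pymupdf to force-set the mediabox and cropbox before feeding the PDF into Docling fixed this issue for me. You need to do it on all pages.
import pymupdf

doc = pymupdf.open(stream=pdf_bytes, filetype="pdf")
for page in doc:
    # Use the cropbox when it has a positive size, otherwise fall back to the page rect
    fallback = page.cropbox if page.cropbox.width > 0 and page.cropbox.height > 0 else page.rect
    page.set_mediabox(fallback)
    page.set_cropbox(fallback)
fixed_pdf_bytes = doc.tobytes()  # serialize the patched PDF before passing it to Docling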
Thanks for the suggestions, I will try this out. Are there any solutions besides pymupdf, since it's under the AGPL license?
It appears that this commit is the source of the issue: https://github.com/docling-project/docling-parse/pull/173
As an alternative to pymupdf, maybe pypdf?
from pypdf import PdfReader, PdfWriter

reader = PdfReader(pdf_filename)
writer = PdfWriter()
for page in reader.pages:
    # Use the cropbox when it has a positive size, otherwise fall back to the mediabox
    cropbox = page.cropbox
    if cropbox.width > 0 and cropbox.height > 0:
        fallback = cropbox
    else:
        fallback = page.mediabox
    # Write both boxes explicitly onto the page
    page.mediabox = fallback
    page.cropbox = fallback
    writer.add_page(page)

# Overwrites the input file with the patched copy
with open(pdf_filename, 'wb') as output_file:
    writer.write(output_file)
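One possible reason why rewriting the boxes on every page helps (an assumption, since the dump in the original report skips `/Parent` and so cannot confirm it): the PDF specification allows `/MediaBox` and `/CropBox` to be inherited from ancestor page-tree nodes, so a page dictionary without its own `/MediaBox` can still be valid. The toy model below uses plain dicts, not a real PDF parser, and `resolve_box` is a hypothetical name; it only illustrates how a reader resolves an inherited box by walking up `/Parent`:

```python
from typing import Optional


def resolve_box(page: dict, key: str = "/MediaBox") -> Optional[list]:
    """Walk up the /Parent chain until `key` is found; None if it is absent everywhere."""
    node = page
    seen = set()  # guards against cyclic /Parent references in malformed files
    while isinstance(node, dict) and id(node) not in seen:
        seen.add(id(node))
        if key in node:
            return node[key]
        node = node.get("/Parent")
    return None


# Hypothetical page tree: the box is defined on the /Pages node only.
pages_node = {"/Type": "/Pages", "/MediaBox": [0, 0, 612, 792]}
page = {"/Type": "/Page", "/Parent": pages_node, "/Contents": []}

print(resolve_box(page))              # found on the parent node
print(resolve_box(page, "/CropBox"))  # not defined anywhere in this toy tree
```

Under that assumption, assigning the mediabox and cropbox per page, as in the snippets above, stores the resolved values directly on each page dictionary, so a parser that only looks at the page itself can find them.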
For more background on this issue, here are links to PDF files where I see the same behavior:
- File 2382400.pdf within https://digitalcorpora.s3.amazonaws.com/corpora/files/CC-MAIN-2021-31-PDF-UNTRUNCATED/zipfiles/2000-2999/2382.zip
- File 2028148.pdf within https://digitalcorpora.s3.amazonaws.com/corpora/files/CC-MAIN-2021-31-PDF-UNTRUNCATED/zipfiles/2000-2999/2028.zip
- File 1068798.pdf within https://digitalcorpora.s3.amazonaws.com/corpora/files/CC-MAIN-2021-31-PDF-UNTRUNCATED/zipfiles/1000-1999/1068.zip
With message output like:
ERROR - Stage preprocess failed for run 1: could not find the page-dimensions:
...
INFO - Finished converting document 2028148.pdf in 101.57 sec.
WARNING - Document /var/folders/2r/b2sdj1512g1_0m7wzzy7sftr0000gn/T/tmpotozkvu3/2028148.pdf failed to convert.
INFO - [Failure Detail] Component: DoclingComponentType.PIPELINE, Module: StandardPdfPipeline, Message: Page 1: could not find the page-dimensions:
...
Docling version: 2.63.0
Docling Core version: 2.52.0
Docling IBM Models version: 3.10.2
Docling Parse version: 4.7.1
Python: cpython-312 (3.12.10)
Platform: macOS-14.7.1-arm64-arm-64bit
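When running a larger batch, such as the corpus files listed above, it can help to collect the affected page numbers from the captured log. A minimal sketch, assuming the failure lines keep the format quoted above (`failing_pages` is a hypothetical helper and the pattern is derived only from that one sample line):

```python
import re
from typing import List

# Pattern taken from the "[Failure Detail]" line quoted in this thread.
FAILURE_RE = re.compile(r"Message: Page (\d+): could not find the page-dimensions")


def failing_pages(log_text: str) -> List[int]:
    """Return the page numbers reported with the page-dimensions failure."""
    return [int(match.group(1)) for match in FAILURE_RE.finditer(log_text)]


log_line = (
    "INFO - [Failure Detail] Component: DoclingComponentType.PIPELINE, "
    "Module: StandardPdfPipeline, Message: Page 1: could not find the page-dimensions:"
)
print(failing_pages(log_line))
```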