RuntimeError: Could not find the page-dimensions

Open lihomm opened this issue 2 months ago • 6 comments

Bug

When I run Docling on certain PDF files, it throws this runtime error for one specific PDF but not for the other PDFs I have tested. Unfortunately I cannot share the PDF since it contains medical and PnC information, ...

Steps to reproduce

Pipeline used:

        # 1b. Configure RapidOCR Options
        # --- Define models directory ---
        # Assumes models folder is at the root level (parent of app)
        models_dir = Path(__file__).parent.parent.parent / "models" / "rapidocr"
        models_dir.mkdir(parents=True, exist_ok=True) # Ensure directory exists
        logger.info(f"Ensuring RapidOCR models are in: {models_dir}")

        # --- Download RapidOCR models to the specified directory ---
        # This will only download if models aren't already in models_dir
        download_path = snapshot_download(
            repo_id="SWHL/RapidOCR",
            local_dir=models_dir, # Specify the target directory
            local_dir_use_symlinks=False # Recommended False on Windows for simplicity
        )
        logger.info(f"RapidOCR models location: {download_path}") # download_path will be models_dir

        # Setup RapidOcrOptions using paths relative to the download_path (which is models_dir)
        # Ensure these sub-paths (PP-OCRv4, PP-OCRv3) match the actual downloaded folder structure
        det_model_path = os.path.join(
            download_path, "PP-OCRv4", "en_PP-OCRv3_det_infer.onnx"
        )
        rec_model_path = os.path.join(
            download_path, "PP-OCRv3", "en_PP-OCRv3_rec_infer.onnx"
        )
        cls_model_path = os.path.join(
            download_path, "PP-OCRv3", "ch_ppocr_mobile_v2.0_cls_train.onnx"
        )

        # Optimized configuration for OCR
        rapidocr_options = RapidOcrOptions(
            det_model_path=det_model_path,
            rec_model_path=rec_model_path,
            cls_model_path=cls_model_path,
            lang=['english'],  # Focus on English
            text_score=0.6,    # Confidence threshold for text detection
            force_full_page_ocr=True,  # Process the whole page
            use_det=True,      # Enable text detection
            use_rec=True,      # Enable text recognition
            use_cls=True,      # Enable text orientation classification
            bitmap_area_threshold=0.03,  # Threshold for detecting small text areas
            print_verbose=True  # Enable debugging output
        )

        # 2. Configure Table Structure Options
        table_options = TableStructureOptions(
            mode=TableFormerMode.ACCURATE,
            do_cell_matching=True
        )

        # 3. Configure Accelerator Options
        accelerator_options = AcceleratorOptions(
            device=AcceleratorDevice.CUDA if torch.cuda.is_available() else AcceleratorDevice.CPU,
            num_threads=8,
            cuda_use_flash_attention2=torch.cuda.is_available(),
        )

        # 4. Configure Main Pipeline Options
        pipeline_options = PdfPipelineOptions(
            do_ocr=True,
            ocr_options=rapidocr_options,
            do_table_structure=True,
            table_structure_options=table_options,
            accelerator_options=accelerator_options,
            generate_page_images=True, # Keep True for visualizations
            generate_picture_images=True,
            create_legacy_output=True,
            document_timeout=300,
        )

        converter = DocumentConverter(
            format_options={
                InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
            }
        )

        data = converter.convert(str(file_path))
  1. Use the setup above and run the affected PDF through this pipeline.
  2. The conversion throws the error shown in the logs below.

...

Docling version

2025-11-04 02:49:17,500 - INFO - Loading plugin 'docling_defaults'
2025-11-04 02:49:17,501 - INFO - Registered ocr engines: ['easyocr', 'ocrmac', 'rapidocr', 'tesserocr', 'tesseract']
Docling version: 2.31.0
Docling Core version: 2.50.0
Docling IBM Models version: 3.10.2
Docling Parse version: 4.7.0
Python: cpython-310 (3.10.17)
Platform: Windows-10-10.0.19045-SP0
...

Python version

Python 3.10.17 ...

Logs:

[2025-11-04 00:45:41] [ERROR] docling.pipeline.standard_pdf_pipeline - standard_pdf_pipeline.py:349 - Stage preprocess failed for run 1: could not find the page-dimensions: {
    "/Contents": [
        "5 0 R [stream]"
    ],
    "/Parent": "[skipping /Parent]",
    "/Resources": {
        "/Font": {
            "/c": {
                "/BaseFont": "/AAAAAA+Arial,Bold",
                "/FirstChar": 32,
                "/FontDescriptor": {
                    "/Ascent": 905,
                    "/AvgWidth": 479,
                    "/CapHeight": 500,
                    "/Descent": -212,
                    "/Flags": 4,
                    "/FontBBox": [
                        -628,
                        -376,
                        2000,
                        1056
                    ],
                    "/FontFile2": "103 0 R [stream]",
                    "/FontName": "/AAAAAA+Arial,Bold",
                    "/ItalicAngle": 0,
                    "/Leading": 0,
                    "/MaxWidth": 2628,
                    "/MissingWidth": 479,
                    "/StemH": 0,
                    "/StemV": 0,
                    "/Type": "/FontDescriptor",
                    "/XHeight": 0
                },
                "/LastChar": 116,
                "/Subtype": "/TrueType",
                "/ToUnicode": "100 0 R [stream]",
                "/Type": "/Font",
                "/Widths": [
                    278,
                    0,
                    0,
                    0,
                    0,
                    0,
                    722,
                    0,
                    333,
                    333,
                    0,
                    0,
                    278,
                    333,
                    278,
                    278,
                    556,
                    556,
                    556,
                    556,
                    556,
                    556,
                    556,
                    556,
                    556,
                    556,
                    333,
                    0,
                    0,
                    0,
                    0,
                    0,
                    975,
                    722,
                    722,
                    722,
                    722,
                    667,
                    611,
                    778,
                    722,
                    278,
                    0,
                    722,
                    611,
                    833,
                    722,
                    778,
                    667,
                    778,
                    722,
                    667,
                    611,
                    722,
                    667,
                    944,
                    0,
                    667,
                    611,
                    0,
                    0,
                    0,
                    0,
                    0,
                    0,
                    556,
                    0,
                    0,
                    0,
                    0,
                    0,
                    0,
                    0,
                    0,
                    0,
                    0,
                    278,
                    0,
                    0,
                    611,
                    0,
                    0,
                    0,
                    0,
                    333
                ]
            },
            "/d": {
                "/BaseFont": "/AAAAAB+Arial",
                "/FirstChar": 32,
                "/FontDescriptor": {
                    "/Ascent": 905,
                    "/AvgWidth": 441,
                    "/CapHeight": 500,
                    "/Descent": -212,
                    "/Flags": 4,
                    "/FontBBox": [
                        -665,
                        -325,
                        2000,
                        1040
                    ],
                    "/FontFile2": "113 0 R [stream]",
                    "/FontName": "/AAAAAB+Arial",
                    "/ItalicAngle": 0,
                    "/Leading": 0,
                    "/MaxWidth": 2665,
                    "/MissingWidth": 441,
                    "/StemH": 0,
                    "/StemV": 0,
                    "/Type": "/FontDescriptor",
                    "/XHeight": 0
                },
                "/LastChar": 121,
                "/Subtype": "/TrueType",
                "/ToUnicode": "110 0 R [stream]",
                "/Type": "/Font",
                "/Widths": [
                    278,
                    278,
                    355,
                    0,
                    0,
                    889,
                    667,
                    0,
                    333,
                    333,
                    389,
                    584,
                    278,
                    333,
                    278,
                    278,
                    556,
                    556,
                    556,
                    556,
                    556,
                    556,
                    556,
                    556,
                    556,
                    556,
                    278,
                    0,
                    0,
                    0,
                    584,
                    0,
                    0,
                    667,
                    667,
                    722,
                    722,
                    667,
                    611,
                    778,
                    722,
                    278,
                    500,
                    667,
                    556,
                    833,
                    722,
                    778,
                    667,
                    778,
                    722,
                    667,
                    611,
                    722,
                    667,
                    944,
                    667,
                    667,
                    611,
                    278,
                    0,
                    278,
                    0,
                    0,
                    0,
                    556,
                    556,
                    500,
                    556,
                    556,
                    278,
                    556,
                    556,
                    222,
                    222,
                    0,
                    222,
                    833,
                    556,
                    556,
                    0,
                    0,
                    333,
                    500,
                    278,
                    556,
                    500,
                    722,
                    500,
                    500
                ]
            }
        },
        "/ProcSet": [
            "/PDF",
            "/Text",
            "/ImageB",
            "/ImageC",
            "/ImageI"
        ],
        "/XObject": {
            "/img0": "9 0 R [stream]"
        }
    },
    "/Type": "/Page"
}
...

lihomm avatar Nov 03 '25 18:11 lihomm

Hi @lihomm! I'm Dosu and I’m helping the docling team.

This error happens because Docling expects standard page dimension metadata in the PDF, but some PDFs (often those created with tools like macOS Quartz PDFContext or with custom page sizes) omit or misformat this information. When Docling can't extract the dimensions, it raises this runtime error and stops processing. This is a known issue and has also been reported with other custom-size or landscape PDFs.

Workarounds that have helped others include switching the PDF backend in your pipeline options to 'pypdfium2', which is more robust with unusual PDFs:

from docling.datamodel.pipeline_options import PdfPipelineOptions
pipeline_options = PdfPipelineOptions(pdf_backend='pypdfium2')

Note: Using this backend eliminates the error, but the conversion result may omit most of the text from the original document, converting it into an embedded image instead, so you may need to rely on OCR for text extraction.
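
A caveat on the snippet above: it is not certain that `PdfPipelineOptions` accepts a `pdf_backend` keyword. In the Docling 2.x examples, the backend class is passed to `PdfFormatOption` instead. A minimal sketch under that assumption follows; the helper name `build_pypdfium_converter` is made up for illustration, and the import paths should be checked against the installed Docling release.

```python
def build_pypdfium_converter(pipeline_options=None):
    """Return a DocumentConverter that parses PDFs with the pypdfium2 backend.

    Assumes a Docling 2.x install. Imports are local so that loading this
    module does not itself require Docling to be importable.
    """
    from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import PdfPipelineOptions
    from docling.document_converter import DocumentConverter, PdfFormatOption

    if pipeline_options is None:
        pipeline_options = PdfPipelineOptions()

    return DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(
                pipeline_options=pipeline_options,
                backend=PyPdfiumDocumentBackend,  # swap the PDF parsing backend
            )
        }
    )
```

With the pipeline from the issue description, this would be used as `converter = build_pypdfium_converter(pipeline_options)` followed by `converter.convert(str(file_path))`.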

If that doesn't work, converting the problematic PDF pages to images and running OCR (by setting do_ocr=True in your pipeline options) can bypass the dimension extraction step.

If you want to dig deeper, inspect the problematic PDF with tools like qpdf --check or pdfinfo to look for anomalies, or try extracting a single page or stripping content to isolate the trigger. This issue is backend-specific and has been fixed or mitigated in some cases by using a different backend or updating to the latest docling-parse version.
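
One check that needs no extra tooling is to read the page dictionary printed in the error log. The one in this issue has `/Contents`, `/Parent`, `/Resources` and `/Type`, but no box entries. The PDF specification allows `/MediaBox` and `/CropBox` to be inherited from a parent `/Pages` node, so a page without its own `/MediaBox` is not necessarily malformed, but a parser then has to follow `/Parent` to find the dimensions. The sketch below lists which box keys a dumped page dictionary defines directly; the function name and the trimmed sample dump are illustrative and not part of Docling.

```python
import json

# Page boundary keys defined by the PDF specification.
BOX_KEYS = ("/MediaBox", "/CropBox", "/BleedBox", "/TrimBox", "/ArtBox")


def direct_box_keys(page_dump: str) -> list[str]:
    """Return the box keys present directly in a JSON page-dictionary dump."""
    page = json.loads(page_dump)
    return [key for key in BOX_KEYS if key in page]


# Trimmed-down version of the dictionary from the log in this issue.
sample_dump = """
{
    "/Contents": ["5 0 R [stream]"],
    "/Parent": "[skipping /Parent]",
    "/Resources": {"/ProcSet": ["/PDF", "/Text"]},
    "/Type": "/Page"
}
"""

print(direct_box_keys(sample_dump))  # → []
```

An empty list suggests the dimensions are expected to come from a parent `/Pages` node rather than from the page itself.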


dosubot[bot] avatar Nov 03 '25 18:11 dosubot[bot]

Using pymupdf to force-set the mediabox and cropbox before feeding the PDF into Docling fixed this issue for me. You need to do it on all pages.

import pymupdf

doc = pymupdf.open(stream=pdf_bytes, filetype="pdf")
for page in doc:
    # Keep the cropbox when it has a positive area, otherwise fall back to the page rect
    fallback = page.cropbox if page.cropbox.width > 0 and page.cropbox.height > 0 else page.rect
    page.set_mediabox(fallback)
    page.set_cropbox(fallback)

liampetti avatar Nov 04 '25 08:11 liampetti

Thanks for the suggestions, I will try this out, but are there any solutions besides pymupdf, since it's under the AGPL license?

lihomm avatar Nov 04 '25 13:11 lihomm

It appears that this commit is the source of the issue: https://github.com/docling-project/docling-parse/pull/173

aaksac-benchsci avatar Nov 06 '25 01:11 aaksac-benchsci

> Thanks for the suggestions, I will try this out, but are there any solutions besides pymupdf, since it's under the AGPL license?

Maybe pypdf?

from pypdf import PdfReader, PdfWriter

reader = PdfReader(pdf_filename)
writer = PdfWriter()

for page in reader.pages:
    cropbox = page.cropbox
    if cropbox.width > 0 and cropbox.height > 0:
        fallback = cropbox
    else:
        fallback = page.mediabox

    page.mediabox = fallback
    page.cropbox = fallback
    writer.add_page(page)

with open(pdf_filename, 'wb') as output_file:
    writer.write(output_file)
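
Both workarounds in this thread apply the same rule: keep the crop box if it has a positive width and height, otherwise fall back to another box, then write that one box as both the media box and the crop box. Below is a library-independent sketch of just that rule, assuming boxes are `(x0, y0, x1, y1)` tuples in PDF points. The `choose_box` and `has_positive_area` helpers are hypothetical names, not part of pymupdf or pypdf.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in PDF points


def has_positive_area(box: Box) -> bool:
    """True when the box has strictly positive width and height."""
    x0, y0, x1, y1 = box
    return (x1 - x0) > 0 and (y1 - y0) > 0


def choose_box(cropbox: Box, fallback: Box) -> Box:
    """Prefer the crop box; use the fallback when the crop box is degenerate."""
    return cropbox if has_positive_area(cropbox) else fallback


letter = (0.0, 0.0, 612.0, 792.0)
print(choose_box((0.0, 0.0, 0.0, 0.0), letter))        # degenerate crop box, fallback is used
print(choose_box((10.0, 10.0, 300.0, 400.0), letter))  # valid crop box is kept
```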

aaksac-benchsci avatar Nov 06 '25 14:11 aaksac-benchsci

For more background on this issue, here are links to PDF files where I see the same behavior:

  • File 2382400.pdf within https://digitalcorpora.s3.amazonaws.com/corpora/files/CC-MAIN-2021-31-PDF-UNTRUNCATED/zipfiles/2000-2999/2382.zip
  • File 2028148.pdf within https://digitalcorpora.s3.amazonaws.com/corpora/files/CC-MAIN-2021-31-PDF-UNTRUNCATED/zipfiles/2000-2999/2028.zip
  • File 1068798.pdf within https://digitalcorpora.s3.amazonaws.com/corpora/files/CC-MAIN-2021-31-PDF-UNTRUNCATED/zipfiles/1000-1999/1068.zip

With message output like:

ERROR - Stage preprocess failed for run 1: could not find the page-dimensions:
...
INFO - Finished converting document 2028148.pdf in 101.57 sec.
WARNING - Document /var/folders/2r/b2sdj1512g1_0m7wzzy7sftr0000gn/T/tmpotozkvu3/2028148.pdf failed to convert.
INFO -   [Failure Detail] Component: DoclingComponentType.PIPELINE, Module: StandardPdfPipeline, Message: Page 1: could not find the page-dimensions:
...

Docling version: 2.63.0
Docling Core version: 2.52.0
Docling IBM Models version: 3.10.2
Docling Parse version: 4.7.1
Python: cpython-312 (3.12.10)
Platform: macOS-14.7.1-arm64-arm-64bit

ceberam avatar Nov 21 '25 13:11 ceberam