
IndexError while processing a PDF file

Open tomasamenezes opened this issue 10 months ago • 1 comments

Bug

...

Steps to reproduce

Just run the code below in Google Colab (or VS Code):

from docling.document_converter import DocumentConverter

source = "LAB_CLINICAL_EXAMS.pdf"

converter = DocumentConverter()
result = converter.convert(source)
print(result.document.export_to_markdown())

...

Docling version

From Colab:

Docling version: 2.23.0
Docling Core version: 2.19.1
Docling IBM Models version: 3.3.2
Docling Parse version: 3.3.1
Python: cpython-311 (3.11.11)
Platform: Linux-6.1.85+-x86_64-with-glibc2.35

...

Python version

Python from Google COLAB: Python: cpython-311 (3.11.11)

...

ATTENTION: The PDF is a sensitive file (not included), received from a clinical analysis laboratory.

tomasamenezes, Feb 17 '25 19:02

WARNING:docling.pipeline.base_pipeline:Encountered an error during conversion of document 79293b249882e9d825d69880635178dc311c3534871a9a582a8be14e35fb9467:
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/dist-packages/docling/pipeline/base_pipeline.py", line 163, in _build_document
    for p in pipeline_pages:  # Must exhaust!
  File "/usr/local/lib/python3.11/dist-packages/docling/pipeline/base_pipeline.py", line 127, in _apply_on_pages
    yield from page_batch
  File "/usr/local/lib/python3.11/dist-packages/docling/models/page_assemble_model.py", line 60, in __call__
    for page in page_batch:
  File "/usr/local/lib/python3.11/dist-packages/docling/models/table_structure_model.py", line 178, in __call__
    for page in page_batch:
  File "/usr/local/lib/python3.11/dist-packages/docling/models/layout_model.py", line 146, in __call__
    for page in page_batch:
  File "/usr/local/lib/python3.11/dist-packages/docling/models/easyocr_model.py", line 127, in __call__
    for page in page_batch:
  File "/usr/local/lib/python3.11/dist-packages/docling/models/page_preprocessing_model.py", line 25, in __call__
    for page in page_batch:
  File "/usr/local/lib/python3.11/dist-packages/docling/pipeline/standard_pdf_pipeline.py", line 229, in initialize_page
    page._backend = conv_res.input._backend.load_page(page.page_no)  # type: ignore
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/docling/backend/docling_parse_v2_backend.py", line 239, in load_page
    return DoclingParseV2PageBackend(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/docling/backend/docling_parse_v2_backend.py", line 27, in __init__
    parsed_page = parser.parse_pdf_from_key_on_page(document_hash, page_no)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
IndexError: basic_string::at: __n (which is 1) >= this->size() (which is 1)


IndexError                                Traceback (most recent call last)
in <cell line: 0>()
      9
     10 converter = DocumentConverter()
---> 11 result = converter.convert(source)
     12 print(result.document.export_to_markdown())
     13 # result.document.save_as_markdown()

19 frames
/usr/local/lib/python3.11/dist-packages/docling/backend/docling_parse_v2_backend.py in __init__(self, parser, document_hash, page_no, page_obj)
     25     ):
     26         self._ppage = page_obj
---> 27         parsed_page = parser.parse_pdf_from_key_on_page(document_hash, page_no)
     28
     29         self.valid = "pages" in parsed_page and len(parsed_page["pages"]) == 1

IndexError: basic_string::at: __n (which is 1) >= this->size() (which is 1)
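
For anyone hitting this in a batch job: the traceback shows the parser's IndexError escaping `converter.convert(source)`, so one failing PDF can stop the whole run. Below is a minimal sketch of a guard around that call. It does not fix the parser bug, it only lets the remaining files proceed. The helper name `convert_or_none` is hypothetical, and `converter` is assumed to be any object with a `.convert(source)` method, such as the `DocumentConverter` instance used in the report.

```python
import logging
from typing import Any, Optional

logger = logging.getLogger(__name__)


def convert_or_none(converter: Any, source: str) -> Optional[Any]:
    """Return converter.convert(source), or None if it raises IndexError.

    Hypothetical helper: `converter` is assumed to expose .convert(source),
    as the DocumentConverter in the report does.
    """
    try:
        return converter.convert(source)
    except IndexError as exc:
        # Same exception type as in the traceback above; log it and move on.
        logger.warning("Skipping %s: %s", source, exc)
        return None
```

In a loop over many PDFs, a `None` result marks the files worth attaching (when they are not sensitive) to a report like this one.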

tomasamenezes, Feb 17 '25 19:02

I hit the same issue with a few scientific documents, for instance:

https://openaccess.thecvf.com/content/CVPR2022/papers/Huang_Weakly-Supervised_Metric_Learning_With_Cross-Module_Communications_for_the_Classification_of_CVPR_2022_paper.pdf

Docling version: 2.31.1
Docling Core version: 2.30.0
Docling IBM Models version: 3.4.3
Docling Parse version: 4.0.1
Python: cpython-312 (3.12.8)
Platform: macOS-14.7.1-arm64-arm-64bit

ceberam, May 13 '25 09:05

Got the error as well on Ubuntu with this file.

PierreMesure, Jun 23 '25 05:06

I got a similar error:

  File "XXX/.venv/lib/python3.12/site-packages/docling_parse/pdf_parser.py", line 129, in get_page
    doc_dict = self._parser.parse_pdf_from_key_on_page(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
IndexError: basic_string

My script:

from pathlib import Path
from tqdm import trange
import logging
from docling_parse import pdf_parser

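def debug_page_loop(pdf_path: Path):
    """Load each page on its own and log every page that fails to parse."""
    parser = pdf_parser.DoclingPdfParser()
    doc = parser.load(str(pdf_path))
    total = doc.number_of_pages()
    # Probe pages 1..total; the logged numbers use the same numbering.
    for p in trange(1, total + 1, desc="probing pages"):
        try:
            doc.get_page(p, create_words=False, create_textlines=False)
        except Exception as e:
            logging.error("Crash at page %s: %s", p, e)


if __name__ == "__main__":
    # Replace the placeholder with the path to the PDF under test.
    debug_page_loop(Path("PATH_TO_FILE.pdf"))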

When tested on this file (https://cdn.clinicaltrials.gov/large-docs/27/NCT01871727/Prot_000.pdf), it returns:

ERROR:root:Crash at page 5: basic_string
ERROR:root:Crash at page 9: basic_string
ERROR:root:Crash at page 12: basic_string
ERROR:root:Crash at page 13: basic_string
ERROR:root:Crash at page 14: basic_string
ERROR:root:Crash at page 15: basic_string
ERROR:root:Crash at page 16: basic_string
ERROR:root:Crash at page 36: basic_string
ERROR:root:Crash at page 37: basic_string
ERROR:root:Crash at page 38: basic_string
ERROR:root:Crash at page 46: basic_string
ERROR:root:Crash at page 50: basic_string
ERROR:root:Crash at page 57: basic_string
ERROR:root:Crash at page 60: basic_string
ERROR:root:Crash at page 68: basic_string
ERROR:root:Crash at page 79: basic_string
ERROR:root:Crash at page 81: basic_string
ERROR:root:Crash at page 83: basic_string
ERROR:root:Crash at page 103: basic_string
ERROR:root:Crash at page 111: basic_string
ERROR:root:Crash at page 112: basic_string
ERROR:root:Crash at page 113: basic_string
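
A long list like this is easier to compare across files or versions once it is collapsed into page ranges. Here is a small sketch that does that, assuming the log lines keep the `Crash at page N:` wording produced by the script above. The function names `failing_pages` and `to_ranges` are my own, not part of docling.

```python
import re
from typing import Iterable, List, Tuple

# Matches the message format logged by the probing script above.
CRASH_RE = re.compile(r"Crash at page (\d+):")


def failing_pages(log_lines: Iterable[str]) -> List[int]:
    """Extract the sorted, de-duplicated page numbers from the log lines."""
    pages = set()
    for line in log_lines:
        match = CRASH_RE.search(line)
        if match:
            pages.add(int(match.group(1)))
    return sorted(pages)


def to_ranges(pages: List[int]) -> List[Tuple[int, int]]:
    """Collapse sorted, unique page numbers into inclusive (start, end) runs."""
    ranges: List[Tuple[int, int]] = []
    for page in pages:
        if ranges and page == ranges[-1][1] + 1:
            # Extend the current run by one page.
            ranges[-1] = (ranges[-1][0], page)
        else:
            ranges.append((page, page))
    return ranges


if __name__ == "__main__":
    sample = [
        "ERROR:root:Crash at page 5: basic_string",
        "ERROR:root:Crash at page 9: basic_string",
        "ERROR:root:Crash at page 12: basic_string",
        "ERROR:root:Crash at page 13: basic_string",
        "ERROR:root:Crash at page 14: basic_string",
    ]
    print(to_ranges(failing_pages(sample)))  # [(5, 5), (9, 9), (12, 14)]
```

Applied to the full log above, this separates the isolated failures (5, 9, 46, ...) from the consecutive runs (12 to 16, 36 to 38, 111 to 113).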

Docling versions (latest at this moment):

docling                                  2.43.0
docling-core                             2.44.1
docling-ibm-models                       3.9.0
docling-parse                            4.1.0
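
Since the error shows up across several releases in this thread, it helps to attach the exact installed versions to each report. Here is a sketch using only the standard library's `importlib.metadata`. The package names are the four listed above, and the helper name `installed_versions` is my own.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Dict, Iterable

# Distribution names as they appear in the listing above.
PACKAGES = ("docling", "docling-core", "docling-ibm-models", "docling-parse")


def installed_versions(names: Iterable[str] = PACKAGES) -> Dict[str, str]:
    """Map each distribution name to its installed version string."""
    found: Dict[str, str] = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # Keep the entry so a missing package is visible in the report.
            found[name] = "not installed"
    return found


if __name__ == "__main__":
    for name, ver in installed_versions().items():
        print(f"{name:<40} {ver}")
```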

KarolGongolaCledar, Aug 06 '25 10:08