
Runtime Error since v.2.34.0 related to OSD detection

Open simonschoe opened this issue 7 months ago • 1 comments

Bug

Since v.2.34.0 I can observe the following error, presumably caused by the implementation of https://github.com/docling-project/docling/pull/1167:

...
File ...\docling\models\tesseract_ocr_model.py:105, in TesseractOcrModel.__init__(self, enabled, artifacts_path, options, accelerator_options)
    101 else:
    102     self.reader = tesserocr.PyTessBaseAPI(
    103         **{"lang": lang} | tesserocr_kwargs,
    104     )
--> 105 self.osd_reader = tesserocr.PyTessBaseAPI(
    106     **{"lang": "osd", "psm": tesserocr.PSM.OSD_ONLY} | tesserocr_kwargs
    107 )
    108 self.reader_RIL = tesserocr.RIL

File tesserocr\\tesserocr.pyx:1287, in tesserocr.tesserocr.PyTessBaseAPI.__cinit__()

File tesserocr\\tesserocr.pyx:1311, in tesserocr.tesserocr.PyTessBaseAPI._init_api()

RuntimeError: Failed to init API, possibly an invalid tessdata path: .../tessdata

Note that .../tessdata contains the relevant tessdata language files (i.e. the error did not occur with v.2.33.0). Presumably, what I am missing right now is the relevant script file for OSD detection: osd.traineddata.

Ideally, you should also be able to use tesserocr without having the osd tessdata file and then simply skip automatic orientation detection.
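A minimal sketch of the fallback suggested above, not docling's actual implementation. The helper names has_osd_data and init_or_none are hypothetical; the idea is only to treat the OSD reader as optional and skip orientation detection when it cannot be created.

```python
from pathlib import Path
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def has_osd_data(tessdata_dir: str) -> bool:
    """Return True if osd.traineddata is present in the given tessdata directory."""
    return (Path(tessdata_dir) / "osd.traineddata").is_file()


def init_or_none(factory: Callable[[], T]) -> Optional[T]:
    """Call factory(); return None instead of raising if it fails with RuntimeError.

    The traceback above shows a RuntimeError when the API cannot be
    initialised, so that is the only exception handled here.
    """
    try:
        return factory()
    except RuntimeError:
        return None


# Hypothetical usage (requires tesserocr; tessdata_dir is a placeholder):
#
#   osd_reader = init_or_none(
#       lambda: tesserocr.PyTessBaseAPI(
#           path=tessdata_dir, lang="osd", psm=tesserocr.PSM.OSD_ONLY
#       )
#   )
#   if osd_reader is None:
#       ...  # skip automatic orientation detection for this run
```

Catching the failure at construction time keeps the regular reader usable even when only the OSD data file is absent.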

Steps to reproduce

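from docling.datamodel.pipeline_options import PdfPipelineOptions, TesseractOcrOptions

pipeline_options = PdfPipelineOptions()
# The tessdata directory below holds eng.traineddata but no osd.traineddata
pipeline_options.ocr_options = TesseractOcrOptions(
    lang=["eng"],
    force_full_page_ocr=False,
    bitmap_area_threshold=0.05,
    path=".../tessdata",
)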

Docling version

docling 2.34.0
docling-core 2.31.1
docling-ibm-models 3.4.1
docling-parse 4.0.1

simonschoe • May 25 '25 06:05

I also encounter OSD errors when using Tesseract from docling. I noticed there is a path where _perform_osd throws a subprocess.CalledProcessError and, if _is_auto is False, the page isn't skipped. Afterwards, _run_tesseract(fname, df_osd) is called with the uninitialised df_osd variable, which leads to a crash with:

UnboundLocalError: cannot access local variable 'df_osd' where it is not associated with a value
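The failure mode described above can be reproduced in isolation. The following is a simplified sketch with hypothetical stand-in functions (perform_osd, run_tesseract, process_page_buggy, process_page_fixed), not docling's code: the variable is only bound inside the try block, so a swallowed exception leaves it unbound.

```python
import subprocess


def perform_osd(fname: str) -> dict:
    """Hypothetical stand-in for an OSD call that always fails."""
    raise subprocess.CalledProcessError(returncode=1, cmd=["tesseract", fname])


def run_tesseract(fname: str, df_osd):
    """Hypothetical stand-in for the OCR call; just echoes its inputs."""
    return (fname, df_osd)


def process_page_buggy(fname: str, is_auto: bool):
    try:
        df_osd = perform_osd(fname)
    except subprocess.CalledProcessError:
        if is_auto:
            return None  # page is skipped
        # otherwise execution falls through with df_osd never assigned
    return run_tesseract(fname, df_osd)  # UnboundLocalError when is_auto is False


def process_page_fixed(fname: str, is_auto: bool):
    df_osd = None  # bind the name before the try block
    try:
        df_osd = perform_osd(fname)
    except subprocess.CalledProcessError:
        if is_auto:
            return None
    return run_tesseract(fname, df_osd)
```

Initialising df_osd before the try block is one possible remedy; it assumes the downstream call can handle a missing OSD result.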

kelvan • May 27 '25 15:05

I have the same issue with the same version.

LuRe97 • Jun 02 '25 15:06

+1

igorsekulic • Jun 05 '25 12:06

I am still getting this error.

JoaoPedroMBiofy • Jun 24 '25 13:06

@cau-git Any chance this will get fully resolved soon? It still prevents using Tesseract as the OCR engine in version 2.34.0 and later.

simonschoe • Jun 28 '25 19:06

PR #1866 should address the issue.

nikos-livathinos • Jun 30 '25 09:06