Runtime Error since v.2.34.0 related to OSD detection
Bug
Since v.2.34.0 I can observe the following error, presumably caused by the implementation of https://github.com/docling-project/docling/pull/1167:
...
File ...\docling\models\tesseract_ocr_model.py:105, in TesseractOcrModel.__init__(self, enabled, artifacts_path, options, accelerator_options)
    101 else:
    102     self.reader = tesserocr.PyTessBaseAPI(
    103         **{"lang": lang} | tesserocr_kwargs,
    104     )
--> 105 self.osd_reader = tesserocr.PyTessBaseAPI(
    106     **{"lang": "osd", "psm": tesserocr.PSM.OSD_ONLY} | tesserocr_kwargs
    107 )
    108 self.reader_RIL = tesserocr.RIL
File tesserocr\\tesserocr.pyx:1287, in tesserocr.tesserocr.PyTessBaseAPI.__cinit__()
File tesserocr\\tesserocr.pyx:1311, in tesserocr.tesserocr.PyTessBaseAPI._init_api()
RuntimeError: Failed to init API, possibly an invalid tessdata path: .../tessdata
Note that .../tessdata contains the relevant tessdata language files (the error did not occur with v.2.33.0). Presumably, what I am missing right now is the script file needed for OSD detection: osd.traineddata.
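A quick way to confirm this guess is to check whether the file is present in the tessdata directory. The snippet below is only a sketch: `has_traineddata` is a hypothetical helper name, not part of docling or tesserocr, and it relies solely on Tesseract's `<lang>.traineddata` file naming.

```python
from pathlib import Path


def has_traineddata(tessdata_dir: str, lang: str) -> bool:
    """Return True if <tessdata_dir>/<lang>.traineddata exists as a file."""
    return (Path(tessdata_dir) / f"{lang}.traineddata").is_file()
```

Calling `has_traineddata(path, "osd")` with the same path passed to `TesseractOcrOptions` should return False if the missing OSD file is indeed the cause.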
Ideally, you should also be able to use tesserocr without having the osd tessdata file and then simply skip automatic orientation detection.
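A rough sketch of the fallback I have in mind, assuming the failed OSD initialisation surfaces as a RuntimeError as in the traceback above. `init_optional` is a hypothetical helper, not docling's actual code or fix; the caller would skip orientation detection whenever it returns None.

```python
import logging
from typing import Callable, Optional, TypeVar

T = TypeVar("T")
_log = logging.getLogger(__name__)


def init_optional(factory: Callable[[], T]) -> Optional[T]:
    """Call factory(); return None instead of failing on RuntimeError."""
    try:
        return factory()
    except RuntimeError as exc:
        # Hypothetical behaviour: warn once and continue without OSD.
        _log.warning("OSD reader unavailable, skipping orientation detection: %s", exc)
        return None


# Intended use (requires tesserocr, so it is left as a comment here):
# osd_reader = init_optional(
#     lambda: tesserocr.PyTessBaseAPI(lang="osd", psm=tesserocr.PSM.OSD_ONLY)
# )
```

The main OCR reader would still be created unconditionally, so a missing or invalid tessdata path for the requested languages keeps raising as before.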
Steps to reproduce
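from docling.datamodel.pipeline_options import PdfPipelineOptions, TesseractOcrOptions

pipeline_options = PdfPipelineOptions()
pipeline_options.ocr_options = TesseractOcrOptions(
    lang=["eng"],
    force_full_page_ocr=False,
    bitmap_area_threshold=0.05,
    path=".../tessdata",
)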
Docling version
docling 2.34.0
docling-core 2.31.1
docling-ibm-models 3.4.1
docling-parse 4.0.1
I also encounter OSD-related errors when using Tesseract through docling. I noticed there is a code path where _perform_osd throws a subprocess.CalledProcessError and, if _is_auto is False, the page is not skipped. Afterwards, _run_tesseract(fname, df_osd) is called with the uninitialised df_osd variable, which leads to a crash with:
UnboundLocalError: cannot access local variable 'df_osd' where it is not associated with a value
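To illustrate the pattern, here is a minimal, self-contained reproduction. All functions are hypothetical stand-ins that mirror the control flow described above; this is not docling's actual code. The "fixed" variant shows one possible remedy: initialising df_osd before the try block.

```python
import subprocess


def perform_osd(fname):
    # Stand-in for _perform_osd: always fails, as when OSD cannot run.
    raise subprocess.CalledProcessError(returncode=1, cmd=["tesseract", fname])


def run_tesseract(fname, df_osd):
    # Stand-in for _run_tesseract: just echoes its arguments.
    return (fname, df_osd)


def ocr_page_buggy(fname, is_auto):
    try:
        df_osd = perform_osd(fname)
    except subprocess.CalledProcessError:
        if is_auto:
            return None  # page is skipped
    # If OSD failed and is_auto is False, df_osd was never assigned.
    return run_tesseract(fname, df_osd)


def ocr_page_fixed(fname, is_auto):
    df_osd = None  # default used when OSD fails
    try:
        df_osd = perform_osd(fname)
    except subprocess.CalledProcessError:
        if is_auto:
            return None
    return run_tesseract(fname, df_osd)
```

With `is_auto=False`, `ocr_page_buggy` raises UnboundLocalError, while `ocr_page_fixed` proceeds with `df_osd=None`. The downstream code would of course need to accept None for this to be a complete fix.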
I have the same issue for the same version
+1
I am still getting this error.
@cau-git Any chance this will get fully resolved soon? It still impedes the use of Tesseract as OCR Engine in version 2.34.0 and later...
This PR #1866 should address the issue.