
Make RapidOcrOptions ignore `artifacts_path`

Open simonschoe opened this issue 2 months ago • 3 comments

Requested feature

Currently, when using RapidOCR, the initialization of RapidOcrModel considers artifacts_path. That is, it searches for model artifacts under artifacts_path as defined here:

  • https://github.com/docling-project/docling/blob/6a04e273528691eb22a5708f1270d4c5fa8f5b7c/docling/models/auto_ocr_model.py#L65-L74
  • https://github.com/docling-project/docling/blob/6a04e273528691eb22a5708f1270d4c5fa8f5b7c/docling/models/rapid_ocr_model.py#L125-L149

Once installed, rapidocr already ships with the base onnx models, available under ...\Lib\site-packages\rapidocr\models. Therefore, we usually do not need to download these models separately from Modelscope. However, when artifacts_path is set in the pipeline, e.g., to load the layout detection or table structure model, we are currently not able to load the default onnx models shipped with rapidocr.

@geoHeil Would it be possible to either a) deliberately skip a globally defined artifacts_path for RapidOCR, so that the shipped models are loaded and the download from Modelscope is skipped, or b) make loading from ...\Lib\site-packages\rapidocr\models the default?
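To illustrate the requested behavior, here is a minimal sketch of the fallback logic, not docling's actual implementation. The helper name `resolve_rapidocr_model_dir`, its parameters, and the `RapidOcr` subfolder name are all hypothetical; the caller is assumed to pass in the directory of the models bundled with rapidocr.

```python
from pathlib import Path
from typing import Optional


def resolve_rapidocr_model_dir(
    artifacts_path: Optional[Path],
    packaged_models_dir: Path,
    model_subfolder: str = "RapidOcr",  # assumed subfolder name, for illustration only
) -> Path:
    """Hypothetical helper: choose the directory to load RapidOCR onnx models from."""
    if artifacts_path is not None:
        candidate = Path(artifacts_path) / model_subfolder
        # Use the global artifacts folder only if it really contains onnx models.
        if candidate.is_dir() and any(candidate.rglob("*.onnx")):
            return candidate
    # Otherwise fall back to the models shipped with the rapidocr package.
    return packaged_models_dir
```

With this kind of check, setting artifacts_path for the layout or table structure model would not prevent RapidOCR from using its bundled models, while an explicitly populated RapidOCR folder under artifacts_path would still take precedence.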

simonschoe avatar Nov 04 '25 19:11 simonschoe

Sounds like it should be. Would you want to raise a PR?

geoHeil avatar Nov 04 '25 20:11 geoHeil

@geoHeil I was hoping you could step in. 😁 I would first have to set everything up to prepare a proper PR, and unfortunately I am not sure I can find the time at the moment.

simonschoe avatar Nov 04 '25 20:11 simonschoe

Hm, I am a bit short on time over the next few weeks as well.

geoHeil avatar Nov 04 '25 21:11 geoHeil