docling Issue with reading downloading model


Open harinisri2001 opened this issue 11 months ago • 3 comments

@dolfim-ibm I am downloading a model in the Dockerfile using the following command:

RUN python -c "from docling.pipeline.standard_pdf_pipeline import StandardPdfPipeline; StandardPdfPipeline.download_models_hf(force=True, local_dir='/app/python/rag/resources/artifacts/')"

The model is successfully downloaded to the specified location. Despite this, my component still attempts to download the model again the first time it is used.

Dec 12 '24 11:12 harinisri2001

Could it be that you are downloading the OCR models? do_ocr is a parameter in your code; can you please try once more after making sure it is set to False?
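A minimal sketch of that check, assuming the docling v2 Python API (PdfPipelineOptions, DocumentConverter, PdfFormatOption). The artifacts_path argument is not mentioned in this thread; it is included here as an assumption about how the converter is pointed at the directory populated in the Dockerfile. The helper names has_prefetched_files and build_converter are hypothetical.

```python
from pathlib import Path

# Directory used by the Dockerfile command earlier in this thread.
ARTIFACTS_DIR = Path("/app/python/rag/resources/artifacts/")


def has_prefetched_files(artifacts_dir: Path) -> bool:
    """Return True if the directory exists and contains at least one file."""
    return artifacts_dir.is_dir() and any(
        p.is_file() for p in artifacts_dir.rglob("*")
    )


def build_converter(artifacts_dir: Path = ARTIFACTS_DIR):
    """Build a PDF converter with OCR disabled, reading models from artifacts_dir."""
    # Imported lazily so the helper above can be used without docling installed.
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import PdfPipelineOptions
    from docling.document_converter import DocumentConverter, PdfFormatOption

    options = PdfPipelineOptions(
        artifacts_path=artifacts_dir,  # assumption: not mentioned in the thread
        do_ocr=False,  # skip OCR so no OCR models are fetched
    )
    return DocumentConverter(
        format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=options)}
    )
```

Calling has_prefetched_files(ARTIFACTS_DIR) at container start is a cheap way to confirm the build step actually populated the directory before build_converter() is used.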

Dec 12 '24 12:12 dolfim-ibm

@dolfim-ibm As with the layout and tableformer models, is there a way to prefetch the OCR models as well?

Dec 12 '24 12:12 harinisri2001

@dolfim-ibm As with the layout and tableformer models, is there a way to prefetch the OCR models as well?

It depends on the specific OCR engine. As a non-exhaustive summary:

- Tesseract requires the models to be installed as system packages.
- EasyOCR has options for disabling the model download and for setting the model path; see https://ds4sd.github.io/docling/reference/pipeline_options/#docling.datamodel.pipeline_options.EasyOcrOptions
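For the EasyOCR case, a rough sketch of what those options could look like, assuming the download_enabled and model_storage_directory fields described on the linked EasyOcrOptions reference page. The directory path and the helper names are hypothetical, and the assumption that EasyOCR weights are stored as .pth files should be checked against your own download.

```python
from pathlib import Path

# Hypothetical location for EasyOCR weights copied into the image at build time.
EASYOCR_MODEL_DIR = Path("/app/python/rag/resources/easyocr/")


def list_easyocr_weights(model_dir: Path) -> list[str]:
    """Return the sorted names of .pth files directly inside model_dir."""
    if not model_dir.is_dir():
        return []
    return sorted(p.name for p in model_dir.glob("*.pth"))


def build_ocr_pipeline_options(model_dir: Path = EASYOCR_MODEL_DIR):
    """Build PDF pipeline options that use EasyOCR without downloading models."""
    # Imported lazily so the helper above can be used without docling installed.
    from docling.datamodel.pipeline_options import EasyOcrOptions, PdfPipelineOptions

    ocr_options = EasyOcrOptions(
        download_enabled=False,  # do not fetch weights at runtime
        model_storage_directory=str(model_dir),  # read weights from this path
    )
    return PdfPipelineOptions(do_ocr=True, ocr_options=ocr_options)
```

If list_easyocr_weights(EASYOCR_MODEL_DIR) comes back empty, disabling the download will most likely make the first OCR call fail rather than silently re-download, which makes the missing prefetch easier to spot.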

Dec 12 '24 12:12 dolfim-ibm

@harinisri2001 I hope your issue is addressed with the pointers from @dolfim-ibm. I will close this until further feedback.

Dec 18 '24 11:12 cau-git
