Issue with reading downloaded model

Open harinisri2001 opened this issue 11 months ago • 3 comments

@dolfim-ibm I am downloading a model in the Dockerfile using the following command:

RUN python -c "from docling.pipeline.standard_pdf_pipeline import StandardPdfPipeline; StandardPdfPipeline.download_models_hf(force=True, local_dir='/app/python/rag/resources/artifacts/')"

The model is successfully downloaded to the specified location. Despite this, my component still attempts to download the model again the first time it is used.

harinisri2001 • Dec 12 '24 11:12

Could it be that the OCR models are being downloaded? do_ocr is a parameter in your code; can you please try once more, making sure it is set to False?
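
A minimal sketch of that check, assuming a docling version from around the time of this thread in which PdfPipelineOptions exposes do_ocr and artifacts_path. The helper names artifacts_ready and build_converter are made up for illustration, and the path is the one from the Dockerfile command above.

```python
from pathlib import Path


def artifacts_ready(path: str) -> bool:
    """Return True if `path` is a directory containing at least one file."""
    p = Path(path)
    return p.is_dir() and any(f.is_file() for f in p.rglob("*"))


def build_converter(artifacts_path: str, do_ocr: bool = False):
    """Hypothetical helper: build a converter that reads models from disk."""
    # Fail early if the prefetched models are missing from the image.
    if not artifacts_ready(artifacts_path):
        raise FileNotFoundError(f"no model files found under {artifacts_path}")

    # docling is imported lazily so the path check above works without it.
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import PdfPipelineOptions
    from docling.document_converter import DocumentConverter, PdfFormatOption

    # Assumption: these two fields exist on PdfPipelineOptions in your version.
    options = PdfPipelineOptions(artifacts_path=artifacts_path, do_ocr=do_ocr)
    return DocumentConverter(
        format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=options)}
    )


# Example call (path taken from the Dockerfile command in this issue):
# converter = build_converter("/app/python/rag/resources/artifacts/")
```

If the download at first use disappears with do_ocr=False, the remaining download is most likely the OCR engine's own models rather than the layout or table models.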

dolfim-ibm • Dec 12 '24 12:12

@dolfim-ibm Is there a way to prefetch the OCR models as well, like the layout and TableFormer models?

harinisri2001 • Dec 12 '24 12:12

It depends on the specific OCR engine. As a non-exhaustive summary:

  • Tesseract requires the models to be installed as system packages
  • EasyOCR has options for disabling the model download and setting the path where the models are stored, see https://ds4sd.github.io/docling/reference/pipeline_options/#docling.datamodel.pipeline_options.EasyOcrOptions
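
For the EasyOCR case, a rough sketch could look like the following. It assumes that EasyOcrOptions has the fields download_enabled and model_storage_directory (as listed on the linked reference page) and that PdfPipelineOptions accepts ocr_options. The helper names and the model directory are hypothetical.

```python
def easyocr_offline_kwargs(model_dir: str) -> dict:
    """Keyword arguments that point EasyOCR at local models and forbid downloads."""
    return {"download_enabled": False, "model_storage_directory": model_dir}


def build_pipeline_options(model_dir: str):
    """Hypothetical helper: PDF pipeline options with offline EasyOCR."""
    # Imported lazily so the kwargs helper above can be used without docling.
    from docling.datamodel.pipeline_options import EasyOcrOptions, PdfPipelineOptions

    # Assumption: these field names match the EasyOcrOptions of your version.
    ocr_options = EasyOcrOptions(**easyocr_offline_kwargs(model_dir))
    return PdfPipelineOptions(do_ocr=True, ocr_options=ocr_options)


# Example call with a made-up directory baked into the image:
# options = build_pipeline_options("/app/python/rag/resources/easyocr/")
```

With downloads disabled, the model directory has to be populated during the image build. One possible way, not verified against this setup, is to run EasyOCR once at build time with the same storage directory.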

dolfim-ibm • Dec 12 '24 12:12

@harinisri2001 I hope your issue is addressed with the pointers from @dolfim-ibm. I will close this until further feedback.

cau-git • Dec 18 '24 11:12