docling
Downloading detection and recognition models takes a lot of time and space on my pod
How to prevent Docling from downloading detection and recognition models
I have deployed a REST endpoint in Docker that calls Docling to parse (convert) documents. Every time the code is called, Docling starts by downloading the detection and recognition models, which is time-consuming and heavy on memory. I would like to turn this off and prevent Docling from downloading any models.
Please see my very basic code below:
from pathlib import Path
from docling.document_converter import DocumentConverter

def parse_with_docling(pipeline_input):
    doc_converter = DocumentConverter()
    input_doc_path = Path(pipeline_input.input_path)
    return doc_converter.convert(input_doc_path).document
I have a hunch that this is caused by EasyOCR. I would like to set download_enabled to false for EasyOCR without limiting the OCR feature to EasyOCR.
Thanks in advance! Arash
You can pre-fetch the models and point the pipeline at the local copy through artifacts_path:
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.pipeline.standard_pdf_pipeline import StandardPdfPipeline

# To explicitly prefetch:
# artifacts_path = StandardPdfPipeline.download_models_hf()

artifacts_path = "/local/path/to/artifacts"

pipeline_options = PdfPipelineOptions(artifacts_path=artifacts_path)
doc_converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)
https://ds4sd.github.io/docling/usage/#provide-specific-artifacts-path
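Since the original function builds a new DocumentConverter on every request, it may also help to create the converter once and reuse it. The following is only a sketch under assumptions: get_converter and ARTIFACTS_PATH are hypothetical names introduced here, the path is the same placeholder as above, and whether every model (including the OCR ones) is picked up from artifacts_path depends on the Docling version in use.

```python
from functools import lru_cache
from pathlib import Path

# Placeholder path from the answer above; replace with the real model directory.
ARTIFACTS_PATH = "/local/path/to/artifacts"


@lru_cache(maxsize=1)
def get_converter():
    """Build the DocumentConverter once and reuse it for every request."""
    # Imported here so the heavy docling import happens on first use only.
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import PdfPipelineOptions
    from docling.document_converter import DocumentConverter, PdfFormatOption

    pipeline_options = PdfPipelineOptions(artifacts_path=ARTIFACTS_PATH)
    return DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
        }
    )


def parse_with_docling(pipeline_input):
    # pipeline_input is the asker's object; only .input_path is assumed here.
    input_doc_path = Path(pipeline_input.input_path)
    return get_converter().convert(input_doc_path).document
```

With this layout the REST handler keeps calling parse_with_docling as before, and the converter is constructed on the first request only, so any model loading tied to its setup is not repeated per call.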
This PR should simplify all of it: https://github.com/DS4SD/docling/pull/876
@dolfim-ibm Short of implementation defects/enhancements, this can be closed, no?