Missing compatibility with the .safetensors version of docling-models

Open jbdebard opened this issue 1 year ago • 1 comments

Bug

When I use the .safetensors version of ds4sd/docling-models and set artifacts_path in docling.datamodel.pipeline_options.PdfPipelineOptions, the StandardPdfPipeline looks for

    _layout_model_path = "model_artifacts/layout/beehive_v0.0.5_pt"
    _table_model_path = "model_artifacts/tableformer"

but in the latest (.safetensors) version of ds4sd/docling-models, both the folder structure and the format of the model weights have changed, so loading the models fails.
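A quick way to confirm the mismatch is to check whether the two sub-paths quoted above exist under the local artifacts directory. This is only a minimal sketch: the helper name `missing_model_paths` is hypothetical (it is not part of docling), and it assumes the pipeline resolves both paths relative to artifacts_path.

```python
from pathlib import Path

# Relative paths quoted above from StandardPdfPipeline (docling 2.11.0).
EXPECTED_SUBPATHS = (
    "model_artifacts/layout/beehive_v0.0.5_pt",
    "model_artifacts/tableformer",
)


def missing_model_paths(artifacts_path, expected=EXPECTED_SUBPATHS):
    """Return the expected sub-paths that do not exist under artifacts_path.

    Hypothetical helper for illustration; it is not part of docling.
    Assumes the sub-paths are resolved relative to artifacts_path.
    """
    root = Path(artifacts_path)
    return [rel for rel in expected if not (root / rel).exists()]
```

An empty list would mean the directory has the layout that docling 2.11.0 looks for; any returned entries are folders the pipeline would not find.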

Steps to reproduce

    from docling.document_converter import DocumentConverter, PdfFormatOption, InputFormat
    from docling.pipeline.standard_pdf_pipeline import StandardPdfPipeline
    from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend
    from docling.datamodel.pipeline_options import PdfPipelineOptions

    # Placeholder: path to a self-imported copy of docling-models (.safetensors).
    artifacts_path = "path/to/docling-models"

    pipeline_options = PdfPipelineOptions()
    pipeline_options.do_ocr = False
    pipeline_options.do_table_structure = True
    pipeline_options.table_structure_options.do_cell_matching = True
    pipeline_options.artifacts_path = artifacts_path

    pdf_converter = DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(
                pipeline_cls=StandardPdfPipeline,
                pipeline_options=pipeline_options,
                backend=PyPdfiumDocumentBackend,
            ),
        }
    )

    # Placeholder: path to the PDF to convert.
    your_doc = "path/to/your.pdf"

    converted_doc = pdf_converter.convert(source=your_doc).document

Docling version

docling==2.11.0 docling-core==2.9.0 docling-ibm-models==2.0.6 docling-parse==3.0.0

Python version

3.11 ...

jbdebard, Dec 13 '24 10:12

@jbdebard The ref pulled by docling in the current version is v2.0.1 for this reason. If you want to pull the right version of the HF docling-models for current docling, you need to pull the same revision as docling would, which is encoded here.

The main branch of the HF docling-models is ahead, and will be used by a yet-to-release docling version. The safetensors-enabled version of docling is still scheduled for today.

cau-git, Dec 13 '24 12:12
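The revision pinning described in the comment above could be done by hand roughly as follows. This is a sketch, not the way docling itself downloads models: it assumes the `huggingface_hub` package is installed, the function name `download_pinned_models` is hypothetical, and the default revision v2.0.1 is only the ref named in the comment for the docling version discussed here; other docling versions may pin a different ref.

```python
def download_pinned_models(
    revision="v2.0.1",                # ref named in the comment above
    repo_id="ds4sd/docling-models",   # model repo named in the issue
    local_dir=None,                   # optional target folder (default: HF cache)
):
    """Download one fixed revision of the model repo and return its local path.

    Hypothetical helper for illustration; it is not part of docling.
    """
    # Imported lazily so this module can be imported without huggingface_hub.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id, revision=revision, local_dir=local_dir)
```

The returned folder can then be assigned to pipeline_options.artifacts_path, as in the reproduction steps.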

@jbdebard The safetensors model versions have been used in docling since 2.12.0. Closing this issue as completed.

cau-git, Dec 18 '24 12:12
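To tell which model layout an installed docling expects, the version threshold from the closing comment can be checked directly. This is a rough sketch: the helper names are hypothetical, the 2.12.0 threshold is taken only from that comment, and the version parsing is deliberately simplistic (it drops pre-release suffixes rather than ordering them).

```python
from importlib.metadata import PackageNotFoundError, version

# First docling release using the safetensors models, per the closing comment.
MIN_SAFETENSORS_VERSION = (2, 12, 0)


def parse_release(version_string):
    """Turn '2.12.0' into (2, 12, 0), dropping any non-numeric suffix."""
    parts = []
    for piece in version_string.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits or 0))
    while len(parts) < 3:
        parts.append(0)  # pad short versions such as '3.0'
    return tuple(parts)


def uses_safetensors_models(version_string):
    """True if this docling version is at or above the stated threshold."""
    return parse_release(version_string) >= MIN_SAFETENSORS_VERSION


def installed_docling_uses_safetensors():
    """Check the installed docling; returns None if docling is not installed."""
    try:
        return uses_safetensors_models(version("docling"))
    except PackageNotFoundError:
        return None
```

For the docling==2.11.0 reported in the issue, `uses_safetensors_models("2.11.0")` is False, which matches the failure described.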