
Addition of local vlm folder support

Navanit-git opened this issue 9 months ago • 7 comments

Hi, this PR lets the user point to a local repository or a model already downloaded to a path of their choice.

I have run this locally:

from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions, PictureDescriptionVlmOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

pipeline_options = PdfPipelineOptions()
pipeline_options.do_picture_description = True
pipeline_options.picture_description_options = PictureDescriptionVlmOptions(
    repo_id="/opt/nav/Qwen/Qwen2.5-VL-7B-Instruct",  # <-- local path to the model instead of a Hugging Face repo_id
    prompt="Extract the text from the images; if it is a table, extract it in table format. If there is no text, give 'No Image Text' as the response.",
)
pipeline_options.images_scale = 2.0
pipeline_options.generate_picture_images = True

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_options=pipeline_options,
        )
    }
)

Checklist:

  • [ ] Documentation has been updated, if necessary.
  • [x] Examples have been added, if necessary.
  • [x] Tests have been added, if necessary.

Navanit-git avatar Feb 25 '25 09:02 Navanit-git

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🔴 Enforce conventional commit

This rule is failing.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • [ ] title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:

mergify[bot] avatar Feb 25 '25 09:02 mergify[bot]

@dolfim-ibm any review?

Navanit-git avatar Feb 25 '25 13:02 Navanit-git

@dolfim-ibm any review?

Could you please see if #1057 would be enough for your use case?

We just started the approach of using artifacts_path consistently across models. In short: if it is defined, it will load all models locally.
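To illustrate the convention being described (a sketch only; `locate_model` and its parameters are illustrative names, not docling's actual internals): when `artifacts_path` is defined, every model is resolved beneath that folder and nothing is fetched, otherwise the default cache location is used.

```python
from pathlib import Path
from typing import Optional

def locate_model(name: str, artifacts_path: Optional[str], cache_dir: str) -> Path:
    """Sketch of the artifacts_path convention: if it is defined, all models
    are expected to already live beneath it (no downloads); otherwise the
    usual cache directory is consulted."""
    base = Path(artifacts_path) if artifacts_path is not None else Path(cache_dir)
    return base / name
```

For example, `locate_model("EasyOcr", "/models", "~/.cache/docling")` would resolve under `/models`, never touching the network.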

dolfim-ibm avatar Feb 25 '25 13:02 dolfim-ibm

@dolfim-ibm any review?

Could you please see if #1057 would be enough for your use case?

We just started the approach of using artifacts_path consistently across models. In short: if it is defined, it will load all models locally.

I don't think this will resolve the error; the local files will be downloaded again.

Navanit-git avatar Feb 25 '25 13:02 Navanit-git

@dolfim-ibm #1057 doesn't solve this via the artifacts path, since it then wants to download easyocr and the other models into that path as well; the artifacts path just changes the default location.

This PR instead makes the path dynamic for each model, via its repo_id.

cc @cau-git

Navanit-git avatar Feb 25 '25 15:02 Navanit-git

@Navanit-git From what I can see, your proposed change would not have much effect:

  1. If you use HF snapshot_download, it will only re-transfer the actual assets if the content found in the cache dir or the local_dir (if provided) is not current.
  2. Your change is applied only for the PictureDescriptionVlmModel, and no other model.

We established the CLI model downloader, and an analogous model download API to make it easy to pre-download models. However, if you want to work with pre-downloaded models and provide an artifacts_path to the converter, it will no longer even check on HuggingFace for new weights. This is the intended behaviour, which works well also for container environments where you might have non-standard model artifacts location (e.g. in writeable directory) or no networking at runtime.

Could you please explain what functionality you miss, given this update?

cau-git avatar Feb 26 '25 13:02 cau-git

@Navanit-git From what I can see, your proposed change would not have much effect:

  1. If you use HF snapshot_download, it will only re-transfer the actual assets if the content found in the cache dir or the local_dir (if provided) is not current.
  2. Your change is applied only for the PictureDescriptionVlmModel, and no other model.

We established the CLI model downloader, and an analogous model download API to make it easy to pre-download models. However, if you want to work with pre-downloaded models and provide an artifacts_path to the converter, it will no longer even check on HuggingFace for new weights. This is the intended behaviour, which works well also for container environments where you might have non-standard model artifacts location (e.g. in writeable directory) or no networking at runtime.

Could you please explain what functionality you miss, given this update?

Hey, thank you for the review. Basically, I have a PDF on which I want to run OCR using a VLM model, so I followed the steps at https://ds4sd.github.io/docling/examples/pictures_description/

I had downloaded the VLM model earlier to a local path that is not in the cache folder. When I pass that model's path as the repo_id, it gives an HF error since docling starts downloading. To support this, I added one line: if the repo_id already exists as a local path, don't download, just use the path. Regarding updating the weights: whenever a model gets loaded via transformers, it will automatically pick up any update to the weights or other changes in the model.

Yes, I know this is a minor change. I have a project to deliver where I have to OCR a PDF file with images into an md/text file, and I thought this simple change could help. If it's redundant, kindly close this PR.
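The one-line check described above can be sketched roughly like this (a sketch, not the actual patch; `resolve_model_path` is a hypothetical helper name):

```python
from pathlib import Path

def resolve_model_path(repo_id: str) -> str:
    # If repo_id is already an existing local directory, use it as-is
    # instead of treating it as a Hugging Face repo id.
    if Path(repo_id).is_dir():
        return repo_id
    # Otherwise fall back to downloading from the Hub (import deferred so the
    # local-path branch works without network access).
    from huggingface_hub import snapshot_download
    return snapshot_download(repo_id=repo_id)
```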

Also one small request: when can we expect image descriptions in the Markdown export? I was working on this, so that instead of the image placeholder we get the image description. I think I am very close to getting that by changing your docling_core library, but there is a time crunch on my project.
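For readers attempting the same thing before it lands upstream, one workaround is to post-process the exported Markdown, replacing docling's `<!-- image -->` placeholders with the generated descriptions (a sketch; `inject_descriptions` is a hypothetical helper, and it assumes the descriptions were collected in document order):

```python
import re

def inject_descriptions(markdown: str, descriptions: list[str]) -> str:
    # Replace each image placeholder with the next available description;
    # any placeholders beyond the end of the list are left untouched.
    it = iter(descriptions)
    return re.sub(r"<!-- image -->", lambda m: next(it, m.group(0)), markdown)
```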

Navanit-git avatar Feb 26 '25 14:02 Navanit-git

@dolfim-ibm @cau-git ...

Navanit-git avatar Mar 06 '25 09:03 Navanit-git

export to markdown format when can we expect the image description

this is coming very soon.

dolfim-ibm avatar Mar 07 '25 13:03 dolfim-ibm

Regarding the overall PR, my opinion in a few bullets:

  1. If you download a model from HF, you could simply move it to the folder with all the other artifacts. The fact that the default is called a cache is just naming; it is simply a folder with all the artifacts.
  2. Using the repo_id as a path might run into unknown issues. A user might have a folder matching some random HF model name, and then the load would not work.
  3. The idea of allowing a local path is good. Maybe we add a more explicit argument for it which does not reuse repo_id.

dolfim-ibm avatar Mar 07 '25 14:03 dolfim-ibm

Regarding the overall PR, my opinion in a few bullets:

  1. If you download a model from HF, you could simply move it to the folder with all the other artifacts. The fact that the default is called a cache is just naming; it is simply a folder with all the artifacts.
  2. Using the repo_id as a path might run into unknown issues. A user might have a folder matching some random HF model name, and then the load would not work.
  3. The idea of allowing a local path is good. Maybe we add a more explicit argument for it which does not reuse repo_id.

Thank you @dolfim-ibm. For now, I have made the above PR changes in my local copy of the docling library and it's working fine. But we never know if we will hit an error in the future; let's see if we can add patches for it then.

Navanit-git avatar Mar 07 '25 14:03 Navanit-git

As mentioned above, we simply don't want to overload the usage of the repo_id parameter. Adding a new local_path parameter to the options would be a good way to proceed.

If you are willing to update your PR with such an option we can keep it open, otherwise we will close it.

dolfim-ibm avatar Mar 19 '25 08:03 dolfim-ibm

@Navanit-git We haven't seen an update in two weeks, hence I will close this PR. Feel free to re-open it if you want to follow up again. Thanks!

cau-git avatar Mar 31 '25 09:03 cau-git