
feat: [Experimental] New VLM Pipeline leveraging vision models

maxmnemonic opened this pull request 10 months ago • 2 comments

Preliminary integration with SmolDocling model and VLM Pipeline:

  • SmolDocling inference model
  • New VLM Pipeline that uses SmolDocling model
  • Assembly code that builds a DoclingDocument from the DocTags format predicted by SmolDocling
  • Example of how to use
  • Rudimentary speed measurement logging

Checklist:

  • [ ] Documentation has been updated, if necessary.
  • [x] Examples have been added, if necessary.
  • [ ] Tests have been added, if necessary.

SmolDocling

maxmnemonic avatar Jan 08 '25 09:01 maxmnemonic

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • [X] title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
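The rule above can be checked locally before pushing. The snippet below runs the Mergify regex against this PR's title; it is a small illustration, not part of the PR itself.

```python
import re

# The conventional-commit regex from the merge protection rule above.
PATTERN = r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:"

title = "feat: [Experimental] New VLM Pipeline leveraging vision models"
ok = re.match(PATTERN, title) is not None
print(ok)  # True: "feat:" matches the allowed type prefix
```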

mergify[bot] avatar Jan 08 '25 09:01 mergify[bot]

I'm summarizing here the target of this PR; I will submit code proposals later.

VlmPipeline

Specs of the new pipeline

  • Input: (PDF) Document
  • Processing: using a vision language model
  • Output: DoclingDocument

Implementations

SmolDocling

Here the model will produce accurate DocTags which are converted (in the assemble step) to a DoclingDocument.

Other DocTags models

In the future we expect more models to produce DocTags; these would go through the same assembling step as SmolDocling.

Other intermediate outputs

The pipeline will also support VLMs that produce a different intermediate representation. For example, for models producing Markdown output, we would internally reuse the Markdown backend to create the DoclingDocument.
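The dispatch described above could look roughly like this. Function names and the format labels ("doctags", "markdown") are assumptions for illustration; the assemblers are stand-ins for the SmolDocling assemble step and the Markdown backend, respectively.

```python
def parse_doctags(raw: str) -> dict:
    # Stand-in for the SmolDocling DocTags assemble step.
    return {"source": "doctags", "body": raw}

def markdown_backend(raw: str) -> dict:
    # Stand-in for reusing Docling's Markdown backend.
    return {"source": "markdown", "body": raw}

def assemble(raw_output: str, response_format: str) -> dict:
    """Route the VLM's raw prediction to the matching assembler."""
    if response_format == "doctags":
        return parse_doctags(raw_output)
    if response_format == "markdown":
        return markdown_backend(raw_output)
    raise ValueError(f"unsupported intermediate format: {response_format}")

doc = assemble("# Title\n\nSome text.", "markdown")
print(doc["source"])  # markdown
```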

Wrap up

We definitely don't have to implement more than what is already nicely done in this PR, but some naming (especially in the options) could be tuned to be ready for the next steps.

My suggestion is to use the vlm_options as the discriminator which, in the future, will decide things like 1) which model to call, and 2) which type of internal assembly to apply.

I would at least introduce the kind field in the options from the beginning.
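A minimal sketch of what such an options object could look like, with kind as the discriminator; the field names and the model identifier are assumptions for illustration, not the merged API.

```python
from dataclasses import dataclass

# Hypothetical options shape; field names are assumptions.
@dataclass(frozen=True)
class VlmOptions:
    kind: str             # discriminator: decides model and assemble path
    model_id: str         # which model to call
    response_format: str  # "doctags" or "markdown" -> which assembler to use

# Example instance for the SmolDocling case (model id is illustrative).
SMOLDOCLING = VlmOptions(
    kind="smoldocling",
    model_id="smoldocling-preview",
    response_format="doctags",
)
print(SMOLDOCLING.kind)  # smoldocling
```

Making the options frozen keeps them hashable, so they can later serve as cache keys or registry lookups when dispatching between pipelines.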

dolfim-ibm avatar Feb 13 '25 15:02 dolfim-ibm

This was merged on a derived PR.

cau-git avatar Feb 26 '25 19:02 cau-git