
feat: [Experimental] New VLM Pipeline leveraging vision models

maxmnemonic opened this pull request 10 months ago • 2 comments

Preliminary integration with SmolDocling model and VLM Pipeline:

  • SmolDocling inference model
  • New VLM Pipeline that uses SmolDocling model
  • Assembly code that builds a DoclingDocument from the DocTags format predicted by SmolDocling
  • Example of how to use
  • Rudimentary speed measurement logging

Checklist:

  • [ ] Documentation has been updated, if necessary.
  • [x] Examples have been added, if necessary.
  • [ ] Tests have been added, if necessary.

SmolDocling

maxmnemonic avatar Jan 08 '25 09:01 maxmnemonic

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • [X] title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
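The rule above can be checked locally before pushing. The snippet below runs the Mergify regex against this PR's title; it is a small illustration, not part of the PR itself.

```python
import re

# The conventional-commit regex from the merge protection rule above.
PATTERN = r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:"

title = "feat: [Experimental] New VLM Pipeline leveraging vision models"
ok = re.match(PATTERN, title) is not None
print(ok)  # True: "feat:" matches the allowed type prefix
```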

mergify[bot] avatar Jan 08 '25 09:01 mergify[bot]

I'm summarizing here the target of this PR; I will submit code proposals later.

VlmPipeline

Specs of the new pipeline

  • Input: (PDF) Document
  • Processing: using a vision language model
  • Output: DoclingDocument

Implementations

SmolDocling

Here the model will produce accurate DocTags which are converted (in the assemble step) to a DoclingDocument.

Other DocTags models

In the future we expect more models to produce DocTags; these would go through the same assembling step as SmolDocling.

Other intermediate outputs

The pipeline will also support VLMs that produce a different intermediate representation. For example, for models producing Markdown output, we would internally reuse the Markdown backend to create the DoclingDocument.
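The dispatch described above could look roughly like this. Function names and the format labels ("doctags", "markdown") are assumptions for illustration; the assemblers are stand-ins for the SmolDocling assemble step and the Markdown backend, respectively.

```python
def parse_doctags(raw: str) -> dict:
    # Stand-in for the SmolDocling DocTags assemble step.
    return {"source": "doctags", "body": raw}

def markdown_backend(raw: str) -> dict:
    # Stand-in for reusing Docling's Markdown backend.
    return {"source": "markdown", "body": raw}

def assemble(raw_output: str, response_format: str) -> dict:
    """Route the VLM's raw prediction to the matching assembler."""
    if response_format == "doctags":
        return parse_doctags(raw_output)
    if response_format == "markdown":
        return markdown_backend(raw_output)
    raise ValueError(f"unsupported intermediate format: {response_format}")

doc = assemble("# Title\n\nSome text.", "markdown")
print(doc["source"])  # markdown
```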

Wrap up

We definitely don't have to implement more than what is already nicely done in this PR, but some naming (especially in the options) could be tuned to be ready for the next steps.

My suggestion is to use the vlm_options as the discriminator which, in the future, will decide things like 1) which model to call, and 2) which type of internal assembly to apply.

I would at least introduce the kind field in the options from the beginning.
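A minimal sketch of what such an options object could look like, with kind as the discriminator; the field names and the model identifier are assumptions for illustration, not the merged API.

```python
from dataclasses import dataclass

# Hypothetical options shape; field names are assumptions.
@dataclass(frozen=True)
class VlmOptions:
    kind: str             # discriminator: decides model and assemble path
    model_id: str         # which model to call
    response_format: str  # "doctags" or "markdown" -> which assembler to use

# Example instance for the SmolDocling case (model id is illustrative).
SMOLDOCLING = VlmOptions(
    kind="smoldocling",
    model_id="smoldocling-preview",
    response_format="doctags",
)
print(SMOLDOCLING.kind)  # smoldocling
```

Making the options frozen keeps them hashable, so they can later serve as cache keys or registry lookups when dispatching between pipelines.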

dolfim-ibm avatar Feb 13 '25 15:02 dolfim-ibm

This was merged on a derived PR.

cau-git avatar Feb 26 '25 19:02 cau-git