docling
feat: [Experimental] New VLM Pipeline leveraging vision models
Preliminary integration with SmolDocling model and VLM Pipeline:
- SmolDocling inference model
- New VLM Pipeline that uses SmolDocling model
- Assembly code that builds a DoclingDocument from the DocTags format predicted by SmolDocling
- Example of how to use
- Rudimentary speed measurement logging
Checklist:
- [ ] Documentation has been updated, if necessary.
- [x] Examples have been added, if necessary.
- [ ] Tests have been added, if necessary.
Merge Protections
Your pull request matches the following merge protections and will not be merged until they are valid.
🟢 Enforce conventional commit
Wonderful, this rule succeeded.
Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
- [X] title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
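For checking a title locally before pushing, the rule can be approximated with a short script. This is only a sketch: it assumes the merge protection applies the pattern as a regex search and that Python's `re` module interprets the pattern the same way as the bot's engine. The helper name `is_conventional` is made up for illustration.

```python
import re

# Pattern copied from the merge protection rule above.
CONVENTIONAL_TITLE = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:"
)


def is_conventional(title: str) -> bool:
    """Return True if the title starts with a conventional-commit prefix."""
    return CONVENTIONAL_TITLE.search(title) is not None


if __name__ == "__main__":
    for title in (
        "feat: [Experimental] New VLM Pipeline leveraging vision models",
        "feat(vlm)!: change pipeline options",
        "Add new VLM pipeline",
    ):
        print(f"{is_conventional(title)!s:5}  {title}")
```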
I'm summarizing the target of this PR here; I will submit code proposals later.
VlmPipeline
Specs of the new pipeline
- Input: (PDF) Document
- Processing: using a vision language model
- Output: DoclingDocument
Implementations
SmolDocling
Here the model will produce accurate DocTags which are converted (in the assemble step) to a DoclingDocument.
Other DocTags models
In the future we expect more models that produce DocTags; their output would go through the same assemble step as SmolDocling.
Other intermediate outputs
The pipeline will also support VLMs that produce a different intermediate representation. For example, for models that produce Markdown output, we would internally reuse the Markdown backend to create the DoclingDocument.
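To make the idea concrete, here is a minimal sketch of an assemble step that dispatches on the intermediate representation. Everything in it is hypothetical and not taken from the PR: `ResponseFormat`, `assemble`, and the two stub assemblers are illustrative names, and a plain dict stands in for the DoclingDocument.

```python
from enum import Enum
from typing import Callable, Dict


class ResponseFormat(str, Enum):
    """Hypothetical tag for the intermediate representation a VLM emits."""

    DOCTAGS = "doctags"
    MARKDOWN = "markdown"


def assemble_from_doctags(text: str) -> dict:
    # Stub: the real step would parse DocTags into a DoclingDocument.
    return {"source_format": "doctags", "raw": text}


def assemble_from_markdown(text: str) -> dict:
    # Stub: the real step would hand the text to the Markdown backend.
    return {"source_format": "markdown", "raw": text}


# One assembler per supported intermediate representation.
_ASSEMBLERS: Dict[ResponseFormat, Callable[[str], dict]] = {
    ResponseFormat.DOCTAGS: assemble_from_doctags,
    ResponseFormat.MARKDOWN: assemble_from_markdown,
}


def assemble(text: str, fmt: ResponseFormat) -> dict:
    """Route the raw model output to the assembler for its format."""
    try:
        assembler = _ASSEMBLERS[fmt]
    except KeyError:
        raise ValueError(f"unsupported response format: {fmt!r}") from None
    return assembler(text)
```

With this shape, supporting a new DocTags model needs no new assembler, and supporting a new representation means adding one entry to the table.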
Wrap up
We definitely don't have to implement more than what is already nicely done in the PR, but a few names (especially in the options) could be tuned so they are ready for the next steps.
My suggestion is to use vlm_options as the discriminator which, in the future, will decide things like 1) which model to call and 2) which type of internal assemble step to run.
I would at least introduce the kind field in the options from the beginning.
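As a rough illustration of the suggestion, the options could carry a `kind` field that a single branch point reads. This is a sketch under assumptions, not the PR's code: the class names, the `repo_id` field, the `kind` values, and `describe_plan` are all hypothetical, and the repository ids are placeholders.

```python
from dataclasses import dataclass
from typing import Literal, Union


@dataclass(frozen=True)
class DocTagsVlmOptions:
    """Hypothetical options for a model that emits DocTags."""

    repo_id: str
    kind: Literal["doctags_model"] = "doctags_model"


@dataclass(frozen=True)
class MarkdownVlmOptions:
    """Hypothetical options for a model that emits Markdown."""

    repo_id: str
    kind: Literal["markdown_model"] = "markdown_model"


VlmOptions = Union[DocTagsVlmOptions, MarkdownVlmOptions]


def describe_plan(options: VlmOptions) -> str:
    """Show how `kind` could select both the model and the assemble step."""
    if options.kind == "doctags_model":
        return f"run {options.repo_id}, assemble from DocTags"
    if options.kind == "markdown_model":
        return f"run {options.repo_id}, assemble via the Markdown backend"
    raise ValueError(f"unknown vlm_options kind: {options.kind!r}")


if __name__ == "__main__":
    # Placeholder repository ids, not real model names.
    print(describe_plan(DocTagsVlmOptions(repo_id="example-org/doctags-model")))
    print(describe_plan(MarkdownVlmOptions(repo_id="example-org/markdown-model")))
```

Introducing `kind` early means later option types can be added without renaming existing fields.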
This was merged on a derived PR.