feat: adding new vlm-models support
Added several VLM backends:
- [x] MLX
- [x] AutoModelForCausalLM
- [x] AutoModelForVision2Seq
- [x] LlavaForConditionalGeneration
Added support for running the following VLMs:
- [x] SmolDocling (native and MLX)
- [x] GraniteVision (native and Ollama; no working MLX version available)
- [x] Qwen2.5-VL (MLX)
- [x] Pixtral-12b (native and MLX)
- [x] Phi4-multimodal-instruct (native; no working MLX version available)
Others that will come in a follow-up: Gemma, Llama-VL.
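For reviewers, here is a minimal sketch of how one of these models can be selected through the new VLM pipeline. The preset name `smoldocling_vlm_mlx_conversion_options` and its import path are assumptions here; see `docs/examples/minimal_vlm_pipeline.py` for the exact symbols defined in this PR.

```python
# Minimal sketch: convert a PDF with the VLM pipeline using the MLX SmolDocling preset.
# The preset name and its import path are assumptions; the authoritative example is
# docs/examples/minimal_vlm_pipeline.py in this PR.
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import (
    VlmPipelineOptions,
    smoldocling_vlm_mlx_conversion_options,  # assumed name of the preset added here
)
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.pipeline.vlm_pipeline import VlmPipeline

pipeline_options = VlmPipelineOptions(
    vlm_options=smoldocling_vlm_mlx_conversion_options,  # swap in another preset to change models
)

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_cls=VlmPipeline,
            pipeline_options=pipeline_options,
        )
    }
)

result = converter.convert("tests/data/pdf/2305.03393v1-pg9.pdf")
print(result.document.export_to_markdown())
```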
Checklist:
- [ ] Documentation has been updated, if necessary.
- [X] Examples have been added, if necessary.
- [X] Tests have been updated (minimal_vlm_pipeline).
Examples
You can run
caffeinate poetry run python ./docs/examples/minimal_vlm_pipeline.py
to quickly obtain conversion results for the different VLMs (written to the ./scratch folder), together with timings (measured on an M3 Ultra):
| input file | model | vlm-framework | time [sec] |
|---|---|---|---|
| 2305.03393v1-pg9.pdf | ds4sd_SmolDocling-256M-preview-mlx-bf16 | InferenceFramework.MLX | 6.02189 |
| 2305.03393v1-pg9.pdf | mlx-community_Qwen2.5-VL-3B-Instruct-bf16 | InferenceFramework.MLX | 23.4069 |
| 2305.03393v1-pg9.pdf | mlx-community_pixtral-12b-bf16 | InferenceFramework.MLX | 287.485 |
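The timings in the table above can be reproduced with a plain wall-clock measurement around the converter. The sketch below shows one way to do this; the preset names are assumptions, and the actual loop lives in `docs/examples/minimal_vlm_pipeline.py`.

```python
# Hedged sketch of a per-model timing loop; the preset names below are assumptions,
# the real list of models is in docs/examples/minimal_vlm_pipeline.py.
import time

from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import (
    VlmPipelineOptions,
    granite_vision_vlm_conversion_options,  # assumed preset names
    smoldocling_vlm_mlx_conversion_options,
)
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.pipeline.vlm_pipeline import VlmPipeline

SOURCE = "tests/data/pdf/2305.03393v1-pg9.pdf"
PRESETS = {
    "SmolDocling (MLX)": smoldocling_vlm_mlx_conversion_options,
    "GraniteVision (native)": granite_vision_vlm_conversion_options,
}

for name, vlm_options in PRESETS.items():
    converter = DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(
                pipeline_cls=VlmPipeline,
                pipeline_options=VlmPipelineOptions(vlm_options=vlm_options),
            )
        }
    )
    start = time.perf_counter()
    converter.convert(SOURCE)
    print(f"{name}: {time.perf_counter() - start:.2f} s")
```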
Codecov Report
Attention: Patch coverage is 53.74332% with 173 lines in your changes missing coverage. Please review.
@cau-git You can do a review and then we can discuss. I am sure it is not yet 💯, but we are getting closer.
| source | model_id | framework | num_pages | time [sec] |
|---|---|---|---|---|
| tests/data/pdf/2305.03393v1-pg9.pdf | ds4sd_SmolDocling-256M-preview | InferenceFramework.TRANSFORMERS_VISION2SEQ | 1 | 102.212 |
| tests/data/pdf/2305.03393v1-pg9.pdf | ds4sd_SmolDocling-256M-preview-mlx-bf16 | InferenceFramework.MLX | 1 | 6.15453 |
| tests/data/pdf/2305.03393v1-pg9.pdf | mlx-community_Qwen2.5-VL-3B-Instruct-bf16 | InferenceFramework.MLX | 1 | 23.4951 |
| tests/data/pdf/2305.03393v1-pg9.pdf | mlx-community_pixtral-12b-bf16 | InferenceFramework.MLX | 1 | 308.856 |
| tests/data/pdf/2305.03393v1-pg9.pdf | mlx-community_gemma-3-12b-it-bf16 | InferenceFramework.MLX | 1 | 378.486 |
| tests/data/pdf/2305.03393v1-pg9.pdf | ibm-granite_granite-vision-3.2-2b | InferenceFramework.TRANSFORMERS_VISION2SEQ | 1 | 104.75 |
| tests/data/pdf/2305.03393v1-pg9.pdf | microsoft_Phi-4-multimodal-instruct | InferenceFramework.TRANSFORMERS_CAUSALLM | 1 | 1175.67 |
| tests/data/pdf/2305.03393v1-pg9.pdf | mistral-community_pixtral-12b | InferenceFramework.TRANSFORMERS_VISION2SEQ | 1 | 1828.21 |
Hi @PeterStaar-IBM, @dolfim-ibm and @cau-git, this update is really great, thanks. I tested it and it works really well. I have some optimization suggestions for the remote API solution:
- Can you please add a retry option (int) to specify how many times a request should be retried when it fails, as the OpenAI SDK does? https://github.com/openai/openai-python?tab=readme-ov-file#retries
- It would be nice to have the possibility to configure the HTTP client (something like httpx), which would help with configuring extra things like proxies.
- What about giving the possibility to run the requests for all (or a group of) pages in parallel, so that we do not have to wait for one page before asking for the markdown of the next one, and then collect them with something like asyncio.gather once they are all done? It can really be a time saver. A rough sketch of these ideas follows below.
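To illustrate the last two points, here is a rough, non-docling sketch of what the remote-API client could do: transport-level retries plus page-level requests fired in parallel and collected in page order. The endpoint URL, model name, and payload shape are hypothetical placeholders for an OpenAI-compatible server.

```python
# Rough sketch of the suggested behaviour for a remote VLM backend (not docling code):
# httpx transport retries for failed connections, and asyncio.gather to run the
# per-page requests in parallel while keeping the results in page order.
import asyncio

import httpx

API_URL = "http://localhost:11434/v1/chat/completions"  # hypothetical endpoint


async def convert_page(client: httpx.AsyncClient, page_image_b64: str) -> str:
    """Ask the remote VLM for the markdown of a single page (hypothetical payload)."""
    payload = {
        "model": "granite-vision",  # hypothetical model name
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Convert this page to markdown."},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{page_image_b64}"},
                    },
                ],
            }
        ],
    }
    resp = await client.post(API_URL, json=payload, timeout=120.0)
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]


async def convert_pages(page_images: list[str], retries: int = 3, concurrency: int = 4) -> list[str]:
    """Convert all pages concurrently; asyncio.gather returns results in page order."""
    transport = httpx.AsyncHTTPTransport(retries=retries)  # retries connection failures only
    semaphore = asyncio.Semaphore(concurrency)  # cap the number of in-flight requests

    async with httpx.AsyncClient(transport=transport) as client:

        async def bounded(image: str) -> str:
            async with semaphore:
                return await convert_page(client, image)

        return await asyncio.gather(*(bounded(img) for img in page_images))


# pages_md = asyncio.run(convert_pages(page_images))
```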
Thanks for the great work you are doing, it's really amazing 💪