
feat: adding new vlm-models support

Open PeterStaar-IBM opened this pull request 6 months ago • 3 comments

Added several VLM backends (a loading sketch follows the list below)

  • [x] MLX
  • [x] AutoModelForCausalLM
  • [x] AutoModelForVision2Seq
  • [x] LlavaForConditionalGeneration
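
For context, here is a minimal sketch (not docling's actual backend code) of how these four backends map onto their underlying loaders. The `load_vlm` helper and the `backend` keys are hypothetical; the class names come from the Hugging Face transformers and mlx-vlm packages:

```python
# Hypothetical helper showing how the four backends above map onto
# their underlying loader classes.
from transformers import (
    AutoModelForCausalLM,
    AutoModelForVision2Seq,
    AutoProcessor,
    LlavaForConditionalGeneration,
)

def load_vlm(repo_id: str, backend: str):
    """Return (model, processor) for the requested backend."""
    if backend == "mlx":
        # Apple-silicon path; mlx_vlm.load returns (model, processor).
        from mlx_vlm import load
        return load(repo_id)
    loaders = {
        "causallm": AutoModelForCausalLM,
        "vision2seq": AutoModelForVision2Seq,
        "llava": LlavaForConditionalGeneration,
    }
    model = loaders[backend].from_pretrained(repo_id)
    processor = AutoProcessor.from_pretrained(repo_id)
    return model, processor
```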

Added support for running the following VLMs:

  • [x] SmolDocling (native and MLX)
  • [x] GraniteVision (native and ollama; no working MLX version available)
  • [x] Qwen2.5-VL (MLX)
  • [x] Pixtral-2b (native and MLX)
  • [x] Phi4-multimodal-instruct (native; no working MLX version available)

Others that will come in a follow-up feature: Gemma, Llama-VL.

Checklist:

  • [ ] Documentation has been updated, if necessary.
  • [X] Examples have been added, if necessary.
  • [X] Tests have been updated (minimal_vlm_pipeline).

Examples

You can run

caffeinate poetry run python ./docs/examples/minimal_vlm_pipeline.py

to quickly obtain conversion results for different VLMs (in the ./scratch folder) with timings (M3 Ultra); a usage sketch follows the table:

| input file | model | vlm-framework | time [sec] |
|---|---|---|---|
| 2305.03393v1-pg9.pdf | ds4sd_SmolDocling-256M-preview-mlx-bf16 | InferenceFramework.MLX | 6.02189 |
| 2305.03393v1-pg9.pdf | mlx-community_Qwen2.5-VL-3B-Instruct-bf16 | InferenceFramework.MLX | 23.4069 |
| 2305.03393v1-pg9.pdf | mlx-community_pixtral-12b-bf16 | InferenceFramework.MLX | 287.485 |
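
For reference, a minimal sketch of such a conversion in code; the import paths and the `pipeline_cls` option are assumptions based on this PR and may not match the final API exactly:

```python
# Sketch only: class and option names are assumptions based on this PR.
from docling.datamodel.base_models import InputFormat
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.pipeline.vlm_pipeline import VlmPipeline

converter = DocumentConverter(
    format_options={
        # Route PDF inputs through the new VLM-based pipeline.
        InputFormat.PDF: PdfFormatOption(pipeline_cls=VlmPipeline),
    }
)
result = converter.convert("tests/data/pdf/2305.03393v1-pg9.pdf")
print(result.document.export_to_markdown())
```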

PeterStaar-IBM avatar May 11 '25 07:05 PeterStaar-IBM

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • [X] title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:

mergify[bot] avatar May 11 '25 07:05 mergify[bot]

@cau-git You can have a review and then we can discuss. I am sure it is not yet 💯, but we are getting closer.

PeterStaar-IBM avatar May 18 '25 09:05 PeterStaar-IBM

| source | model_id | framework | num_pages | time [sec] |
|---|---|---|---|---|
| tests/data/pdf/2305.03393v1-pg9.pdf | ds4sd_SmolDocling-256M-preview | InferenceFramework.TRANSFORMERS_VISION2SEQ | 1 | 102.212 |
| tests/data/pdf/2305.03393v1-pg9.pdf | ds4sd_SmolDocling-256M-preview-mlx-bf16 | InferenceFramework.MLX | 1 | 6.15453 |
| tests/data/pdf/2305.03393v1-pg9.pdf | mlx-community_Qwen2.5-VL-3B-Instruct-bf16 | InferenceFramework.MLX | 1 | 23.4951 |
| tests/data/pdf/2305.03393v1-pg9.pdf | mlx-community_pixtral-12b-bf16 | InferenceFramework.MLX | 1 | 308.856 |
| tests/data/pdf/2305.03393v1-pg9.pdf | mlx-community_gemma-3-12b-it-bf16 | InferenceFramework.MLX | 1 | 378.486 |
| tests/data/pdf/2305.03393v1-pg9.pdf | ibm-granite_granite-vision-3.2-2b | InferenceFramework.TRANSFORMERS_VISION2SEQ | 1 | 104.75 |
| tests/data/pdf/2305.03393v1-pg9.pdf | microsoft_Phi-4-multimodal-instruct | InferenceFramework.TRANSFORMERS_CAUSALLM | 1 | 1175.67 |
| tests/data/pdf/2305.03393v1-pg9.pdf | mistral-community_pixtral-12b | InferenceFramework.TRANSFORMERS_VISION2SEQ | 1 | 1828.21 |

dolfim-ibm avatar Jun 01 '25 19:06 dolfim-ibm

Hi @PeterStaar-IBM, @dolfim-ibm and @cau-git, this update is really great, thanks! I tested it and it works really well. I have some optimization suggestions for the remote API solution:

  • Could you please add a retry option (int) that sets how many times a request should be retried when it fails, as the OpenAI SDK does? https://github.com/openai/openai-python?tab=readme-ov-file#retries
  • It would be nice to be able to configure the HTTP client (something like httpx), which would help with setting up extra things like proxies.
  • What about making it possible to run the requests for all (or a group of) pages in parallel, so that we don't have to wait for one page before asking for the markdown of the next, and then sort the results with something like asyncio.gather once they are all done? It could really be a time saver (a rough sketch follows this list).
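
A rough sketch of how these three suggestions could fit together; the endpoint URL, payload shape, and `"markdown"` response field are hypothetical, while httpx and asyncio.gather (which preserves input order) are real APIs:

```python
# Sketch combining the three suggestions above; the endpoint, payload,
# and "markdown" response field are hypothetical.
import asyncio
import httpx

async def convert_page(client: httpx.AsyncClient, url: str,
                       payload: dict, retries: int = 3) -> str:
    for attempt in range(retries + 1):
        try:
            resp = await client.post(url, json=payload, timeout=60.0)
            resp.raise_for_status()
            return resp.json()["markdown"]  # hypothetical response field
        except httpx.HTTPError:
            if attempt == retries:
                raise
            await asyncio.sleep(2 ** attempt)  # simple exponential backoff

async def convert_pages(url: str, pages: list[dict]) -> list[str]:
    # A user-supplied AsyncClient could carry proxy/TLS settings.
    async with httpx.AsyncClient() as client:
        # gather preserves input order, so results line up with pages.
        return await asyncio.gather(
            *(convert_page(client, url, p) for p in pages)
        )
```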

Thanks for the great work you are doing, it's really amazing 💪

KapyGenius avatar Jun 06 '25 18:06 KapyGenius