
feat: adding new vlm-models support

Open PeterStaar-IBM opened this pull request 6 months ago • 3 comments

Added several VLM backends (a loading sketch follows the list below)

  • [x] MLX
  • [x] AutoModelForCausalLM
  • [x] AutoModelForVision2Seq
  • [x] LlavaForConditionalGeneration
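
For context, here is a minimal sketch (not docling's actual backend code) of how these four backends map onto their underlying loaders. The `load_vlm` helper and the `backend` keys are hypothetical; the class names come from the Hugging Face transformers and mlx-vlm packages:

```python
# Hypothetical helper showing how the four backends above map onto
# their underlying loader classes.
from transformers import (
    AutoModelForCausalLM,
    AutoModelForVision2Seq,
    AutoProcessor,
    LlavaForConditionalGeneration,
)

def load_vlm(repo_id: str, backend: str):
    """Return (model, processor) for the requested backend."""
    if backend == "mlx":
        # Apple-silicon path; mlx_vlm.load returns (model, processor).
        from mlx_vlm import load
        return load(repo_id)
    loaders = {
        "causallm": AutoModelForCausalLM,
        "vision2seq": AutoModelForVision2Seq,
        "llava": LlavaForConditionalGeneration,
    }
    model = loaders[backend].from_pretrained(repo_id)
    processor = AutoProcessor.from_pretrained(repo_id)
    return model, processor
```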

Added support for running the following VLMs:

  • [x] SmolDocling (native and MLX)
  • [x] GraniteVision (native and ollama; no working MLX version available)
  • [x] Qwen2.5-VL (MLX)
  • [x] Pixtral-2b (native and MLX)
  • [x] Phi4-multimodal-instruct (native; no working MLX version available)

Others that will come in a follow-up feature: Gemma, Llama-VL.

Checklist:

  • [ ] Documentation has been updated, if necessary.
  • [X] Examples have been added, if necessary.
  • [X] Tests have been updated (minimal_vlm_pipeline).

Examples

You can run

caffeinate poetry run python ./docs/examples/minimal_vlm_pipeline.py

to quickly obtain conversion results for different VLMs (in the ./scratch folder) with timings (M3 Ultra); a usage sketch follows the table:

| input file | model | vlm-framework | time [sec] |
|---|---|---|---|
| 2305.03393v1-pg9.pdf | ds4sd_SmolDocling-256M-preview-mlx-bf16 | InferenceFramework.MLX | 6.02189 |
| 2305.03393v1-pg9.pdf | mlx-community_Qwen2.5-VL-3B-Instruct-bf16 | InferenceFramework.MLX | 23.4069 |
| 2305.03393v1-pg9.pdf | mlx-community_pixtral-12b-bf16 | InferenceFramework.MLX | 287.485 |
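
For reference, a minimal sketch of such a conversion in code; the import paths and the `pipeline_cls` option are assumptions based on this PR and may not match the final API exactly:

```python
# Sketch only: class and option names are assumptions based on this PR.
from docling.datamodel.base_models import InputFormat
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.pipeline.vlm_pipeline import VlmPipeline

converter = DocumentConverter(
    format_options={
        # Route PDF inputs through the new VLM-based pipeline.
        InputFormat.PDF: PdfFormatOption(pipeline_cls=VlmPipeline),
    }
)
result = converter.convert("tests/data/pdf/2305.03393v1-pg9.pdf")
print(result.document.export_to_markdown())
```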

PeterStaar-IBM avatar May 11 '25 07:05 PeterStaar-IBM

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • [X] title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:

mergify[bot] avatar May 11 '25 07:05 mergify[bot]

@cau-git You can have a review and then we can discuss. I am sure it is not yet 💯, but we are getting closer.

PeterStaar-IBM avatar May 18 '25 09:05 PeterStaar-IBM

| source | model_id | framework | num_pages | time [sec] |
|---|---|---|---|---|
| tests/data/pdf/2305.03393v1-pg9.pdf | ds4sd_SmolDocling-256M-preview | InferenceFramework.TRANSFORMERS_VISION2SEQ | 1 | 102.212 |
| tests/data/pdf/2305.03393v1-pg9.pdf | ds4sd_SmolDocling-256M-preview-mlx-bf16 | InferenceFramework.MLX | 1 | 6.15453 |
| tests/data/pdf/2305.03393v1-pg9.pdf | mlx-community_Qwen2.5-VL-3B-Instruct-bf16 | InferenceFramework.MLX | 1 | 23.4951 |
| tests/data/pdf/2305.03393v1-pg9.pdf | mlx-community_pixtral-12b-bf16 | InferenceFramework.MLX | 1 | 308.856 |
| tests/data/pdf/2305.03393v1-pg9.pdf | mlx-community_gemma-3-12b-it-bf16 | InferenceFramework.MLX | 1 | 378.486 |
| tests/data/pdf/2305.03393v1-pg9.pdf | ibm-granite_granite-vision-3.2-2b | InferenceFramework.TRANSFORMERS_VISION2SEQ | 1 | 104.75 |
| tests/data/pdf/2305.03393v1-pg9.pdf | microsoft_Phi-4-multimodal-instruct | InferenceFramework.TRANSFORMERS_CAUSALLM | 1 | 1175.67 |
| tests/data/pdf/2305.03393v1-pg9.pdf | mistral-community_pixtral-12b | InferenceFramework.TRANSFORMERS_VISION2SEQ | 1 | 1828.21 |

dolfim-ibm avatar Jun 01 '25 19:06 dolfim-ibm

Hi @PeterStaar-IBM, @dolfim-ibm and @cau-git, this update is really great, thanks! I tested it and it works really well. I have some optimization suggestions for the remote API solution:

  • Could you please add a retry option (int) that sets how many times a request should be retried when it fails, as the OpenAI SDK does? https://github.com/openai/openai-python?tab=readme-ov-file#retries
  • It would be nice to be able to configure the HTTP client (something like httpx), which would help with setting up extra things like proxies.
  • What about making it possible to run the requests for all (or a group of) pages in parallel, so that we don't have to wait for one page before asking for the markdown of the next, and then sort the results with something like asyncio.gather once they are all done? It could really be a time saver (a rough sketch follows this list).
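
A rough sketch of how these three suggestions could fit together; the endpoint URL, payload shape, and `"markdown"` response field are hypothetical, while httpx and asyncio.gather (which preserves input order) are real APIs:

```python
# Sketch combining the three suggestions above; the endpoint, payload,
# and "markdown" response field are hypothetical.
import asyncio
import httpx

async def convert_page(client: httpx.AsyncClient, url: str,
                       payload: dict, retries: int = 3) -> str:
    for attempt in range(retries + 1):
        try:
            resp = await client.post(url, json=payload, timeout=60.0)
            resp.raise_for_status()
            return resp.json()["markdown"]  # hypothetical response field
        except httpx.HTTPError:
            if attempt == retries:
                raise
            await asyncio.sleep(2 ** attempt)  # simple exponential backoff

async def convert_pages(url: str, pages: list[dict]) -> list[str]:
    # A user-supplied AsyncClient could carry proxy/TLS settings.
    async with httpx.AsyncClient() as client:
        # gather preserves input order, so results line up with pages.
        return await asyncio.gather(
            *(convert_page(client, url, p) for p in pages)
        )
```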

Thanks for the great work you are doing, it's really amazing 💪

KapyGenius avatar Jun 06 '25 18:06 KapyGenius