Feature Request: Please add local LLM-based OCR engines
Version Info
Python version: 3.11.9
Python executable: E:\Python311\python.exe
Version: 1.4.0
Branch: dev
Type of Request
New Feature
Description
Please add callisto_ocr3_2b_instruct, qwen2_vl_ocr_2B_instruct, and OLM-OCR as local-model OCR engines.
I did some research and found some good OCR models.
I tested them on 5 languages [English, Japanese, Chinese, Korean, Russian], and the best of the best is
OLM-OCR
But it's a heavy 7B model, so it's better to use
callisto_ocr3_2b_instruct and qwen2_vl_ocr_2B_instruct; these models are really fast and capable.
As for the prompt, you need to use:
Give me text from image, written in {lang} language, nothing else.
where {lang} is the language of the image (e.g., English, Korean).
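For illustration, here is a minimal sketch of how such a model could be driven through transformers with that prompt. The Hugging Face repo id is a placeholder (not a confirmed path for qwen2_vl_ocr_2B_instruct), and the generation settings are assumptions:

```python
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

# Placeholder repo id -- replace with the real Hugging Face path of the model.
MODEL_ID = "your-namespace/Qwen2-VL-OCR-2B-Instruct"

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype="auto", device_map="auto"
)

def ocr_image(image_path: str, lang: str) -> str:
    image = Image.open(image_path).convert("RGB")
    messages = [{
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text",
             "text": f"Give me text from image, written in {lang} language, nothing else."},
        ],
    }]
    prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
    inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=1024)
    # Decode only the newly generated tokens, not the prompt.
    new_tokens = output_ids[:, inputs["input_ids"].shape[1]:]
    return processor.batch_decode(new_tokens, skip_special_tokens=True)[0]

print(ocr_image("sample.png", "English"))
```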
Pictures
No response
Additional Information
No response
Can they be connected via Ollama or LM Studio? Or on the llama.cpp engine?
I am also interested in the software that you used to conduct the tests. Please provide a link or code.
Can they be connected via Ollama or LM Studio? Or on the llama.cpp engine?
No, only via transformers, as far as I know.
I am also interested in the software that you used to conduct the tests. Please provide a link or code.
OK, I will upload the code in 10 minutes.
Give me test code
OK, here it is: https://github.com/nothingherern/OCR_bench (the main file is run_test.py).
I will think about whether it is worth integrating them directly
In fact, you can use LMDeploy or Xinference to deploy local VLMs as OpenAI-compatible services; then you can use them directly via the llm-ocr module by configuring the endpoint and override_model correctly (LMDeploy requires a strict model name match).
The downside is that you will end up with two PyTorch installations on your computer.
By the way, LMDeploy only supports AWQ and GPTQ quantization plus full weights on the PyTorch engine, and its Windows installation requires windows-triton==3.2.0.post18 instead of triton. Xinference theoretically supports any quantization that Transformers supports, but its OpenAI service is a bit awkward to launch, so I have never tried it.
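As a rough illustration of that route (not the project's actual llm-ocr code), calling a locally served VLM through the OpenAI protocol could look like the sketch below; the port, endpoint, and served model name are assumptions:

```python
import base64
from openai import OpenAI

# Assumed local endpoint -- point this at wherever LMDeploy/Xinference is serving.
client = OpenAI(base_url="http://127.0.0.1:23333/v1", api_key="dummy")

with open("page.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    # For LMDeploy this name must exactly match the served model (override_model).
    model="Qwen2.5-VL-32B-Instruct-AWQ",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            {"type": "text",
             "text": "Give me text from image, written in English language, nothing else."},
        ],
    }],
)
print(response.choices[0].message.content)
```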
Why not Ollama or llama.cpp? Their VLM support requires a separate projection model, which needs more VRAM to load.
Why not vLLM or SGLang? I've tried them, but their accuracy is much worse than the Transformers engine with the same model (Qwen2.5-VL-32B); I hope you can give them a try and figure out why.
In my tests, both Qwen2.5-VL-7B and OlmOCR perform worse than Qwen2.5-VL-32B. If you have enough VRAM, you can try Qwen2.5-VL-32B-AWQ or InternVL3-38B-AWQ; they are basically the same, although InternVL's paper shows InternVL's models have better performance.
If it's any use to you, instead of transformers you can find GGUF versions and use them with llama.cpp. I've tried using Gemma 3 for OCR with varying degrees of success.
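A rough sketch of that GGUF route via the llama-cpp-python bindings is below. The file names are placeholders, and whether the LLaVA-style chat handler matches a given Gemma 3 or Qwen GGUF is an assumption; note the separate mmproj projector file, which is the extra projection model mentioned above:

```python
import base64
from llama_cpp import Llama
from llama_cpp.llama_chat_format import Llava15ChatHandler

# Placeholder file names; the mmproj file is the separate vision projector that
# must be downloaded alongside the main GGUF weights.
chat_handler = Llava15ChatHandler(clip_model_path="mmproj-model-f16.gguf")
llm = Llama(
    model_path="vlm-model-Q4_K_M.gguf",
    chat_handler=chat_handler,
    n_ctx=4096,
)

with open("page.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

result = llm.create_chat_completion(messages=[{
    "role": "user",
    "content": [
        {"type": "image_url",
         "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        {"type": "text",
         "text": "Give me text from image, written in English language, nothing else."},
    ],
}])
print(result["choices"][0]["message"]["content"])
```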
Can you explain what the problem is with embedding visual models? I made a draft Qwen2-VL based on ocr_manga and it works. However, it doesn't run in bf16.
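If the bf16 failure comes from the GPU rather than the model, a common workaround is to pick the dtype at load time; a minimal sketch, with the repo id as a placeholder:

```python
import torch
from transformers import Qwen2VLForConditionalGeneration

# Fall back to float16 on GPUs without bfloat16 support (e.g. pre-Ampere cards).
dtype = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "your-namespace/Qwen2-VL-OCR-2B-Instruct",  # placeholder repo id
    torch_dtype=dtype,
    device_map="auto",
)
```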