
Feature Request: Please add local LLM based OCR engines

Open • nothingherern opened this issue 8 months ago • 8 comments

Version Info

Python version: 3.11.9
Python executable: E:\Python311\python.exe
Version: 1.4.0
Branch: dev

Type of Request

New Feature

Description

Please add callisto_ocr3_2b_instruct, qwen2_vl_ocr_2B_instruct, and OLM-OCR as local-model OCR engines.

I did some research and found some good OCR models. [screenshot] I tested the OCRs on five languages (English, Japanese, Chinese, Korean, Russian), and the best of the best is OLM-OCR. [screenshot] But it's a heavy 7B model, so it's better to use callisto_ocr3_2b_instruct and qwen2_vl_ocr_2B_instruct; these models are really fast and capable. As for the prompt, use:

Give me text from image, written in {lang} language, nothing else.

where {lang} is the language of the image (e.g., English, Korean).
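
For illustration, here is a minimal sketch of how such an engine could call one of these models through Hugging Face transformers; the model ID and the ocr_image helper are illustrative assumptions, not BallonsTranslator's actual API:

```python
# Minimal sketch of a transformers-based OCR engine (illustrative, not
# BallonsTranslator's real engine interface). The model ID is an assumption.
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

MODEL_ID = "prithivMLmods/Qwen2-VL-OCR-2B-Instruct"  # illustrative model ID

model = Qwen2VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

def ocr_image(image_path: str, lang: str) -> str:
    """Run the proposed OCR prompt on a single image."""
    image = Image.open(image_path).convert("RGB")
    messages = [{
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text",
             "text": f"Give me text from image, written in {lang} language, nothing else."},
        ],
    }]
    prompt = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=512)
    # Drop the prompt tokens so only the generated answer is decoded.
    output = output[:, inputs.input_ids.shape[1]:]
    return processor.batch_decode(output, skip_special_tokens=True)[0]
```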

Pictures

No response

Additional Information

No response

nothingherern commented Apr 26 '25 17:04

Can they be connected via Ollama or LM Studio? On the llama.cpp engine?

I am also interested in the software that you used to conduct the tests. Please provide a link or code.

bropines commented Apr 26 '25 17:04

> Can they be connected via Ollama or LM Studio? On the llama.cpp engine?

No, only via transformers, as far as I know.

> I am also interested in the software that you used to conduct the tests. Please provide a link or code.

OK, I will upload the code in 10 minutes.

nothingherern commented Apr 26 '25 17:04

> Can they be connected via Ollama or LM Studio? On the llama.cpp engine?

> No, only via transformers, as far as I know.

Give me the test code.

bropines commented Apr 26 '25 17:04

OK, here it is: https://github.com/nothingherern/OCR_bench. The main file is run_test.py.

nothingherern commented Apr 26 '25 17:04

> OK, here it is: https://github.com/nothingherern/OCR_bench. The main file is run_test.py.

I will think about whether it is worth integrating them directly.

bropines commented Apr 26 '25 18:04

In fact, you can use LMDeploy or Xinference to deploy local VLMs as OpenAI-compatible services; you can then use them directly via the llm-ocr module by configuring the endpoint and override_model correctly (LMDeploy requires a strict model-name match).

The drawback is that you will end up with two PyTorch installations on your computer.

By the way, LMDeploy only supports AWQ and GPTQ quantization and full weights on the PyTorch engine, and its Windows installation requires windows-triton==3.2.0.post18 instead of triton. Xinference theoretically supports any quantization that Transformers supports, but its OpenAI service is awkward enough to launch that I have never tried it.
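
For example, once such a server is running, the call reduces to a standard OpenAI-compatible request. A sketch under the assumptions above (the port is LMDeploy's default, and the ocr_via_endpoint helper is illustrative):

```python
# Sketch of calling a locally deployed VLM through its OpenAI-compatible
# endpoint. Port 23333 is LMDeploy's default; the model name must match
# the served model exactly (the strict match mentioned above).
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:23333/v1", api_key="not-needed")

def ocr_via_endpoint(image_path: str, lang: str, model: str) -> str:
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model=model,  # e.g. the override_model value configured in llm-ocr
        messages=[{
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
                {"type": "text",
                 "text": f"Give me text from image, written in {lang} language, nothing else."},
            ],
        }],
    )
    return resp.choices[0].message.content
```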

Why not Ollama or llama.cpp? Their VLM support requires a separate projection model, which needs more VRAM to load.

Why not vLLM or SGLang? I've tried them, but their accuracy is much worse than the Transformers engine with the same model (Qwen2.5-VL-32B); I hope you can give them a try and figure out why.

In my tests, both Qwen2.5-VL-7B and OlmOCR perform worse than Qwen2.5-VL-32B. If you have enough VRAM, you can try Qwen2.5-VL-32B-AWQ or InternVL3-38B-AWQ; they are basically equivalent, though InternVL's paper shows its models performing better.

RoadToNowhereX commented Apr 27 '25 02:04

> Can they be connected via Ollama or LM Studio? On the llama.cpp engine?

> No, only via transformers, as far as I know.

> I am also interested in the software that you used to conduct the tests. Please provide a link or code.

> OK, I will upload the code in 10 minutes.

If it's of any use to you: instead of transformers, you can find GGUF versions and use them with llama.cpp. I've tried Gemma 3 for OCR with varying degrees of success.
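
A rough sketch of that route, assuming a recent llama.cpp build with multimodal support; the file names, port, and launch flags are illustrative:

```python
# Sketch: serve a GGUF vision model with llama.cpp's bundled server, e.g.
#   llama-server -m gemma-3-4b-it-Q4_K_M.gguf --mmproj mmproj-model-f16.gguf --port 8080
# (file names are illustrative; the --mmproj file is the vision projector
# that ships alongside the GGUF weights). The endpoint then accepts the
# same OpenAI-style image request shown in the earlier sketch.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
# ...reuse ocr_via_endpoint(image_path, lang, model="gemma-3") from above.
```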

Alexamenus commented Apr 29 '25 16:04

Can you explain what the problem is with integrating vision models? I made a draft Qwen2-VL engine based on ocr_manga and it works. However, it doesn't run in bf16.
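
One possible workaround, assuming the bf16 failure comes from the GPU rather than the model (the exact error isn't shown), is to select the dtype at runtime:

```python
# Fall back to float16 on GPUs without bfloat16 support (e.g. pre-Ampere cards).
import torch

dtype = (
    torch.bfloat16
    if torch.cuda.is_available() and torch.cuda.is_bf16_supported()
    else torch.float16
)
# Pass torch_dtype=dtype to from_pretrained(...) when loading the model.
```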

heinrichI commented May 06 '25 16:05