Feature Request: Can llama.cpp add support for DeepSeek OCR?
Prerequisites
- [x] I am running the latest code. Mention the version if possible as well.
- [x] I carefully followed the README.md.
- [x] I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- [x] I reviewed the Discussions, and have a new and useful enhancement to share.
Feature Description
LLMs often rely on upstream and downstream models during inference. Could llama.cpp add support for DeepSeek OCR?
Motivation
https://huggingface.co/deepseek-ai/DeepSeek-OCR
Possible Implementation
No response
In case it helps, DeepSeek has shared the code for the vLLM implementation here: DeepSeek-OCR vLLM
same request
I'd like to know if it's possible to convert the existing DeepSeek-OCR model to the llama.cpp format. Thanks!
I'd be happy to help if no one else is working on it (I'm looking to deploy it on my server as well haha)
Given the efficiency of the model, it would be very interesting to see it supported. 64 visual tokens for a large text image is stunning. This model might be pivotal for a ton of use cases, aside from being a decent visual model. And most of those use cases don't degrade under strong quantization. The possibilities it opens up are very interesting.
Bumping this thread. I would appreciate hearing, one way or the other, whether this is going to be supported.
(I know this comment kinda defeats the purpose, but) please don't "bump this thread"; this is not 4chan. A thumbs-up on the original post is enough to "bump" it, and it avoids spamming everybody following this thread with useless comments.
> I'd be happy to help if no one else is working on it (I'm looking to deploy it on my server as well haha)
Just do it :)
@MaoJianwei Actually, I am working on it. This might take a while.
I found that DeepSeek OCR GGUF is available: https://huggingface.co/NexaAI/DeepSeek-OCR-GGUF
Kindly do it; I'm waiting impatiently for it.
Quick update: I’ve been working on DeepSeek-OCR implementation for a while. Its architecture is quite unique—particularly the two-ViT-encoder design—which differs from the VL models currently implemented in llama.cpp. Combined with the fact that I’m still getting familiar with the mtmd code, it’s taken some time to sort out all of the details.
I haven’t opened a PR yet, but I expect to have one ready this weekend.
Here’s my working branch: https://github.com/sfallah/llama.cpp/tree/sf/deepseek-ocr
@bluebread I’d be happy to collaborate with you or anyone else already working on this too.
@sfallah I read your code and I'm at a similar place with this. SAM's local attention requires the ggml_win_part & ggml_win_unpart operations for the 14x14 window partitioning, and it also requires the ggml_get_rel_pos & ggml_add_rel_pos operations to apply the relative position embeddings, but these operations aren't currently supported in the CUDA backend. It also looks like we have to adapt the LM code (I'm checking the details of llm_build_deepseek2). Perhaps we can implement a CPU version first and then add CUDA backend support? It would be great if we could work together and get it done this weekend!!
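For anyone following along, here is a minimal sketch of how those SAM-style ops compose in a ggml graph; the helper name, tensor shapes, and the placement of the attention math are illustrative assumptions, not code from the actual branch:

```cpp
#include "ggml.h"

// Minimal sketch: round-trip a [C, W, H, 1] feature map through SAM-style
// 14x14 window partitioning. ggml_win_part / ggml_win_unpart (and the
// ggml_get_rel_pos / ggml_add_rel_pos calls that would sit inside the
// attention block) currently only have CPU kernels, which is the blocker
// mentioned above.
static ggml_tensor * sam_local_attn_sketch(ggml_context * ctx0, ggml_tensor * cur) {
    const int window = 14;            // SAM local-attention window size
    const int w0 = (int) cur->ne[1];  // original width, needed to undo the padding
    const int h0 = (int) cur->ne[2];  // original height

    // partition into non-overlapping windows; pads W/H up to a multiple of 14
    cur = ggml_win_part(ctx0, cur, window);

    // ... per-window Q/K/V attention goes here, with ggml_add_rel_pos applied
    //     to the attention scores to inject the relative position embedding ...

    // merge the windows back into the original spatial layout
    cur = ggml_win_unpart(ctx0, cur, w0, h0, window);
    return cur;
}
```

So the main decision seems to be whether to port those ops to the CUDA (and other) backends or to rewrite the window logic with ops that already have GPU kernels.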
@bluebread Yes, I agree — the priority is to get a correct first implementation (including the converter) running, even if it’s CPU-backend-only at the start.
And absolutely, happy to continue working together on this, through the weekend and beyond. Let me know if you’d like to have write access to my branch.
@sfallah Yes, please give me write access. I'll move my code to your branch. Edit: I've already debugged your convert_hf_to_gguf.py script and got it working for the vision model. Ready when you are :)
@bluebread Great, can you please open a PR to my branch? PR makes it easier to work on the code together.
You both are awesome! Thank you for your efforts @bluebread @sfallah
I know you all are probably interested in pure CPU implementations, but just as a data point... I tried DeepSeek-OCR today on my Blackwell 6000 Pro using vLLM (main) and it processed a 20-page PDF to markdown in about a minute.
I see Ollama supports DeepSeek-OCR now. Does that help llama.cpp? I've already migrated most of my model APIs to llama.cpp.
Hello guys,
First of all, thank you very much for your efforts, really appreciated.
I am very keen to try your implementation. Is there an official .gguf known to work that can be downloaded?
@createthis It will support the CUDA backend (and hopefully other major backends as well). It looks like there's a way to work around the limitation I mentioned earlier.
@wenerme Thanks, I'll take a look.
@skoulik Sorry, we're not done yet... the LM part is implemented, but the vision model is still in progress...
@bluebread I've created an MPS/CPU compatible version of DeepSeek-OCR model here: https://huggingface.co/Dogacel/DeepSeek-OCR-Metal-MPS
I hope it helps you somehow.
An update on the state of the PR: https://github.com/ggml-org/llama.cpp/pull/17400
The PR is still a draft and we are still proceeding with development.
The good news is that we have solved all the major issues/blockers, such as multi-device support. This means we currently have a vision model that runs on Metal (my own dev laptop) and will also run on all major backends/devices, CUDA included.
My estimate is that we will open the PR for review by Monday (24.05.25), and hopefully the maintainers (the code owners) will have time for our PR so we can get it done next week.
Hi, I have a question / feature request about DeepSeek-OCR support in llama.cpp.
In the Transformers implementation of DeepSeek-OCR you can intercept the encoder output and use it as a compact latent representation. This enables real compression workflows, not just standard OCR.
Concretely, Transformers lets you:
- Run the vision or text encoder once
- Capture the encoder hidden states or latents
- Save them to disk (.pt, .npy, etc.)
- Later skip the encoder entirely and feed those latents back into the decoder
This is extremely useful for:
- Document and text compression
- Conversation or document memory systems
- Fast repeated queries on the same content
- Low-latency inference using cached latents instead of re-encoding images every time
My question is:
For the DeepSeek-OCR integration in llama.cpp, will there be a way to
- Expose the encoder outputs or latent tensors, and
- Run the model in a “decode only from cached latents” mode, where you load latents and skip the encoder,
or will llama.cpp only support a simple end to end image to text pipeline without access to intermediate encoder states?
Encoder latent workflows are a major use case of DeepSeek-OCR besides normal OCR, and having access to encoder outputs in llama.cpp would make it possible to build compression-based and cached-query systems on top of it.
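To make the request concrete, below is a rough sketch of what a "decode from cached latents" path could look like, built on llama.cpp's existing ability to carry embeddings instead of tokens in a llama_batch. The raw-float file format, the helper name, and the assumption that the DeepSeek-OCR integration will expose the encoder output this way are all hypothetical on my part:

```cpp
#include "llama.h"

#include <algorithm>
#include <cstdio>
#include <vector>

// Hypothetical sketch: drive the decoder from previously cached vision latents,
// skipping the image encoder entirely. Assumes the latents were dumped earlier
// as raw float32 data of shape [n_latents x n_embd] by some (not yet existing)
// encoder-output hook in the DeepSeek-OCR integration.
static bool decode_cached_latents(llama_context * lctx, const char * path,
                                  int32_t n_latents, int32_t n_embd, llama_pos pos0) {
    std::vector<float> latents((size_t) n_latents * n_embd);

    FILE * f = std::fopen(path, "rb");
    if (!f || std::fread(latents.data(), sizeof(float), latents.size(), f) != latents.size()) {
        if (f) { std::fclose(f); }
        return false;
    }
    std::fclose(f);

    // build a batch that carries embeddings instead of token ids
    llama_batch batch = llama_batch_init(n_latents, n_embd, 1);
    batch.n_tokens = n_latents;
    std::copy(latents.begin(), latents.end(), batch.embd);
    for (int32_t i = 0; i < n_latents; ++i) {
        batch.pos[i]       = pos0 + i;
        batch.n_seq_id[i]  = 1;
        batch.seq_id[i][0] = 0;
        batch.logits[i]    = (i == n_latents - 1); // only need logits after the last latent
    }

    const bool ok = llama_decode(lctx, batch) == 0;
    llama_batch_free(batch);
    return ok;
}
```

The key point is only that the decoder can be fed from saved embeddings without re-running the vision encoder; the exact API shape is of course up to the maintainers.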
@Dogacel FYI: I am using your Dogacel/DeepSeek-OCR-Metal-MPS in development for comparing results. Good job! Thank you
@xsploit I don't think it's that easy. At the least, one needs to fine-tune the model to understand the vision tokens, just like teaching an English model to speak Chinese.
> I don't think it's that easy. At the least, one needs to fine-tune the model to understand the vision tokens, just like teaching an English model to speak Chinese.
You don't need to fine-tune anything for the compression workflows that DeepSeek-OCR already supports. DeepSeek-OCR encodes images into a very small number of "vision tokens" and the decoder already knows how to turn those tokens back into text. That mapping is trained end-to-end as part of the model. As long as you use the encoder outputs exactly the way the model was designed to generate them, no extra training is required.
The compression is built-in. The encoder reduces thousands of text tokens down to 64–400 vision tokens (or more in the Gundam modes). The decoder already understands those compressed tokens because that is literally what it was trained on. This is the whole point of the model: compress → decode.
Think of it like JPEG: the encoder creates compressed bytes, and the decoder is built to decode them. No fine-tuning needed.
The only thing users need is access to the encoder output so they can save the compressed tokens, reuse them later, skip re-encoding, and feed them straight into the decoder. This workflow already works today in the Transformers implementation.
The feature request for llama.cpp is just to expose the encoder output and allow a “decode-from-cached-latents” mode. No finetuning is required for this, because it stays within DeepSeek-OCR’s own encoder->decoder pipeline.
I’ve actually been experimenting with this already using the PyTorch version: encoding once, caching the compressed representation, and reusing it later. The main problem is speed: without FlashAttention 2 on Windows, the encoder pass is very slow, and I don’t really want to move everything to Linux/WSL just to fix that. That’s why I’m hoping the llama.cpp integration can support the same “encode once, decode many times” flow, but with llama.cpp’s usual performance tricks (good CPU/GPU backends, quantization, and possibly FlashAttention or similar where available), so this compression use case is actually practical on normal consumer hardware.
Hello everyone, sorry for the delay. The work is still in progress; there have been some complications, but I am working hard on finishing the PR.
🎉 Good news: we just finished a functional implementation of DeepSeek-OCR for llama.cpp! While it's still imperfect and currently only supports Base mode (1024 x 1024) on the CPU backend, @sfallah and I are pushing hard to complete the full feature ASAP!
You’re both doing an exceptional job, well done! 🎉