llama.cpp

Feature Request: Support for ERNIE-4.5-VL

Open Som-anon opened this issue 3 months ago • 6 comments

Prerequisites

  • [x] I am running the latest code. Mention the version if possible as well.
  • [x] I carefully followed the README.md.
  • [x] I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • [x] I reviewed the Discussions, and have a new and useful enhancement to share.

Feature Description

I would like https://huggingface.co/baidu/ERNIE-4.5-VL-28B-A3B-PT multimodal support.

Motivation

I couldn't find any discussion or issue for this, but it's the best open-source model I could find for OCR of handwritten Japanese and Chinese text that actually kind of works. It's worse than OpenAI's recognition, but on the 3 test images I use to evaluate the OCR capabilities of open-source models it performed OK (OK being better than everything else I tested). It's better than:

  • gemma
  • qwen
  • intern
  • lfm2
  • kimi

(I think I tested mimo, but I can't find my setup or results... so maybe mimo is ok too?) ... and every other open model I could find.

Possible Implementation

  • https://ernie.baidu.com/blog/publication/ERNIE_Technical_Report.pdf
  • https://github.com/bigdavidone/ERNIE4_5
  • https://github.com/vllm-project/vllm/pull/20220

Som-anon • Aug 22 '25 19:08

I'll try, but I might need help from some competent people (@CISC @ngxson), because the model has some pretty atypical tensor configurations: basically there are double expert layers, with a big and a small expert for each layer, and a nontrivial mixing algorithm for them. I'll share my branch, where I managed to get the easy part (mmproj extraction) done.

pwilkin • Aug 23 '25 12:08

Also, just FYI: vLLM does not support the visual component yet, for exactly the same reason: the implementation is very complex. As far as I know, the support for the visual component with those pesky tensors is only available on the PaddlePaddle native Ernie implementation.

pwilkin • Aug 23 '25 19:08

I should have taken a closer look at this. Thank you very much for taking a stab at it.

Som-anon • Aug 25 '25 15:08

vLLM now supports the ERNIE-4.5-VL model: https://github.com/vllm-project/vllm/pull/22514

m1namuci • Aug 27 '25 09:08

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions[bot] • Oct 11 '25 01:10

Same request; this is an incredible model worth supporting.

04RR • Nov 13 '25 17:11