Feature Request: Support for ERNIE-4.5-VL
Prerequisites
- [x] I am running the latest code. Mention the version if possible as well.
- [x] I carefully followed the README.md.
- [x] I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- [x] I reviewed the Discussions, and have a new and useful enhancement to share.
Feature Description
I would like multimodal support for https://huggingface.co/baidu/ERNIE-4.5-VL-28B-A3B-PT.
Motivation
I couldn't find any discussion or issue for this, but it's the best open-source model I could find for OCR of handwritten Japanese and Chinese text that actually kind of works. It's worse than OpenAI's recognition, but on the 3 test images I use to evaluate the OCR capabilities of open-source models it performed OK (OK being better than everything else I tested). It's better than:
- gemma
- qwen
- intern
- lfm2
- kimi
(I think I tested mimo, but I can't find my setup or results, so maybe mimo is OK too?)
... and every other open model I could find.
Possible Implementation
- https://ernie.baidu.com/blog/publication/ERNIE_Technical_Report.pdf
- https://github.com/bigdavidone/ERNIE4_5
- https://github.com/vllm-project/vllm/pull/20220
I'll try, but I might need help from some competent people (@CISC @ngxson) because the model has some pretty atypical tensor configurations: there are basically double expert layers, with a big and a small expert for each layer, and a nontrivial mixing algorithm for them. I'll share my branch where I managed to get the easy part (mmproj extraction) done.
Also, just FYI: vLLM does not support the visual component yet, for exactly the same reason: the implementation is very complex. As far as I know, the support for the visual component with those pesky tensors is only available on the PaddlePaddle native Ernie implementation.
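For anyone who wants to look at the atypical tensor layout before touching a converter, here is a minimal sketch that summarizes tensor names from a checkpoint index. It assumes the Hugging Face repo ships a standard sharded-safetensors index file (`model.safetensors.index.json` with a `weight_map` key); I have not verified that for this model. The tensor names in the usage note are made up for illustration and are not the real ERNIE names.

```python
import json
import re
from collections import Counter


def load_weight_map(index_path):
    """Read the tensor-name -> shard-file mapping from a safetensors index.

    Assumes the usual Hugging Face layout: a JSON object with a "weight_map" key.
    """
    with open(index_path, encoding="utf-8") as f:
        return json.load(f)["weight_map"]


def summarize_tensor_names(weight_map):
    """Replace numeric path components (layer/expert indices) with "N"
    and count how many tensors share each resulting name pattern."""
    patterns = Counter()
    for name in weight_map:
        patterns[re.sub(r"\.\d+(?=\.|$)", ".N", name)] += 1
    return patterns


def print_summary(patterns):
    """Print each name pattern with its tensor count, most frequent first."""
    for pattern, count in patterns.most_common():
        print(f"{count:6d}  {pattern}")
```

Usage would be something like `print_summary(summarize_tensor_names(load_weight_map("model.safetensors.index.json")))`. Patterns with two index slots (for example a hypothetical `model.layers.N.mlp.experts.N.up_proj.weight`) are per-expert tensors, so distinct expert groups in the same layer should show up as separate patterns.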
I should have taken a closer look at this. Thank you very much for taking a stab at it.
vLLM now supports the ERNIE-4.5-VL model: https://github.com/vllm-project/vllm/pull/22514
This issue was closed because it has been inactive for 14 days since being marked as stale.
Same request. This is an incredible model worth supporting.