Feature Request: add support to LLaVA OneVision
Prerequisites
- [X] I am running the latest code. Mention the version if possible as well.
- [X] I carefully followed the README.md.
- [X] I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- [X] I reviewed the Discussions, and have a new and useful enhancement to share.
Feature Description
LLaVA has a new version, OneVision, released on 2024-08-06.
Links: HuggingFace, GitHub, Release Notes
LLaVA OneVision uses SO400M as the vision encoder and Qwen2 as the language model, with trainable components including a projector and, in later stages, the full model.
I'm no expert, but as I understand it, the architecture is similar to the previous versions; only the vision encoder and the language model are different.
llama.cpp LLaVA support: https://github.com/ggerganov/llama.cpp/tree/master/examples/llava
Motivation
Compared to the currently supported LLaVA 1.6, it provides the following features:
- Supports various input resolutions, up to 2304 * 2304 pixels.
- A single image is represented by at most 729 * (9+1) tokens under the anyres_max_9 mode.
- Supports multi-image and video inputs: multi-image input uses 729 tokens per image, and video input uses 196 tokens per frame.
- Available in three sizes (0.5B, 7B, and 72B parameters) to fit different memory and inference-latency requirements.
- Better support for Set-of-Mark prompting.
- and more...
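To make the token budgets above concrete, here is a minimal sketch of the arithmetic implied by those numbers. The function names and the `max_tiles`/`tokens_per_frame` parameters are my own illustration, not part of any LLaVA OneVision API; the exact counts in practice depend on the model's image-splitting strategy.

```python
BASE_TOKENS = 729  # tokens per image representation (27x27 patches)

def single_image_tokens(max_tiles: int = 9) -> int:
    # anyres_max_9 mode: up to 9 high-resolution tiles plus 1 base view
    return BASE_TOKENS * (max_tiles + 1)

def multi_image_tokens(num_images: int) -> int:
    # multi-image input: one 729-token representation per image
    return BASE_TOKENS * num_images

def video_tokens(num_frames: int, tokens_per_frame: int = 196) -> int:
    # video input: each frame is downsampled to 196 tokens
    return tokens_per_frame * num_frames

print(single_image_tokens())   # 7290 tokens for one image at anyres_max_9
print(multi_image_tokens(4))   # 2916 tokens for four images
print(video_tokens(32))        # 6272 tokens for a 32-frame video
```

So a single high-resolution image costs about as much context as a 37-frame video clip, which is why the per-frame representation is compressed so aggressively.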
Possible Implementation
No response