
Feature Request: add support to LLaVA OneVision

Open alexrah opened this issue 6 months ago • 0 comments

Prerequisites

  • [X] I am running the latest code. Mention the version if possible as well.
  • [X] I carefully followed the README.md.
  • [X] I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • [X] I reviewed the Discussions, and have a new and useful enhancement to share.

Feature Description

LLaVA has a new version, OneVision, released on 2024-08-06.

HuggingFace GitHub Release Notes

LLaVA OneVision uses SO400M as the vision encoder and Qwen-2.0 as the language model, with trainable components including a projector and the full model in later stages.

I'm no expert, but as I understand it, the architecture is similar to the previous versions; however, both the vision encoder and the language model are different.

llama.cpp LLaVA support: https://github.com/ggerganov/llama.cpp/tree/master/examples/llava

Motivation

Compared to the currently supported LLaVA 1.6, it provides the following features:

  • Supports various input resolutions up to 2304 * 2304 pixels.
  • Single image input is represented by at most 729 * (9+1) tokens under anyres_max_9 mode.
  • Supports multi-image and video inputs. Multi-image input is represented by 729 tokens per image, and video input is represented by 196 tokens per frame.
  • Available in three sizes: 0.5B, 7B, and 72B parameter versions, fitting different memory and inference-latency requirements.
  • Better support for Set-of-Mark prompting
  • and more...
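To make the token budgets above concrete, here is a rough back-of-the-envelope calculator using the figures from the release notes (729 tokens per image tile, up to 9+1 tiles under anyres_max_9, 196 tokens per video frame). These helpers are purely illustrative and are not part of llama.cpp or the LLaVA codebase:

```python
# Illustrative visual-token budget estimates for LLaVA OneVision,
# based on the numbers quoted in the release notes (hypothetical helpers).

TOKENS_PER_TILE = 729    # tokens produced per image tile by the vision encoder
MAX_TILES = 9 + 1        # anyres_max_9 mode: up to 9 crops plus 1 base image
TOKENS_PER_FRAME = 196   # pooled token count per video frame

def single_image_budget() -> int:
    """Maximum tokens for a single image under anyres_max_9 mode."""
    return TOKENS_PER_TILE * MAX_TILES

def multi_image_budget(n_images: int) -> int:
    """Multi-image input: 729 tokens per image."""
    return TOKENS_PER_TILE * n_images

def video_budget(n_frames: int) -> int:
    """Video input: 196 tokens per frame."""
    return TOKENS_PER_FRAME * n_frames

print(single_image_budget())   # 7290
print(multi_image_budget(4))   # 2916
print(video_budget(32))        # 6272
```

This kind of estimate matters for llama.cpp support because the visual tokens share the context window with the text prompt, so a 32-frame video alone would consume over 6K tokens of context.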

Possible Implementation

No response

alexrah · Aug 09 '24 08:08