
Feature Request: add support to LLaVA OneVision

Open alexrah opened this issue 6 months ago • 0 comments

Prerequisites

  • [X] I am running the latest code. Mention the version if possible as well.
  • [X] I carefully followed the README.md.
  • [X] I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • [X] I reviewed the Discussions, and have a new and useful enhancement to share.

Feature Description

LLaVA has a new version, OneVision, released on 2024-08-06.

HuggingFace GitHub Release Notes

LLaVA OneVision uses SO400M as the vision encoder and Qwen-2.0 as the language model, with trainable components including a projector and the full model in later stages.

I'm no expert, but as I understand it, the architecture is similar to the previous versions; however, both the vision encoder and the language model are different.

llama.cpp LLaVA support: https://github.com/ggerganov/llama.cpp/tree/master/examples/llava

Motivation

Compared to the currently supported LLaVA 1.6, it provides the following features:

  • Supports various input resolutions up to 2304 * 2304 pixels.
  • Single image input is represented by at most 729 * (9+1) tokens under anyres_max_9 mode.
  • Supports multi-image and video inputs. Multi-image input is represented by 729 tokens per image, and video input is represented by 196 tokens per frame.
  • Available in three sizes: 0.5B, 7B, and 72B parameter versions, fitting different memory and inference-latency requirements.
  • Better support for Set-of-Mark prompting
  • and more...
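To make the token budgets above concrete, here is a rough back-of-the-envelope calculator using the figures from the release notes (729 tokens per image tile, up to 9+1 tiles under anyres_max_9, 196 tokens per video frame). These helpers are purely illustrative and are not part of llama.cpp or the LLaVA codebase:

```python
# Illustrative visual-token budget estimates for LLaVA OneVision,
# based on the numbers quoted in the release notes (hypothetical helpers).

TOKENS_PER_TILE = 729    # tokens produced per image tile by the vision encoder
MAX_TILES = 9 + 1        # anyres_max_9 mode: up to 9 crops plus 1 base image
TOKENS_PER_FRAME = 196   # pooled token count per video frame

def single_image_budget() -> int:
    """Maximum tokens for a single image under anyres_max_9 mode."""
    return TOKENS_PER_TILE * MAX_TILES

def multi_image_budget(n_images: int) -> int:
    """Multi-image input: 729 tokens per image."""
    return TOKENS_PER_TILE * n_images

def video_budget(n_frames: int) -> int:
    """Video input: 196 tokens per frame."""
    return TOKENS_PER_FRAME * n_frames

print(single_image_budget())   # 7290
print(multi_image_budget(4))   # 2916
print(video_budget(32))        # 6272
```

This kind of estimate matters for llama.cpp support because the visual tokens share the context window with the text prompt, so a 32-frame video alone would consume over 6K tokens of context.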

Possible Implementation

No response

alexrah · Aug 09 '24 08:08