image is broken for LLaVa models

Open psych0v0yager opened this issue 1 year ago • 7 comments

The bug: I am attempting to run LLaVA through guidance's LlamaCpp backend and I am getting incorrect responses.

To Reproduce

# Imports
import guidance
from guidance import image
from guidance import user, assistant, system
from guidance import gen, select
from guidance import capture, Tool, regex

# Paths
path_miqu = "/llava-v1.6-34b.Q6_K.gguf"

# Models
llama2 = guidance.models.LlamaCpp(path_miqu, n_gpu_layers=-1, n_ctx=5000)

lm = llama2 
lm += "USER: What is this an image of?" + image("/forza.png") + "\nASSISTANT:" + gen(stop="</s>")

Output: (screenshot of the incorrect response)

Alternatively, using a different image:

(screenshot of another incorrect response)

System info (please complete the following information):

  • OS: Ubuntu 22.04
  • Guidance version: 1.10

The model works perfectly in Ollama and generates the proper responses.

psych0v0yager avatar Feb 27 '24 23:02 psych0v0yager

I attempted to run the same image using llama-cpp-python directly, and it worked successfully:

import base64

def image_to_base64_data_uri(file_path):
    with open(file_path, "rb") as img_file:
        base64_data = base64.b64encode(img_file.read()).decode('utf-8')
        return f"data:image/png;base64,{base64_data}"

# Replace with the actual path to your image file
file_path = '/riverwood.jpg'
data_uri = image_to_base64_data_uri(file_path)


from llama_cpp import Llama
from llama_cpp.llama_chat_format import Llava15ChatHandler
chat_handler = Llava15ChatHandler(clip_model_path="/mmproj-model-f16.gguf")
llm = Llama(
  model_path="/llava-v1.6-34b.Q6_K.gguf",
  chat_handler=chat_handler,
  n_ctx=2048,  # n_ctx should be increased to accommodate the image embedding
  logits_all=True,  # needed to make llava work
  n_gpu_layers=-1
)
llm.create_chat_completion(
    messages = [
        {"role": "system", "content": "You are an assistant who perfectly describes images."},
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": data_uri }},
                {"type" : "text", "text": "Describe this image in detail please."}
            ]
        }
    ]
)

Output

The image depicts a scene from what appears to be a video game, given the graphical style and the characters' attire. In the foreground, there is a character wearing armor with a helmet that has horns attached.

I attempted to modify the guidance LlamaCpp object by passing in the chat handler:

# Imports
import guidance
from guidance import image
from guidance import user, assistant, system
from guidance import gen, select
from guidance import capture, Tool, regex


# Paths
path_miqu = "/llava-v1.6-34b.Q6_K.gguf"

# Models
from llama_cpp.llama_chat_format import Llava15ChatHandler
chat_handler = Llava15ChatHandler(clip_model_path="/CloserModels/mmproj-model-f16.gguf")
llama2 = guidance.models.LlamaCpp(path_miqu, n_gpu_layers=-1, n_ctx=5000, chat_handler=chat_handler)  # n_ctx=8192

However, this made no difference and the model still failed.

psych0v0yager avatar Feb 28 '24 00:02 psych0v0yager

I am facing this same problem. Any update on this issue?

ishrat-tl avatar Mar 13 '24 19:03 ishrat-tl

Any updates?

eliranwong avatar Apr 19 '24 07:04 eliranwong

I don't think image support for open-source multimodal models is being actively pursued; see https://github.com/guidance-ai/guidance/issues/554#issuecomment-1878163908. This support was introduced only for the Gemini models.

ishrat-tl avatar Apr 19 '24 17:04 ishrat-tl

I have been digging into this a bit. You can pass a Llama model directly into guidance's models.LlamaCpp and use it. However: the image object that guidance uses seems to be made to support VertexAI's image capabilities; I can't figure out how to pass {"type": "image_url", "image_url": {"url": data_uri }} to the LlamaCpp model and that's where I'm stuck.
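
For reference, a minimal sketch of that approach, based on the working llama-cpp-python example earlier in this thread: build the llama_cpp.Llama yourself with the LLaVA chat handler and pass it to guidance.models.LlamaCpp instead of a model path. This is only a sketch; whether guidance then routes image() content through the chat handler is exactly the open question here.

from llama_cpp import Llama
from llama_cpp.llama_chat_format import Llava15ChatHandler

import guidance
from guidance import image, gen

# Build the llama-cpp-python model with the LLaVA projector, as in the
# working create_chat_completion example above.
chat_handler = Llava15ChatHandler(clip_model_path="/mmproj-model-f16.gguf")
llm = Llama(
    model_path="/llava-v1.6-34b.Q6_K.gguf",
    chat_handler=chat_handler,
    logits_all=True,  # needed to make llava work
    n_ctx=2048,
    n_gpu_layers=-1,
)

# Pass the prebuilt Llama object directly to guidance instead of a model path.
lm = guidance.models.LlamaCpp(llm)
lm += "USER: What is this an image of?" + image("/forza.png") + "\nASSISTANT:" + gen(stop="</s>")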

D-Vaillant avatar Apr 30 '24 21:04 D-Vaillant

> I have been digging into this a bit. You can pass a Llama model directly into guidance's models.LlamaCpp and use it. However: the image object that guidance uses seems to be made to support VertexAI's image capabilities; I can't figure out how to pass {"type": "image_url", "image_url": {"url": data_uri }} to the LlamaCpp model and that's where I'm stuck.

Agreed. I have also been stuck exactly there. If I can figure out how {"type": "text", "text": "some prompt"} is converted in the guidance model, I think I can figure out how the image_url can be passed in a similar fashion.
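
If it helps as a starting point, the target format on the llama-cpp-python side is just the base64 data-URI content dict from the create_chat_completion example above, so a conversion helper (the name and placement here are hypothetical, not an existing guidance API) would look roughly like this:

import base64

def bytes_to_image_url_content(image_bytes, mime="image/png"):
    # Hypothetical helper: wrap raw image bytes (e.g. the contents of the file
    # passed to guidance's image()) into the content dict that llama-cpp-python's
    # LLaVA chat handler already accepts.
    data_uri = "data:" + mime + ";base64," + base64.b64encode(image_bytes).decode("utf-8")
    return {"type": "image_url", "image_url": {"url": data_uri}}

The open question is where in guidance's LlamaCpp model such a dict would have to be injected so that the chat handler actually sees it.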

ishrat-tl avatar May 01 '24 12:05 ishrat-tl

@slundberg Could you please provide us with high-level pointers to start our dev efforts? Thx!

ishrat-tl avatar May 01 '24 12:05 ishrat-tl