image is broken for LLaVa models
The bug
I am attempting to run LLaVa through guidance's LlamaCpp model and I am getting incorrect responses.
To Reproduce
# Imports
import guidance
from guidance import image
from guidance import user, assistant, system
from guidance import gen, select
from guidance import capture, Tool, regex
# Paths
path_miqu = "/llava-v1.6-34b.Q6_K.gguf"
# Models
llama2 = guidance.models.LlamaCpp(path_miqu, n_gpu_layers=-1, n_ctx=5000)
lm = llama2
lm += "USER: What is this an image of?" + image("/forza.png") + "\nASSISTANT:" + gen(stop="</s>")
Output
Alternatively using a different image
System info (please complete the following information):
- OS: Ubuntu 22.04
- Guidance version: 1.10
The model works perfectly in Ollama and generates the proper responses
I attempted to run the same image using llama-cpp-python directly and it worked successfully:
import base64
import mimetypes

def image_to_base64_data_uri(file_path):
    # Guess the MIME type from the extension so a JPEG is not labeled image/png
    mime_type = mimetypes.guess_type(file_path)[0] or "image/png"
    with open(file_path, "rb") as img_file:
        base64_data = base64.b64encode(img_file.read()).decode("utf-8")
    return f"data:{mime_type};base64,{base64_data}"

# Replace with the actual path to your image file
file_path = '/riverwood.jpg'
data_uri = image_to_base64_data_uri(file_path)
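As a quick sanity check on the encoding (the bytes and helper name below are made up for illustration, not the actual image), the resulting URI should start with the data-URI scheme followed by the base64 PNG signature:

```python
import base64

def bytes_to_data_uri(data: bytes, mime: str = "image/png") -> str:
    # Same encoding as the file-based helper, but taking raw bytes
    # so it can be exercised without an image on disk
    return f"data:{mime};base64," + base64.b64encode(data).decode("utf-8")

uri = bytes_to_data_uri(b"\x89PNG\r\n", "image/png")
print(uri)  # data:image/png;base64,iVBORw0K
```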
from llama_cpp import Llama
from llama_cpp.llama_chat_format import Llava15ChatHandler

chat_handler = Llava15ChatHandler(clip_model_path="/mmproj-model-f16.gguf")
llm = Llama(
    model_path="/llava-v1.6-34b.Q6_K.gguf",
    chat_handler=chat_handler,
    n_ctx=2048,  # n_ctx should be increased to accommodate the image embedding
    logits_all=True,  # needed to make llava work
    n_gpu_layers=-1,
)
llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are an assistant who perfectly describes images."},
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": data_uri}},
                {"type": "text", "text": "Describe this image in detail please."},
            ],
        },
    ]
)
Output
The image depicts a scene from what appears to be a video game, given the graphical style and the characters' attire. In the foreground, there is a character wearing armor with a helmet that has horns attached.
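For reference, the OpenAI-style multimodal message list passed to create_chat_completion above can be factored into a small helper (make_vision_messages is my own name for illustration, not part of llama-cpp-python):

```python
def make_vision_messages(system_prompt: str, user_text: str, image_data_uri: str) -> list:
    # Builds OpenAI-style chat messages where the user turn mixes
    # an image_url part and a text part, as Llava15ChatHandler expects
    return [
        {"role": "system", "content": system_prompt},
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image_data_uri}},
                {"type": "text", "text": user_text},
            ],
        },
    ]

msgs = make_vision_messages(
    "You are an assistant who perfectly describes images.",
    "Describe this image in detail please.",
    "data:image/png;base64,AAAA",  # placeholder URI for illustration
)
```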
I attempted to modify the guidance LlamaCpp object by passing in the chat handler:
# Imports
import guidance
from guidance import image
from guidance import user, assistant, system
from guidance import gen, select
from guidance import capture, Tool, regex
# Paths
path_miqu = "/llava-v1.6-34b.Q6_K.gguf"
# Models
from llama_cpp.llama_chat_format import Llava15ChatHandler
chat_handler = Llava15ChatHandler(clip_model_path="/CloserModels/mmproj-model-f16.gguf")
llama2 = guidance.models.LlamaCpp(path_miqu, n_gpu_layers=-1, n_ctx=5000, chat_handler=chat_handler)  # n_ctx=8192
However, this made no difference and the model still failed.
I am facing this same problem. Any update on this issue?
Any updates?
I don't think image support for open-source multimodal models is being actively pursued; see https://github.com/guidance-ai/guidance/issues/554#issuecomment-1878163908. This support was introduced only for Gemini models.
I have been digging into this a bit. You can pass a Llama model directly into guidance's models.LlamaCpp and use it. However, the image object that guidance uses seems to be made to support VertexAI's image capabilities; I can't figure out how to pass {"type": "image_url", "image_url": {"url": data_uri}} to the LlamaCpp model, and that's where I'm stuck.
Agreed, I have also been stuck exactly there. If I can figure out how {"type": "text", "text": "some prompt"} is converted inside the guidance model, I think I can figure out how the image_url can be passed in a similar fashion.
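As a thought experiment only (this is not how guidance's internals actually work, and the sentinel format here is invented), one way to bridge the gap might be to mark images with a sentinel in the flat prompt string and split it back into OpenAI-style message parts before calling create_chat_completion:

```python
import re

def prompt_to_parts(prompt: str) -> list:
    # Split a flat prompt string into text / image_url parts around
    # hypothetical "<<image:...>>" sentinels (not a real guidance token)
    parts = []
    pos = 0
    for m in re.finditer(r"<<image:(.*?)>>", prompt):
        if m.start() > pos:
            parts.append({"type": "text", "text": prompt[pos:m.start()]})
        parts.append({"type": "image_url", "image_url": {"url": m.group(1)}})
        pos = m.end()
    if pos < len(prompt):
        parts.append({"type": "text", "text": prompt[pos:]})
    return parts

parts = prompt_to_parts(
    "USER: What is this? <<image:data:image/png;base64,AAAA>> ASSISTANT:"
)
# parts now holds a text part, an image_url part, and a trailing text part
```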
@slundberg Could you please provide us with high-level pointers to start our dev efforts? Thx!