Research issue: gather examples of multi-modal API calls from different LLMs
To aid in the design for both of these:
- #331
- #556
I'm going to gather a bunch of examples of how different LLMs accept multi-modal inputs. I'm particularly interested in the following:
- What kind of files do they accept?
- Do they accept file uploads, inline base64 files, URL references, or some combination of these?
- How are these interspersed with text prompts? This will help inform the database schema design for #556
- If a text prompt is included alongside the files, does it go before or after them?
- How many files can be attached at once?
- Is extra information such as the mimetype needed? If so, this helps inform the CLI design (can I do `--file filename.ext`, or do I need some other mechanism that also provides the type?)
Simple GPT-4o example from https://simonwillison.net/2024/Aug/25/covidsewage-alt-text/
import base64, openai

client = openai.OpenAI()

with open("/tmp/covid.png", "rb") as image_file:
    encoded_image = base64.b64encode(image_file.read()).decode("utf-8")

messages = [
    {
        "role": "system",
        "content": "Return the concentration levels in the sewersheds - single paragraph, no markdown",
    },
    {
        "role": "user",
        "content": [
            {
                "type": "image_url",
                "image_url": {"url": "data:image/png;base64," + encoded_image},
            }
        ],
    },
]

completion = client.chat.completions.create(model="gpt-4o", messages=messages)
print(completion.choices[0].message.content)
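Relevant to the base64-versus-URL question above: the same Chat Completions call also accepts a plain https:// URL in the image_url field instead of a data: URI. A minimal sketch reusing the client from the example above (the URL here is just a placeholder):

completion = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image"},
                {
                    "type": "image_url",
                    # A regular HTTPS URL works here as well as a data: URI
                    "image_url": {"url": "https://example.com/image.png"},
                },
            ],
        }
    ],
)
print(completion.choices[0].message.content)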
Claude image example from https://github.com/simonw/tools/blob/0249ab83775861f549abb1aa80af0ca3614dc5ff/haiku.html
const requestBody = {
  model: "claude-3-haiku-20240307",
  max_tokens: 1024,
  messages: [
    {
      role: "user",
      content: [
        {
          type: "image",
          source: {
            type: "base64",
            media_type: "image/jpeg",
            data: base64Image,
          },
        },
        { type: "text", text: "Return a haiku inspired by this image" },
      ],
    },
  ],
};

fetch("https://api.anthropic.com/v1/messages", {
  method: "POST",
  headers: {
    "x-api-key": apiKey,
    "anthropic-version": "2023-06-01",
    "content-type": "application/json",
    "anthropic-dangerous-direct-browser-access": "true"
  },
  body: JSON.stringify(requestBody),
})
  .then((response) => response.json())
  .then((data) => {
    console.log(JSON.stringify(data, null, 2));
    const haiku = data.content[0].text;
    responseElement.innerText += haiku + "\n\n";
  })
  .catch((error) => {
    console.error("Error sending image to the Anthropic API:", error);
  })
  .finally(() => {
    // Hide "Generating..." message
    generatingElement.style.display = "none";
  });
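For comparison with the OpenAI example, here's a sketch of the same request in Python using the anthropic SDK, assuming base64_image already holds the base64-encoded JPEG:

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
message = client.messages.create(
    model="claude-3-haiku-20240307",
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/jpeg",
                        "data": base64_image,
                    },
                },
                {"type": "text", "text": "Return a haiku inspired by this image"},
            ],
        }
    ],
)
print(message.content[0].text)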
Basic Gemini example from https://github.com/simonw/llm-gemini/blob/4195c4396834e5bccc3ce9a62647591e1b228e2e/llm_gemini.py (my images branch):
messages = []
if conversation:
    for response in conversation.responses:
        messages.append(
            {"role": "user", "parts": [{"text": response.prompt.prompt}]}
        )
        messages.append({"role": "model", "parts": [{"text": response.text()}]})

if prompt.images:
    for image in prompt.images:
        messages.append(
            {
                "role": "user",
                "parts": [
                    {
                        "inlineData": {
                            "mimeType": "image/jpeg",
                            "data": base64.b64encode(image.read()).decode("utf-8"),
                        }
                    }
                ],
            }
        )

messages.append({"role": "user", "parts": [{"text": prompt.prompt}]})
Example from Google AI Studio:
API_KEY="YOUR_API_KEY"

# TODO: Make the following files available on the local file system.
FILES=("image.jpg")
MIME_TYPES=("image/jpeg")

for i in "${!FILES[@]}"; do
  NUM_BYTES=$(wc -c < "${FILES[$i]}")
  curl "https://generativelanguage.googleapis.com/upload/v1beta/files?key=${API_KEY}" \
    -H "X-Goog-Upload-Command: start, upload, finalize" \
    -H "X-Goog-Upload-Header-Content-Length: ${NUM_BYTES}" \
    -H "X-Goog-Upload-Header-Content-Type: ${MIME_TYPES[$i]}" \
    -H "Content-Type: application/json" \
    -d "{'file': {'display_name': '${FILES[$i]}'}}" \
    --data-binary "@${FILES[$i]}"
  # TODO: Read the file.uri from the response, store it as FILE_URI_${i}
done
# Adjust safety settings in generationConfig below.
# See https://ai.google.dev/gemini-api/docs/safety-settings
curl \
  -X POST https://generativelanguage.googleapis.com/v1beta/models/gemini-1.5-pro-exp-0801:generateContent?key=${API_KEY} \
  -H 'Content-Type: application/json' \
  -d @<(echo '{
    "contents": [
      {
        "role": "user",
        "parts": [
          {
            "fileData": {
              "fileUri": "${FILE_URI_0}",
              "mimeType": "image/jpeg"
            }
          }
        ]
      },
      {
        "role": "user",
        "parts": [
          {
            "text": "Describe image in detail"
          }
        ]
      }
    ],
    "generationConfig": {
      "temperature": 1,
      "topK": 64,
      "topP": 0.95,
      "maxOutputTokens": 8192,
      "responseMimeType": "text/plain"
    }
  }')
Here's Gemini Pro accepting multiple images at once: https://ai.google.dev/gemini-api/docs/vision?lang=python#prompt-multiple
import PIL.Image
sample_file = PIL.Image.open('sample.jpg')
sample_file_2 = PIL.Image.open('piranha.jpg')
sample_file_3 = PIL.Image.open('firefighter.jpg')
model = genai.GenerativeModel(model_name="gemini-1.5-pro")
prompt = (
    "Write an advertising jingle showing how the product in the first image "
    "could solve the problems shown in the second two images."
)
response = model.generate_content([prompt, sample_file, sample_file_2, sample_file_3])
print(response.text)
It says:
When the combination of files and system instructions that you intend to send is larger than 20MB in size, use the File API to upload those files, as previously shown. Smaller files can instead be called locally from the Gemini API:
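For completeness, the File API route from Python appears to be an upload_file() call whose return value is passed straight into generate_content() - a sketch using the google.generativeai SDK, with "sample.jpg" as a placeholder path:

import google.generativeai as genai

genai.configure(api_key="...")

# Upload via the Files API, then reference the returned file object in the prompt
sample_file = genai.upload_file(path="sample.jpg")

model = genai.GenerativeModel(model_name="gemini-1.5-pro")
response = model.generate_content([sample_file, "Describe image in detail"])
print(response.text)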
I just saw that Gemini has been trained to return bounding boxes: https://ai.google.dev/gemini-api/docs/vision?lang=python#bbox
I tried this:
>>> import google.generativeai as genai
>>> genai.configure(api_key="...")
>>> model = genai.GenerativeModel(model_name="gemini-1.5-pro-latest")
>>> import PIL.Image
>>> pelicans = PIL.Image.open('/tmp/pelicans.jpeg')
>>> prompt = 'Return bounding boxes for every pelican in this photo - for each one return [ymin, xmin, ymax, xmax]'
>>> response = model.generate_content([pelicans, prompt])
>>> print(response.text)
I found the following bounding boxes:
- [488, 945, 519, 999]
- [460, 259, 487, 307]
- [472, 574, 498, 612]
- [459, 431, 483, 476]
- [530, 519, 555, 560]
- [445, 733, 470, 769]
- [493, 805, 516, 850]
- [418, 545, 441, 581]
- [400, 428, 425, 466]
- [593, 519, 616, 543]
- [428, 93, 451, 135]
- [431, 224, 456, 266]
- [586, 941, 609, 964]
- [602, 711, 623, 735]
- [397, 500, 419, 535]
I could not find any other pelicans in this image.
Against this photo:
It got 15 - I count 20.
I don't think those bounding boxes are in the right places. I built a Claude Artifact to render them, and I may not have built it right, but I got this:
Code here: https://static.simonwillison.net/static/2024/gemini-bounding-box-tool.html
Transcript: https://gist.github.com/simonw/40ff639e96d55a1df7ebfa7db1974b92
Tried it again with this photo of goats and got a slightly more convincing result:
>>> goats = PIL.Image.open("/tmp/goats.jpeg")
>>> prompt = 'Return bounding boxes around every goat, for each one return [ymin, xmin, ymax, xmax]'
>>> response = model.generate_content([goats, prompt])
>>> print(response.text)
- 200 90 745 527 goat
- 300 610 904 937 goat
Oh! I tried different ways of interpreting the coordinates and it turned out this one rendered correctly:
[255, 473, 800, 910]
[96, 63, 700, 390]
Rendered:
I mucked around a bunch and came up with this, which seems to work: https://static.simonwillison.net/static/2024/gemini-bounding-box-tool-fixed.html
It does a better job with the pelicans, although those boxes still clearly aren't right. The goats are spot on though!
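For the record, the interpretation that seems to match the fixed tool: treat each box as [ymin, xmin, ymax, xmax] on a 0-1000 scale and multiply by the image dimensions. A rough Python sketch of that conversion (to_pixel_box is my own name for it):

import PIL.Image

def to_pixel_box(box, image):
    # Gemini appears to return [ymin, xmin, ymax, xmax] scaled to 0-1000
    ymin, xmin, ymax, xmax = box
    width, height = image.size
    return (
        xmin / 1000 * width,   # left
        ymin / 1000 * height,  # top
        xmax / 1000 * width,   # right
        ymax / 1000 * height,  # bottom
    )

pelicans = PIL.Image.open("/tmp/pelicans.jpeg")
print(to_pixel_box([488, 945, 519, 999], pelicans))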
Fun, with this heron it found the reflection too:
>>> heron = PIL.Image.open("/tmp/heron.jpeg")
>>> prompt = 'Return bounding boxes around every heron, [ymin, xmin, ymax, xmax]'
>>> response = model.generate_content([heron, prompt])
>>> print(response.text)
- [431, 478, 625, 575]
- [224, 493, 411, 606]
Based on all of that, I built this tool: https://tools.simonwillison.net/gemini-bbox
You have to paste in a Gemini API key when you use it, which gets stashed in localStorage (like my Haiku tool).
See full blog post here: https://simonwillison.net/2024/Aug/26/gemini-bounding-box-visualization/
I'd like to run an image model in llama-cpp-python - this one would be good: https://huggingface.co/openbmb/MiniCPM-V-2_6-gguf/tree/main
The docs at https://llama-cpp-python.readthedocs.io/en/latest/#multi-modal-models seem to want a path to a CLIP model though, which I'm not sure how to obtain.
https://huggingface.co/xtuner/llava-llama-3-8b-v1_1-gguf would be a good one to figure out the Python / llama-cpp-python recipe for too.
According to perplexity.ai "the mmproj model is essentially equivalent to the CLIP model in the context of llama-cpp-python and GGUF (GGML Unified Format) files for multimodal models like LLaVA and minicpm2.6"
https://huggingface.co/openbmb/MiniCPM-V-2_6-gguf/resolve/main/mmproj-model-f16.gguf?download=true
It appears the underlying embedding model used is google/siglip-base-patch16-224
I have used MiniCPM-V-2_6 with bleeding-edge llama.cpp and it works quite well:
ffmpeg -i ./clip.mp4 \
-vf fps=1/3,scale=480:480:force_original_aspect_ratio=decrease \
-q:v 2 ./f/frame_%04d.jpg
./llama-minicpmv-cli \
-m ./mini2.6/ggml-model-Q4_K_M.gguf \
--mmproj ./mini2.6/mmproj-model-f16.gguf \
--image ./f/frame_0001.jpg \
--image ./f/frame_0002.jpg \
--image ./f/frame_0003.jpg \
--image ./f/frame_0004.jpg \
--temp 0.1 \
-p "describe the images in detail in english language" \
-c 4096
Wow, it appears this functionality was added to llama-cpp-python just yesterday. Eagerly looking forward to MiniCPM-V-2_6-gguf as a supported llm multimodal model:
https://github.com/abetlen/llama-cpp-python/commit/ad2deafa8d615e9eaf0f8c3976e465fb1a3ea15f
@simonw
I tried the newest 2.90 version of llama-cpp-python and it works! Instead of ggml-model-f16.gguf you can use ggml-model-Q4_K_M.gguf if you prefer:
from llama_cpp import Llama
from llama_cpp.llama_chat_format import MiniCPMv26ChatHandler

chat_handler = MiniCPMv26ChatHandler.from_pretrained(
    repo_id="openbmb/MiniCPM-V-2_6-gguf",
    filename="*mmproj*",
)

llm = Llama.from_pretrained(
    repo_id="openbmb/MiniCPM-V-2_6-gguf",
    filename="ggml-model-f16.gguf",
    chat_handler=chat_handler,
    n_ctx=4096,  # n_ctx should be increased to accommodate the image embedding
)

response = llm.create_chat_completion(
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
                    },
                },
            ],
        }
    ],
)

print(response["choices"][0])
print(response["choices"][0]["message"]["content"])
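Relevant to the file-upload-versus-inline question above: local files appear to work with the same chat handler if you inline them as base64 data: URIs instead of an https:// URL. A sketch reusing the llm object above, with a hypothetical local path and helper name:

import base64

def image_to_data_uri(path, mime_type="image/jpeg"):
    # Inline a local file as a data: URI, since the handler expects an image_url
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"

response = llm.create_chat_completion(
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {
                    "type": "image_url",
                    "image_url": {"url": image_to_data_uri("/tmp/image.jpg")},
                },
            ],
        }
    ],
)
print(response["choices"][0]["message"]["content"])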
Thank you! That’s exactly what I needed to know.
Ollama 0.3.10: captured the HTTP conversation to /api/chat from the ollama CLI client. The prompt was: "/tmp/image.jpg OCR the text from the image."
POST /api/chat HTTP/1.1
Host: 127.0.0.1:11434
User-Agent: ollama/0.3.10 (amd64 linux) Go/go1.22.5
Content-Length: 1370164
Accept: application/x-ndjson
Content-Type: application/json
Accept-Encoding: gzip
{"model":"minicpm-v","messages":[{"role":"user","content":" OCR the text from the image.","images":["/9j/2wC<truncated base64>/9k="]}],"format":"","options":{}}
The same JSON, pretty-printed:
{
  "model": "minicpm-v",
  "messages": [
    {
      "role": "user",
      "content": " OCR the text from the image.",
      "images": [
        "/9j/2wC<truncated base64>/9k="
      ]
    }
  ],
  "format": "",
  "options": {}
}
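Based on that capture, the same request from Python looks like it would just be a POST with the base64 image in the images list - a sketch using only the standard library, with stream disabled so a single JSON object comes back (the image path is a placeholder):

import base64, json, urllib.request

with open("/tmp/image.jpg", "rb") as f:
    encoded_image = base64.b64encode(f.read()).decode("utf-8")

body = json.dumps({
    "model": "minicpm-v",
    "messages": [
        {
            "role": "user",
            "content": "OCR the text from the image.",
            "images": [encoded_image],
        }
    ],
    "stream": False,
}).encode("utf-8")

request = urllib.request.Request(
    "http://127.0.0.1:11434/api/chat",
    data=body,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(request) as response:
    print(json.loads(response.read())["message"]["content"])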
This research informed the attachments feature shipped in:
- #590