Research issue: gather examples of multi-modal API calls from different LLMs
To aid in the design for both of these:
- #331
- #556
I'm going to gather a bunch of examples of how different LLMs accept multi-modal inputs. I'm particularly interested in the following:
- What kind of files do they accept?
- Do they accept file uploads, inline base64 files, URL references, or some combination of these?
- How are these interspersed with text prompts? This will help inform the database schema design for #556
- If a text prompt is included alongside the files, does it go before or after them?
- How many files can be attached at once?
- Is extra information such as the mimetype needed? If so, this helps inform the CLI design (can I do `--file filename.ext`, or do I need some other mechanism that also provides the type?)
Simple GPT-4o example from https://simonwillison.net/2024/Aug/25/covidsewage-alt-text/
import base64, openai

client = openai.OpenAI()

with open("/tmp/covid.png", "rb") as image_file:
    encoded_image = base64.b64encode(image_file.read()).decode("utf-8")

messages = [
    {
        "role": "system",
        "content": "Return the concentration levels in the sewersheds - single paragraph, no markdown",
    },
    {
        "role": "user",
        "content": [
            {
                "type": "image_url",
                "image_url": {"url": "data:image/png;base64," + encoded_image},
            }
        ],
    },
]

completion = client.chat.completions.create(model="gpt-4o", messages=messages)
print(completion.choices[0].message.content)
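Relevant to the base64-versus-URL question above: the same Chat Completions call also accepts a plain https:// URL in the image_url field instead of a data: URI. A minimal sketch reusing the client from the example above (the URL here is just a placeholder):

completion = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image"},
                {
                    "type": "image_url",
                    # A regular HTTPS URL works here as well as a data: URI
                    "image_url": {"url": "https://example.com/image.png"},
                },
            ],
        }
    ],
)
print(completion.choices[0].message.content)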
Claude image example from https://github.com/simonw/tools/blob/0249ab83775861f549abb1aa80af0ca3614dc5ff/haiku.html
const requestBody = {
  model: "claude-3-haiku-20240307",
  max_tokens: 1024,
  messages: [
    {
      role: "user",
      content: [
        {
          type: "image",
          source: {
            type: "base64",
            media_type: "image/jpeg",
            data: base64Image,
          },
        },
        { type: "text", text: "Return a haiku inspired by this image" },
      ],
    },
  ],
};

fetch("https://api.anthropic.com/v1/messages", {
  method: "POST",
  headers: {
    "x-api-key": apiKey,
    "anthropic-version": "2023-06-01",
    "content-type": "application/json",
    "anthropic-dangerous-direct-browser-access": "true"
  },
  body: JSON.stringify(requestBody),
})
  .then((response) => response.json())
  .then((data) => {
    console.log(JSON.stringify(data, null, 2));
    const haiku = data.content[0].text;
    responseElement.innerText += haiku + "\n\n";
  })
  .catch((error) => {
    console.error("Error sending image to the Anthropic API:", error);
  })
  .finally(() => {
    // Hide "Generating..." message
    generatingElement.style.display = "none";
  });
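For comparison with the OpenAI example, here's a sketch of the same request in Python using the anthropic SDK, assuming base64_image already holds the base64-encoded JPEG:

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
message = client.messages.create(
    model="claude-3-haiku-20240307",
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/jpeg",
                        "data": base64_image,
                    },
                },
                {"type": "text", "text": "Return a haiku inspired by this image"},
            ],
        }
    ],
)
print(message.content[0].text)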
Basic Gemini example from https://github.com/simonw/llm-gemini/blob/4195c4396834e5bccc3ce9a62647591e1b228e2e/llm_gemini.py (my images branch):
messages = []
if conversation:
    for response in conversation.responses:
        messages.append(
            {"role": "user", "parts": [{"text": response.prompt.prompt}]}
        )
        messages.append({"role": "model", "parts": [{"text": response.text()}]})

if prompt.images:
    for image in prompt.images:
        messages.append(
            {
                "role": "user",
                "parts": [
                    {
                        "inlineData": {
                            "mimeType": "image/jpeg",
                            "data": base64.b64encode(image.read()).decode("utf-8"),
                        }
                    }
                ],
            }
        )

messages.append({"role": "user", "parts": [{"text": prompt.prompt}]})
Example from Google AI Studio:
API_KEY="YOUR_API_KEY"

# TODO: Make the following files available on the local file system.
FILES=("image.jpg")
MIME_TYPES=("image/jpeg")

for i in "${!FILES[@]}"; do
  NUM_BYTES=$(wc -c < "${FILES[$i]}")
  curl "https://generativelanguage.googleapis.com/upload/v1beta/files?key=${API_KEY}" \
    -H "X-Goog-Upload-Command: start, upload, finalize" \
    -H "X-Goog-Upload-Header-Content-Length: ${NUM_BYTES}" \
    -H "X-Goog-Upload-Header-Content-Type: ${MIME_TYPES[$i]}" \
    -H "Content-Type: application/json" \
    -d "{'file': {'display_name': '${FILES[$i]}'}}" \
    --data-binary "@${FILES[$i]}"
  # TODO: Read the file.uri from the response, store it as FILE_URI_${i}
done
# Adjust safety settings in generationConfig below.
# See https://ai.google.dev/gemini-api/docs/safety-settings
curl \
  -X POST https://generativelanguage.googleapis.com/v1beta/models/gemini-1.5-pro-exp-0801:generateContent?key=${API_KEY} \
  -H 'Content-Type: application/json' \
  -d @<(echo '{
    "contents": [
      {
        "role": "user",
        "parts": [
          {
            "fileData": {
              "fileUri": "${FILE_URI_0}",
              "mimeType": "image/jpeg"
            }
          }
        ]
      },
      {
        "role": "user",
        "parts": [
          {
            "text": "Describe image in detail"
          }
        ]
      }
    ],
    "generationConfig": {
      "temperature": 1,
      "topK": 64,
      "topP": 0.95,
      "maxOutputTokens": 8192,
      "responseMimeType": "text/plain"
    }
  }')
Here's Gemini Pro accepting multiple images at once: https://ai.google.dev/gemini-api/docs/vision?lang=python#prompt-multiple
import PIL.Image
sample_file = PIL.Image.open('sample.jpg')
sample_file_2 = PIL.Image.open('piranha.jpg')
sample_file_3 = PIL.Image.open('firefighter.jpg')
model = genai.GenerativeModel(model_name="gemini-1.5-pro")
prompt = (
    "Write an advertising jingle showing how the product in the first image "
    "could solve the problems shown in the second two images."
)
response = model.generate_content([prompt, sample_file, sample_file_2, sample_file_3])
print(response.text)
It says:
When the combination of files and system instructions that you intend to send is larger than 20MB in size, use the File API to upload those files, as previously shown. Smaller files can instead be called locally from the Gemini API:
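For completeness, the File API route from Python appears to be an upload_file() call whose return value is passed straight into generate_content() - a sketch using the google.generativeai SDK, with "sample.jpg" as a placeholder path:

import google.generativeai as genai

genai.configure(api_key="...")

# Upload via the Files API, then reference the returned file object in the prompt
sample_file = genai.upload_file(path="sample.jpg")

model = genai.GenerativeModel(model_name="gemini-1.5-pro")
response = model.generate_content([sample_file, "Describe image in detail"])
print(response.text)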
I just saw that Gemini has been trained to return bounding boxes: https://ai.google.dev/gemini-api/docs/vision?lang=python#bbox
I tried this:
>>> import google.generativeai as genai
>>> genai.configure(api_key="...")
>>> model = genai.GenerativeModel(model_name="gemini-1.5-pro-latest")
>>> import PIL.Image
>>> pelicans = PIL.Image.open('/tmp/pelicans.jpeg')
>>> prompt = 'Return bounding boxes for every pelican in this photo - for each one return [ymin, xmin, ymax, xmax]'
>>> response = model.generate_content([pelicans, prompt])
>>> print(response.text)
I found the following bounding boxes:
- [488, 945, 519, 999]
- [460, 259, 487, 307]
- [472, 574, 498, 612]
- [459, 431, 483, 476]
- [530, 519, 555, 560]
- [445, 733, 470, 769]
- [493, 805, 516, 850]
- [418, 545, 441, 581]
- [400, 428, 425, 466]
- [593, 519, 616, 543]
- [428, 93, 451, 135]
- [431, 224, 456, 266]
- [586, 941, 609, 964]
- [602, 711, 623, 735]
- [397, 500, 419, 535]
I could not find any other pelicans in this image.
Against this photo:
It got 15 - I count 20.
I don't think those bounding boxes are in the right places. I built a Claude Artifact to render them, and I may not have built it right, but I got this:
Code here: https://static.simonwillison.net/static/2024/gemini-bounding-box-tool.html
Transcript: https://gist.github.com/simonw/40ff639e96d55a1df7ebfa7db1974b92
Tried it again with this photo of goats and got a slightly more convincing result:
>>> goats = PIL.Image.open("/tmp/goats.jpeg")
>>> prompt = 'Return bounding boxes around every goat, for each one return [ymin, xmin, ymax, xmax]'
>>> response = model.generate_content([goats, prompt])
>>> print(response.text)
- 200 90 745 527 goat
- 300 610 904 937 goat
Oh! I tried different ways of interpreting the coordinates and it turned out this one rendered correctly:
[255, 473, 800, 910]
[96, 63, 700, 390]
Rendered:
I mucked around a bunch and came up with this, which seems to work: https://static.simonwillison.net/static/2024/gemini-bounding-box-tool-fixed.html
It does a better job with the pelicans, although those boxes still clearly aren't right. The goats are spot on though!
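For the record, the interpretation that seems to match the fixed tool: treat each box as [ymin, xmin, ymax, xmax] on a 0-1000 scale and multiply by the image dimensions. A rough Python sketch of that conversion (to_pixel_box is my own name for it):

import PIL.Image

def to_pixel_box(box, image):
    # Gemini appears to return [ymin, xmin, ymax, xmax] scaled to 0-1000
    ymin, xmin, ymax, xmax = box
    width, height = image.size
    return (
        xmin / 1000 * width,   # left
        ymin / 1000 * height,  # top
        xmax / 1000 * width,   # right
        ymax / 1000 * height,  # bottom
    )

pelicans = PIL.Image.open("/tmp/pelicans.jpeg")
print(to_pixel_box([488, 945, 519, 999], pelicans))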
Fun, with this heron it found the reflection too:
>>> heron = PIL.Image.open("/tmp/heron.jpeg")
>>> prompt = 'Return bounding boxes around every heron, [ymin, xmin, ymax, xmax]'
>>> response = model.generate_content([heron, prompt])
>>> print(response.text)
- [431, 478, 625, 575]
- [224, 493, 411, 606]
Based on all of that, I built this tool: https://tools.simonwillison.net/gemini-bbox
You have to paste in a Gemini API key when you use it, which gets stashed in localStorage (like my Haiku tool).
See full blog post here: https://simonwillison.net/2024/Aug/26/gemini-bounding-box-visualization/
I'd like to run an image model in llama-cpp-python - this one would be good: https://huggingface.co/openbmb/MiniCPM-V-2_6-gguf/tree/main
The docs at https://llama-cpp-python.readthedocs.io/en/latest/#multi-modal-models seem to want a path to a CLIP model though, which I'm not sure how to obtain.
https://huggingface.co/xtuner/llava-llama-3-8b-v1_1-gguf would be a good one to figure out the Python / llama-cpp-python recipe for too.
According to perplexity.ai "the mmproj model is essentially equivalent to the CLIP model in the context of llama-cpp-python and GGUF (GGML Unified Format) files for multimodal models like LLaVA and minicpm2.6"
https://huggingface.co/openbmb/MiniCPM-V-2_6-gguf/resolve/main/mmproj-model-f16.gguf?download=true
It appears the underlying embedding model used is google/siglip-base-patch16-224
I have used MiniCPM-V-2_6 with bleeding-edge llama.cpp and it works quite well:
ffmpeg -i ./clip.mp4 \
-vf fps=1/3,scale=480:480:force_original_aspect_ratio=decrease \
-q:v 2 ./f/frame_%04d.jpg
./llama-minicpmv-cli \
-m ./mini2.6/ggml-model-Q4_K_M.gguf \
--mmproj ./mini2.6/mmproj-model-f16.gguf \
--image ./f/frame_0001.jpg \
--image ./f/frame_0002.jpg \
--image ./f/frame_0003.jpg \
--image ./f/frame_0004.jpg \
--temp 0.1 \
-p "describe the images in detail in english language" \
-c 4096
Wow, it appears this functionality was added to llama-cpp-python just yesterday. Eagerly looking forward to MiniCPM-V-2_6-gguf as a supported llm multimodal model:
https://github.com/abetlen/llama-cpp-python/commit/ad2deafa8d615e9eaf0f8c3976e465fb1a3ea15f
@simonw
I tried the newest 2.90 version of llama-cpp-python and it works! Instead of ggml-model-f16.gguf you can use ggml-model-Q4_K_M.gguf if you prefer:
from llama_cpp import Llama
from llama_cpp.llama_chat_format import MiniCPMv26ChatHandler

chat_handler = MiniCPMv26ChatHandler.from_pretrained(
    repo_id="openbmb/MiniCPM-V-2_6-gguf",
    filename="*mmproj*",
)

llm = Llama.from_pretrained(
    repo_id="openbmb/MiniCPM-V-2_6-gguf",
    filename="ggml-model-f16.gguf",
    chat_handler=chat_handler,
    n_ctx=4096,  # n_ctx should be increased to accommodate the image embedding
)

response = llm.create_chat_completion(
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
                    },
                },
            ],
        }
    ],
)

print(response["choices"][0])
print(response["choices"][0]["message"]["content"])
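Relevant to the file-upload-versus-inline question above: local files appear to work with the same chat handler if you inline them as base64 data: URIs instead of an https:// URL. A sketch reusing the llm object above, with a hypothetical local path and helper name:

import base64

def image_to_data_uri(path, mime_type="image/jpeg"):
    # Inline a local file as a data: URI, since the handler expects an image_url
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"

response = llm.create_chat_completion(
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {
                    "type": "image_url",
                    "image_url": {"url": image_to_data_uri("/tmp/image.jpg")},
                },
            ],
        }
    ],
)
print(response["choices"][0]["message"]["content"])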
Thank you! That’s exactly what I needed to know.
Ollama 0.3.10: captured the HTTP conversation to /api/chat from the ollama CLI client. The prompt was: "/tmp/image.jpg OCR the text from the image."
POST /api/chat HTTP/1.1
Host: 127.0.0.1:11434
User-Agent: ollama/0.3.10 (amd64 linux) Go/go1.22.5
Content-Length: 1370164
Accept: application/x-ndjson
Content-Type: application/json
Accept-Encoding: gzip
{"model":"minicpm-v","messages":[{"role":"user","content":" OCR the text from the image.","images":["/9j/2wC<truncated base64>/9k="]}],"format":"","options":{}}
The same JSON, pretty-printed:
{
  "model": "minicpm-v",
  "messages": [
    {
      "role": "user",
      "content": " OCR the text from the image.",
      "images": [
        "/9j/2wC<truncated base64>/9k="
      ]
    }
  ],
  "format": "",
  "options": {}
}
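Based on that capture, the same request from Python looks like it would just be a POST with the base64 image in the images list - a sketch using only the standard library, with stream disabled so a single JSON object comes back (the image path is a placeholder):

import base64, json, urllib.request

with open("/tmp/image.jpg", "rb") as f:
    encoded_image = base64.b64encode(f.read()).decode("utf-8")

body = json.dumps({
    "model": "minicpm-v",
    "messages": [
        {
            "role": "user",
            "content": "OCR the text from the image.",
            "images": [encoded_image],
        }
    ],
    "stream": False,
}).encode("utf-8")

request = urllib.request.Request(
    "http://127.0.0.1:11434/api/chat",
    data=body,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(request) as response:
    print(json.loads(response.read())["message"]["content"])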
This research informed the attachments feature shipped in:
- #590