
Multi-modal support for vision models such as GPT-4 vision

Open cmungall opened this issue 2 years ago • 44 comments

https://platform.openai.com/docs/guides/vision

I think this is best handled by command line options --image and --image-urls to either encode and pass as base64, or to pass a URL.

cmungall avatar Nov 07 '23 00:11 cmungall

Indeed this would be awesome. Does it require changes to llm or can it be done in a plugin?

tomviner avatar Nov 08 '23 00:11 tomviner

I suspect we'll be seeing more multimodal models so inclusion in core makes sense, but I defer to @simonw on this!

cmungall avatar Nov 08 '23 00:11 cmungall

I've been thinking about this a lot.

The challenge here is that we need to be able to mix both text and images together in the same prompt - because you can call GPT-4 vision with this kind of thing:

Take a look at this image:

<image 1>

Now compare it to this:

<image 2>

My first instinct was to support syntax like this:

llm -m gpt-4-vision \
  "Take a look at this image:" \
  -i image1.jpeg \
  "Now compare it to this:" \
  -i https://example.com/image2.png

Note that the -i/--image option here takes a filename or a URL, distinguishing the two by checking whether the value corresponds to a file on disk.
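That detection step could be a very small helper - a minimal sketch, assuming a hypothetical `resolve_image_argument` function (not part of llm):

```python
from pathlib import Path

def resolve_image_argument(value: str) -> dict:
    """Hypothetical helper: -i values that exist on disk are treated as
    files; anything else is assumed to be a URL."""
    if Path(value).exists():
        return {"type": "file", "path": value}
    return {"type": "url", "url": value}
```

One edge case this glosses over: a local file whose name looks like a URL would still be picked up as a file, which is probably the right precedence.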

But... I don't think I can implement this, because Click really, really doesn't want to provide a mechanism for storing and retrieving the order of different arguments and parameters relative to each other:

  • https://github.com/pallets/click/issues/567
  • https://github.com/pallets/click/issues/1427

I spent some time trying to get this to work with a custom Click command class and parse_args() but determined that I'd effectively have to re-implement the whole Click argument parser from scratch to handle cases like --enable-logging boolean flags and -p key value multi-value parameters. This doesn't feel worthwhile to me.

So now I'm considering the following instead:

llm "look at this image" -i image.jpeg --tbc
llm -c "and compare it with" -i https://example.com/image.png

The trick here is the new --tbc flag, which stands for "to be continued". It causes the prompt to be stored but NOT executed against the model yet - instead, any following llm -c calls can be used to stack up more context in the prompt, which will be executed the first time --tbc is NOT used.
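The queueing logic could work something like this sketch - the storage location and `add_fragment` function are hypothetical, purely to illustrate the store-then-flush flow:

```python
import json
from pathlib import Path

# Hypothetical location for the pending prompt; llm would presumably
# use its own state directory instead.
PENDING = Path("/tmp/llm-pending-prompt.json")

def add_fragment(fragment: dict, tbc: bool):
    """Append a prompt fragment. With tbc=True the fragment is stored and
    nothing is sent; once tbc=False, return the full stacked prompt and
    clear the queue so it can be executed against the model."""
    fragments = json.loads(PENDING.read_text()) if PENDING.exists() else []
    fragments.append(fragment)
    if tbc:
        PENDING.write_text(json.dumps(fragments))
        return None  # stored, not yet executed
    if PENDING.exists():
        PENDING.unlink()
    return fragments  # ready to send
```

So `llm "look at this image" -i image.jpeg --tbc` would store one text fragment and one image fragment, and the next `llm -c ...` without --tbc would flush the whole stack in one request.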

On a related note: llm chat could also support this - maybe letting you do this kind of thing:

llm chat -m gpt-4-vision
look at this image
!image image.jpeg

For multi-line chats you would use the existing !multi command:

llm chat -m gpt-4-vision
!multi
look at this image
!image image.jpeg
and compare it with
!image https://example.com/image.png
!end

simonw avatar Nov 08 '23 03:11 simonw

Crucially, I want to leave the door open for other LLM models provided by plugins - like maybe https://github.com/SkunkworksAI/BakLLaVA - to also support multi-modal inputs like this.

So I think the model class would have a supports_images = True property it could set to tell LLM that images are supported - otherwise using -i/--image would return an error.
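A minimal sketch of that capability check - the class and function names here are illustrative, not llm's actual API:

```python
class Model:
    # Hypothetical base class: plugins override this flag to advertise
    # that they can accept image inputs.
    supports_images = False

class VisionModel(Model):
    supports_images = True

def validate_images(model, images):
    """Reject -i/--image options for models that don't declare support."""
    if images and not model.supports_images:
        raise ValueError(
            f"{model.__class__.__name__} does not support -i/--image"
        )
```

The nice property of an opt-in flag is that existing plugins need no changes: they default to text-only and -i fails fast with a clear error.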

simonw avatar Nov 08 '23 03:11 simonw

One note about the --tbc thing is that we can get basic image support working without it - we could implement this and say that support for multiple images in the same prompt is coming later:

llm -m gpt-4-vision "Caption for this image" -i image.jpeg

simonw avatar Nov 08 '23 03:11 simonw

This work is blocked on:

  • #325

simonw avatar Nov 08 '23 03:11 simonw

Would be amazing to get this working with a Bakllava local model - relevant example code using llama.cpp here https://github.com/cocktailpeanut/mirror/blob/main/app.py

simonw avatar Nov 08 '23 20:11 simonw

Another claimed bakllava example (not tried it yet), this one using llama-cpp-python: https://advanced-stack.com/resources/multi-modalities-inference-using-mistral-ai-llava-bakllava-and-llama-cpp.html

[Actually uses `from llm_core.llm import LLaVACPPModel`. Trying to run the example code on my MacBook Pro M2 16GB, it just falls over; other chat models of a similar size seem to work okay.]

psychemedia avatar Nov 13 '23 09:11 psychemedia

@simonw how about f-strings/templating style?

llm "look at this image {src_image} and compare it to {compare_image}" \
    --infile src_image=sample.jpeg --infile compare_image=known.jpeg
import click

def _infiles_to_dict(
    ctx: click.Context, attribute: click.Option, infiles: tuple[str, ...]
) -> dict[str, str]:
    # Split each key=filename pair on the first "=" only, so filenames
    # containing "=" still work
    return {k: v for k, v in (f.split("=", 1) for f in infiles)}

@click.command()
@click.option(
    "-i",
    "--infile",
    multiple=True,
    callback=_infiles_to_dict,
    help="Input files in the form key=filename. Multiple files can be included.",
)
def prompt(infile: dict[str, str]) -> None:
    ...

Misc thoughts:

  • I do like the --tbc idea as well.
  • --image makes sense for now, but it might later change to --infile once models can take audio, video, and arbitrary multi-modal documents. The model would have to specify which formats it accepts. The prompt might then have to be `llm --infile {video.mp4:v}` unless some auto-detection of file format is done.

neomanic avatar Dec 04 '23 04:12 neomanic

https://github.com/tbckr/sgpt

SGPT additionally facilitates the utilization of the GPT-4 Vision API. Include input images using the -i or --input flag, supporting both URLs and local images.

$ sgpt -m "gpt-4-vision-preview" -i "https://upload.wikimedia.org/wikipedia/en/c/cb/Marvin_%28HHGG%29.jpg" "what can you see on the picture?"
The image shows a figure resembling a robot with a humanoid form. It has a
$ sgpt -m "gpt-4-vision-preview" -i pkg/fs/testdata/marvin.jpg "what can you see on the picture?"
The image shows a figure resembling a robot with a sleek, metallic surface. It

It is also possible to combine URLs and local images:

$ sgpt -m "gpt-4-vision-preview" -i "https://upload.wikimedia.org/wikipedia/en/c/cb/Marvin_%28HHGG%29.jpg" -i pkg/fs/testdata/marvin.jpg "what is the difference between those two pictures"
The two images provided appear to be identical. Both show the same depiction of a

NightMachinery avatar Dec 15 '23 13:12 NightMachinery

I built a prototype of this today, in the image-experimental branch - just for OpenAI so far using docs on https://platform.openai.com/docs/guides/vision but I want to also ship support for Gemini and Claude (and eventually local models like LLaVA).

I gave it this image:

[image]

And ran this:

llm -m 4v 'describe this image' -i image.jpg -o max_tokens 200

And got back:

This image shows a young pig being held by a person. The pig has a light brown coat with some bristle-like hair and a prominent snout that is characteristic of pigs. It appears to be a juvenile, given its size. The pig's snout is a bit dirty, suggesting it may have been rooting around in the ground, which is common pig behavior. The person is out of frame with only their arm visible, dressed in a red garment with a seemingly soft texture. They are holding the pig securely against their body. The background indicates that this is an indoor setting with wooden structures, possibly inside a barn or a similar animal enclosure.

simonw avatar Mar 04 '24 21:03 simonw

Lots still to do on this - I want it to support URLs, file paths, or `-` (stdin) as inputs. Those should then be made available to the model in a form where models like GPT-4 that support URL images can pass the URL in directly, while models like Claude 3 that only support base64 fetch that URL and then send it base64-encoded instead.

Maybe have a thing with Pillow as an optional dependency which can resize the images before sending them?

Have to decide what to do about logs. I think I need to log the images to the SQLite database (maybe in a new BLOB table) because I need them in conversations so I can send follow-up prompts - but that could take a lot of space. So I need to add tooling that helps users clean up old images from their database if it gets too big.

simonw avatar Mar 04 '24 21:03 simonw

I am going to pass around an image object that has a .url property that may or may not return a URL string (otherwise None) and a .bytes and .base64 property that ALWAYS return binary data or that data base64 encoded.

That way plugins like OpenAI that can be sent URLs can use .url first and fall back to .base64 if the URL is not available, and plugins like Claude 3 can use base64 every time.

I'm tempted to offer a .resized(max_width, max_height) method which returns a Pillow resized image for models that know there is a maximum or recommended size limit and want to send a smaller request.
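The property names here follow the comment above, but this class itself is just a sketch, not llm's implementation - URL fetching and the Pillow-based .resized() are stubbed out:

```python
import base64

class ImageAttachment:
    """Sketch of the image object described above."""

    def __init__(self, url=None, data=None):
        self._url = url
        self._data = data

    @property
    def url(self):
        # May be None - e.g. for images from disk or stdin
        return self._url

    @property
    def bytes(self):
        if self._data is None:
            # A real implementation would fetch self._url here
            raise NotImplementedError("URL fetching not sketched")
        return self._data

    @property
    def base64(self):
        # Always available: derived from .bytes
        return base64.b64encode(self.bytes).decode("ascii")
```

With this shape, the OpenAI plugin checks `img.url` first and falls back to `img.base64`, while the Claude plugin uses `img.base64` unconditionally.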

simonw avatar Mar 04 '24 23:03 simonw

Idea: rather than store the images in the database, I'll store the path to the files on disk.

If you attempt to continue a conversation where the file paths no longer resolve to existing images, you'll get an error.

simonw avatar Mar 06 '24 04:03 simonw

Would be nice if the API server gave you a reference for every uploaded image that you could just refer back to

tomviner avatar Mar 06 '24 14:03 tomviner

came here looking for non-text API endpoints... i was hoping to have a direct view into the audio and text-to-speech API endpoints, in particular.

so while it would be nice to have llm have a chat-like interface to interleave images, maybe an easier first step would be to have just a simple "prompt-to-image", "prompt-to-audio", "audio-to-text" kind of commands?

anarcat avatar Mar 07 '24 04:03 anarcat

Quick survey on Twitter: https://twitter.com/simonw/status/1768445876274635155

Consensus is loosely to do image and then text, rather than text then image:

[{"type": "image_url", "image_url": {"url": "..."}}, {"type": "text", "text": "Describe image"}]

simonw avatar Mar 15 '24 02:03 simonw

Claude 3 Haiku is cheaper than GPT-3.5 Turbo and supports image inputs - a great incentive to finally get this feature shipped!

simonw avatar Mar 15 '24 02:03 simonw

https://twitter.com/invisiblecomma/status/1768561708090417603

The Claude Vision docs recommend image first

https://docs.anthropic.com/claude/docs/vision#image-best-practices

Image placement: Just as with document-query placement, Claude works best when images come before text. Images placed after text or interpolated with text will still perform well, but if your use case allows it, we recommend image-then-text structure. See vision prompting tips for more details.

simonw avatar Mar 15 '24 12:03 simonw

the maximum allowed image file size is 5MB per image

Should I enforce this for the Claude model? Easiest to let Claude API return an error at first.

I'm not yet sure if LLM should depend on Pillow and use it to resize large images before sending them.

Maybe a plugin hook to allow things like resizing and HEIC conversion would be useful?
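If client-side enforcement does go in, it could be as simple as a byte-length check before sending - this sketch assumes "5MB" means binary megabytes, which is a guess about how the API measures it:

```python
MAX_IMAGE_BYTES = 5 * 1024 * 1024  # Claude's documented 5MB per-image limit

def check_image_size(data: bytes, limit: int = MAX_IMAGE_BYTES) -> None:
    """Fail fast locally rather than waiting for the API to reject the request."""
    if len(data) > limit:
        raise ValueError(
            f"Image is {len(data)} bytes; this model accepts at most {limit}"
        )
```

A resize hook could then catch this error and shrink the image instead of surfacing it to the user.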

simonw avatar Mar 15 '24 12:03 simonw

https://cloud.google.com/vertex-ai/generative-ai/docs/multimodal/design-multimodal-prompts#prompt-design-fundamentals

Put your image first for single-image prompts: While Gemini can handle image and text inputs in any order, for prompts containing a single image, it might perform better if that image (or video) is placed before the text prompt. However, for prompts that require images to be highly interleaved with texts to make sense, use whatever order is most natural.

simonw avatar Mar 16 '24 15:03 simonw

the maximum allowed image file size is 5MB per image

Should I enforce this for the Claude model? Easiest to let Claude API return an error at first.

I'm not yet sure if LLM should depend on Pillow and use it to resize large images before sending them.

Maybe a plugin hook to allow things like resizing and HEIC conversion would be useful?

IMO llm should compress/resize images to avoid errors and make things easy to use. You could add an option --no-image-resize which disables this behavior, and people who care will disable it. The average user (myself included) just wants the image to go to the model; the error is unhelpful.

BTW, OpenAI supports both low and high detail levels for processing images. Does Anthropic have something similar? Is this exposed in llm?

NightMachinery avatar Mar 18 '24 14:03 NightMachinery

I made a simple cli for vision, if anyone needs it before llm-vision is ready. Only supports GPT4 for now. :( https://github.com/irthomasthomas/llm-vision

It supports specifying an output format that prompts the model to generate markdown, or json in addition to plain text. One thing odd about gpt-4-vision is that it doesn't know you have given it an image, and sometimes doesn't believe it has vision capabilities unless you give it a phrase like 'describe the image'. But, if you want to extract an image to json, then a text description isn't very useful. So, I prompt it with 'describe the image in your head, then write the json document'.

There's also a work-in-progress gpt4-vision-screen-compare.py - this takes a screenshot every few seconds, compares its similarity with the last screenshot, and if they differ enough it sends both to the model asking it to explain the changes between them.

And here's a demo of what you can do with it: https://twitter.com/xundecidability/status/1763219017160867840

Problem: I wanted to import a blocked-domains list from Kagi into Bing Custom Search.

  • Discovered that Bing Custom Search requires manual data entry of blocked domains.

Solution: a little bash script that screenshots the Kagi blocked-domains list, has GPT-4 Vision stream a text list of the domains, and uses xdotool to type the domains into the Bing webpage as they stream in.

irthomasthomas avatar Mar 29 '24 20:03 irthomasthomas

Current status:

  • Branch has -i support
  • I have GPT-4 Vision support, plus branches of llm-gemini and llm-claude-3

The main sticking point is what to do with the SQLite logging mechanism

It's important that llm -c "..." works for sending follow-up prompts. This means it needs to be able to send the image again.

Some ways that could work:

  • For images on disk, store the path to that image on disk. Use that again in follow-up prompts, and throw a hard error if the file is no longer visible.
  • Some models support URLs. For public URLs to images I can store those URLs, and let the APIs themselves error if the URLs are 404ing
  • Images fed in to standard input could be stored in the database, maybe as BLOB columns
  • But since being able to compare prompts responses is so useful, maybe I should store images from disk in BLOB too? The cost in terms of SQLite space taken up may be worth it.
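The BLOB-table option from that list could look roughly like this - the schema and function names are hypothetical, just to show how follow-up prompts would get the image bytes back:

```python
import sqlite3

def store_image(db: sqlite3.Connection, response_id: str, data: bytes) -> int:
    """Log image bytes against a response so llm -c can resend them later."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS images ("
        "id INTEGER PRIMARY KEY, response_id TEXT, content BLOB)"
    )
    cursor = db.execute(
        "INSERT INTO images (response_id, content) VALUES (?, ?)",
        (response_id, data),
    )
    return cursor.lastrowid

def load_images(db: sqlite3.Connection, response_id: str) -> list:
    """Fetch the stored images for a response, for follow-up prompts."""
    rows = db.execute(
        "SELECT content FROM images WHERE response_id = ?", (response_id,)
    )
    return [row[0] for row in rows]
```

The path and URL options would slot into the same table by storing a reference string instead of a BLOB, at the cost of the hard-error case when the file has moved.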

simonw avatar Apr 04 '24 01:04 simonw

Very nice! I'm not sure I'd want to include the image in every turn, though. I send a lot of full screenshots and my poor connection doesn't help. What I do currently is generate the description with a Python script and pipe that to llm to chat about it. If it's important I might include the file path in the prompt. Then the llm can act on the file, and I can search for the file in the logs DB.

Cheers, Thomas


irthomasthomas avatar Apr 04 '24 11:04 irthomasthomas

@simonw Just add an option --image-log-mode which can be set to db-blob. By default, don't store them; they would take up disk space for what are probably junk files.

NightMachinery avatar Apr 04 '24 12:04 NightMachinery

Another open question: how should this work in chat?

I'm inclined to add !image path-to-image.jpg as a thing you can use in chat to reference an image.

But then should it be submitted the moment you hit enter, or should you get the opportunity to add a prompt afterwards? I think adding a prompt afterwards makes sense.

Also should !image be allowed inside !multi? I'm not sure. If it IS, then how would you send that raw text to a model e.g. as part of a longer code sample you are pasting in?
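A sketch of how the chat loop might tokenize its input - the `parse_chat_input` function is hypothetical, and it also illustrates the !multi ambiguity: any pasted line starting with "!image " would be swallowed as a directive unless some escape is added:

```python
def parse_chat_input(lines):
    """Turn chat-session lines into prompt fragments, treating
    '!image <path-or-url>' as an image reference."""
    fragments = []
    for line in lines:
        if line.startswith("!image "):
            source = line[len("!image "):].strip()
            fragments.append({"type": "image", "source": source})
        else:
            fragments.append({"type": "text", "text": line})
    return fragments
```

Submitting on a following prompt (rather than the moment !image is entered) falls out naturally here: the fragments just accumulate until the user sends a plain line.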

simonw avatar Apr 04 '24 21:04 simonw

@simonw Just add an option --image-log-mode which can be set to db-blob. By default, don't store them, it will take disk space for probably junk files.

Yeah, I'm beginning to think I may need to add a whole settings/preferences mechanism to help solve this. llm settings set image_log_mode blob kind of thing.

simonw avatar Apr 04 '24 21:04 simonw

@simonw

I'm inclined to add !image path-to-image.jpg as a thing you can use in chat to reference an image.

Perhaps you can use a TUI hotkey? E.g., Ctrl-i for inserting images. Though this will quickly spiral out of control ... E.g., should the TUI present a dialogue for selecting files?

The ideal case is to be able to just paste, and detect images from the clipboard. But this seems impossible to do using native paste. Perhaps you can add a custom hotkey for pasting that checks the clipboard.

I have some functions for macOS that paste images, e.g.,

class='«class PNGf»'
osascript -e "tell application \"System Events\" to ¬
                  write (the clipboard as ${class}) to ¬
                          (make new file at folder \"${dir}\" with properties ¬
                                  {name:\"${name}\"})"

NightMachinery avatar Apr 04 '24 23:04 NightMachinery

For pasting I think I'll hold off until I have a web UI working - much easier to handle paste there (e.g. https://tools.simonwillison.net/ocr does that) than figure it out for the terminal.

It would be good to get this working though:

pbpaste | llm -m claude-3-opus 'describe this image' -i -

Oh, that's frustrating: it looks like pbpaste only works for text content, I tried pbpaste > /tmp/image.png and got a 0 byte file.

ChatGPT did come up with this recipe which seems to work:

osascript -e 'set theImage to the clipboard as «class PNGf»' \
  -e 'set theFile to open for access POSIX file "/tmp/clipboard.png" with write permission' \
  -e 'write theImage to theFile' \
  -e 'close access theFile' \
  && cat /tmp/clipboard.png && rm /tmp/clipboard.png

I imagine there are cleaner implementations than that. Would be easy to wrap one into a little zsh script or similar.

simonw avatar Apr 04 '24 23:04 simonw