gptel Add basic OpenAI vision support

Add basic OpenAI vision support

Open daedsidog opened this issue 7 months ago • 1 comments

I found myself requiring the use of vision models sometimes because it would otherwise be very unwieldy to describe my problem in text. I kept retreating back to the online interfaces where a lot of them don't even allow you to paste the images directly, and instead force you to first save and upload them.

There's a feature branch which adds OpenAI vision features based on the discussion in https://github.com/karthink/gptel/discussions/231 to org-mode, but its only usable via org-mode and in the dedicated chatting buffer.

This adds a transient menu option to send the model a one-time image:

To make this multi-modal, I had to forfeit ability to track image history, so you can only have one image "known" at a time.

I don't consider it a big limitation considering the usefulness. I wrote a small script to be able to send it disposable clipboard images, and here's how it looks like:

https://github.com/user-attachments/assets/57c38d81-e676-4d2b-82e3-c3464bf2b84f

https://github.com/user-attachments/assets/9fe36f20-ea65-4cf3-8071-3b9a28fc00bd

I only touched the OpenAI backend. There's currently no indication for the lack of support in the other ones.

The image can be a regular local image or a URL to one.

I didn't give too much thought to how it would fit inside the transient menu layout, so that would probably need to be adjusted (I also think now that -I would work much better for it than -p).

Extremely long image filenames look clunky as they drive away the rest of the transient columns. A new variable type needs to be created?

Jul 26 '24 06:07 daedsidog

gptel gptel copied to clipboard

Add basic OpenAI vision support

gptel
gptel copied to clipboard