
Multimodal support for Gemini

Open samrat opened this issue 1 year ago • 0 comments

Currently, it's only possible to send text messages using the Gemini adapter:

https://github.com/thmsmlr/instructor_ex/blob/1abd8473d05111c11a4d9033b6a88acc29737fa0/lib/instructor/adapters/gemini.ex#L61

The Gemini API supports image, video, and audio inputs. Unlike the OpenAI API, where you send the file contents base64-encoded, Gemini requires you to upload the file separately.

Would you be open to a PR that adds support for uploading files, or would you say that is out of scope for this project?

If it's out of scope, I can create a smaller PR that allows media URLs (with the upload happening outside the library):

Instructor.chat_completion(
  mode: :json_schema,
  model: "gemini-1.5-flash",
  response_model: VideoDesc,
  messages: [
    %{
      role: "user",
      content: [
        %{
          type: "video_url",
          video_url: %{
            url: "https://generativelanguage.googleapis.com/v1beta/files/..."
          }
        },
        %{
          type: "text",
          text: "What's going on in this video?"
        }
      ]
    }
  ]
)
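To make the smaller proposal concrete, here is a rough sketch of how the adapter might translate those content parts into Gemini's `parts` list, where an already-uploaded file is referenced through `file_data` with a `file_uri` and a `mime_type`. The module name `GeminiPartsSketch` and the `to_parts/1` function are hypothetical and not part of instructor_ex. The message format above carries no MIME type, so the optional `mime_type` key and its `"video/mp4"` fallback are assumptions made only for illustration.

```elixir
defmodule GeminiPartsSketch do
  @moduledoc false

  # Hypothetical helper, not part of instructor_ex: converts OpenAI-style
  # message content into the `parts` list of a Gemini request.

  # Plain string content becomes a single text part.
  def to_parts(content) when is_binary(content), do: [%{text: content}]

  # A list of typed content parts is converted one part at a time.
  def to_parts(content) when is_list(content), do: Enum.map(content, &to_part/1)

  defp to_part(%{type: "text", text: text}), do: %{text: text}

  defp to_part(%{type: "video_url", video_url: %{url: url} = video}) do
    # Assumption: the proposed format has no MIME type, so an optional
    # `mime_type` key is read here, falling back to "video/mp4".
    mime_type = Map.get(video, :mime_type, "video/mp4")
    %{file_data: %{mime_type: mime_type, file_uri: url}}
  end
end
```

With something like this in place, the adapter could call `GeminiPartsSketch.to_parts(message.content)` when building each entry in `contents`, and existing text-only messages would keep working because string content still maps to a single text part.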

samrat, Nov 12 '24 15:11