
feature(dsp) Added support for multimodality in googlevertexai client

Open ieBoytsov opened this issue 4 months ago • 1 comment

This PR is a follow-up to #616, where support for the googlevertexai client was added.

I found that with the recent PR the client only works for text-based models (like text-bison or gemini-pro), but not for multimodal models like gemini-pro-vision, because only a text prompt is passed to the googlevertexai generate_content method. However, generate_content can take either text alone, or text plus an image when a multimodal model is used (see the API docs). This PR adds multimodality support so that models like gemini-pro-vision can be used.
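As a minimal sketch of the two call shapes, assuming the vertexai SDK convention that generate_content accepts either a plain string or a list of parts (the helper name `build_contents` is hypothetical, not part of the SDK or of this PR):

```python
def build_contents(prompt, image=None):
    """Assemble the `contents` argument for generate_content.

    Text-only models take a plain string; multimodal models take a
    list mixing an image part with the text prompt.
    """
    if image is None:
        return prompt
    return [image, prompt]

# Text-only call shape: generate_content("Summarize this article.")
print(build_contents("Summarize this article."))
# Multimodal call shape: generate_content([image_part, "Describe this image."])
print(build_contents("Describe this image.", image="<image-part>"))
```

In the real client, the image element would be an SDK image part (e.g. built from raw bytes plus a MIME type) rather than a placeholder string.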

TLDR:

  1. Added an `is_multimodal` flag, set when "vision" appears in the model name. This covers models like "gemini-pro-vision", "gemini-1.5-pro-vision", and other similar names used by Google.
  2. Added an optional `image` parameter to `basic_request`, along with the option to pass both the image and the text to `generate_content` when all required conditions are met (`is_gemini` and `is_multimodal`).
  3. Supported history tracing with images for multimodal models.
  4. Applied linters, which introduced some minor changes; let me know if I should revert them.
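The first three steps above can be sketched roughly as follows. This is an illustrative outline, not the actual DSPy client: the class name is made up, and the real `basic_request` calls the Vertex AI SDK rather than returning a placeholder.

```python
class GoogleVertexAISketch:
    """Hypothetical sketch of the multimodal additions described in this PR."""

    def __init__(self, model_name: str):
        self.model_name = model_name
        # Step 1: flag multimodal models by the "vision" suffix Google uses,
        # e.g. "gemini-pro-vision" or "gemini-1.5-pro-vision".
        self.is_gemini = "gemini" in model_name
        self.is_multimodal = "vision" in model_name
        self.history = []

    def basic_request(self, prompt: str, image=None):
        # Step 2: pass both image and text only when all conditions are met.
        if self.is_gemini and self.is_multimodal and image is not None:
            contents = [image, prompt]
        else:
            contents = prompt
        # (The real client would call model.generate_content(contents) here;
        # a placeholder stands in for the model response.)
        response = f"<response from {self.model_name}>"
        # Step 3: keep the image in the history entry for multimodal models.
        self.history.append(
            {"prompt": prompt, "image": image, "response": response}
        )
        return response
```

For example, `GoogleVertexAISketch("gemini-pro-vision")` would set `is_multimodal` to True, while `GoogleVertexAISketch("gemini-pro")` would not, so the latter keeps the original text-only path.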

ieBoytsov avatar Apr 17 '24 14:04 ieBoytsov

Thanks @ieBoytsov for this PR! The changes look fine to me, but I want to note that DSPy is not fully functional with multi-modal capabilities just yet. It seems like this code would work, but you wouldn't be able to use the "image" response beyond accessing the model history without making other changes to DSPy functionality.

Currently open PRs for multi-modality, for reference: #792, #682. We'll likely need a vision LM backend to better support this.

arnavsinghvi11 avatar Apr 28 '24 19:04 arnavsinghvi11