feature(dsp) Added support for multimodality in googlevertexai client
This PR is a follow-up on top of #616, where support for the googlevertexai client was added.
I found that with the recent PR it only works for text-based models (like text-bison or gemini-pro), but it doesn't work for multimodal models like gemini-pro-vision, because only a text prompt is passed to the googlevertexai `generate_content` method. However, `generate_content` can take either text alone or text plus an image as input when a multimodal model is used. See the API docs. This PR adds support for multimodality so that models like gemini-pro-vision can be used.
TLDR:
- added a flag `is_multimodal` that is set if `vision` is present in the model name; this covers models like "gemini-pro-vision", "gemini-1.5-pro-vision", and other similar names used by Google
- added an optional `image` parameter to `basic_request` and an option to pass both image and text to `generate_content` if all required conditions are met (`is_gemini` and `is_multimodal`)
- supported history tracing with images for multimodal models
- applied linters, which resulted in some minor changes; let me know if we should drop those.
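For context, here is a minimal sketch of the text-plus-image call shape in the Vertex AI SDK that the `basic_request` change relies on. This is illustrative only, not the PR's actual code: the project, image path, and wiring are placeholders, and on older SDK versions the import path may be `vertexai.preview.generative_models`.

```python
# Illustrative sketch only; not the DSPy client code from this PR.
import vertexai
from vertexai.generative_models import GenerativeModel, Part

vertexai.init(project="my-gcp-project", location="us-central1")  # placeholder project

model_name = "gemini-pro-vision"
is_gemini = "gemini" in model_name
is_multimodal = "vision" in model_name  # the flag added in this PR

model = GenerativeModel(model_name)

with open("example.png", "rb") as f:  # placeholder image path
    image_part = Part.from_data(data=f.read(), mime_type="image/png")

prompt = "Describe this image."
if is_gemini and is_multimodal:
    # Multimodal path: text and image are passed together to generate_content.
    response = model.generate_content([prompt, image_part])
else:
    # Text-only path, unchanged behaviour.
    response = model.generate_content(prompt)

print(response.text)
```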
Thanks @ieBoytsov for this PR! The changes look fine to me, but I want to note that DSPy is not fully functional with multi-modal capabilities just yet. It seems like this code would work, but you wouldn't be able to use the "image" response beyond accessing the model history without making other changes to DSPy functionality.
Currently open PRs for multi-modality, for reference: #792, #682. We'll likely need a vision LM backend to better support this.
Any idea on how to structure the Signature so we can pass in multimodal payloads?
It'd probably make sense to add support beyond "vision" models and support full multi-modality as with `gemini-1.5-flash-001`.
One could see having prompts such as `[instructions, ex_doc1, ex_text1, ex_doc2, ex_text2, input_doc]`.
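As a rough, purely hypothetical illustration of that interleaved ordering (none of these helpers, file names, or fields exist in DSPy; `doc_part` is made up), such a prompt could map onto a list of parts passed to `generate_content`:

```python
# Hypothetical sketch of an interleaved [instructions, ex_doc1, ex_text1, ...] prompt;
# helper and file names are illustrative only. Assumes vertexai.init(...) was called.
from vertexai.generative_models import GenerativeModel, Part

def doc_part(path: str) -> Part:
    # Load a local image/document as an inline part (assumes PNG here).
    with open(path, "rb") as f:
        return Part.from_data(data=f.read(), mime_type="image/png")

instructions = "Extract the total amount from each receipt."
prompt = [
    instructions,
    doc_part("ex_doc1.png"), "Total: $12.40",  # worked example 1
    doc_part("ex_doc2.png"), "Total: $7.15",   # worked example 2
    doc_part("input_doc.png"),                 # the new document to label
]

response = GenerativeModel("gemini-1.5-flash-001").generate_content(prompt)
print(response.text)
```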
> It'd probably make sense to add support beyond "vision" models and support full multi-modality
Totally agree. I'm not currently working on this PR because @arnavsinghvi11 said a vision LM backend needs to be supported first, which is out of scope for this PR. I'm waiting for that work to be finished.