feature(dsp) Added support for multimodality in googlevertexai client
This PR is a follow-up on top of #616, where support for the googlevertexai client was added.
I found that with the recent PR it only works for text-based models (like text-bison or gemini-pro), but it doesn't work for multimodal models like gemini-pro-vision, because only a text prompt is passed to the googlevertexai `generate_content` method. However, `generate_content` can take either text alone or text plus an image as input when a multimodal model is used. See the API docs. This PR adds support for multimodality so that models like gemini-pro-vision can be used.
TLDR:
- added a flag `is_multimodal` that is set if `vision` is present in the model name; this covers models like "gemini-pro-vision", "gemini-1.5-pro-vision", and other similar names used by Google
- added an optional `image` parameter to `basic_request` and an option to pass both image and text to `generate_content` if all required conditions are met (`is_gemini` and `is_multimodal`)
- supported history tracing with images for multimodal models
- applied linters, which resulted in some minor changes; let me know if we should drop those.
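For context, here is a minimal sketch of the text-plus-image call shape in the Vertex AI SDK that the `basic_request` change relies on. This is illustrative only, not the PR's actual code: the project, image path, and wiring are placeholders, and on older SDK versions the import path may be `vertexai.preview.generative_models`.

```python
# Illustrative sketch only; not the DSPy client code from this PR.
import vertexai
from vertexai.generative_models import GenerativeModel, Part

vertexai.init(project="my-gcp-project", location="us-central1")  # placeholder project

model_name = "gemini-pro-vision"
is_gemini = "gemini" in model_name
is_multimodal = "vision" in model_name  # the flag added in this PR

model = GenerativeModel(model_name)

with open("example.png", "rb") as f:  # placeholder image path
    image_part = Part.from_data(data=f.read(), mime_type="image/png")

prompt = "Describe this image."
if is_gemini and is_multimodal:
    # Multimodal path: text and image are passed together to generate_content.
    response = model.generate_content([prompt, image_part])
else:
    # Text-only path, unchanged behaviour.
    response = model.generate_content(prompt)

print(response.text)
```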
Thanks @ieBoytsov for this PR! The changes look fine to me, but I want to note that DSPy is not fully functional with multi-modal capabilities just yet. It seems like this code would work, but you wouldn't be able to use the "image" response beyond accessing the model history without making other changes to DSPy functionality.
Currently open PRs for multi-modality, for reference: #792, #682. We'll likely need a vision LM backend to better support this.
Any idea on how to structure the Signature so we can pass in multimodal payloads?
It'd probably make sense to add support beyond "vision" models and support full multi-modality as with `gemini-1.5-flash-001`.
One could see having prompts such as `[instructions, ex_doc1, ex_text1, ex_doc2, ex_text2, input_doc]`.
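As a rough, purely hypothetical illustration of that interleaved ordering (none of these helpers, file names, or fields exist in DSPy; `doc_part` is made up), such a prompt could map onto a list of parts passed to `generate_content`:

```python
# Hypothetical sketch of an interleaved [instructions, ex_doc1, ex_text1, ...] prompt;
# helper and file names are illustrative only. Assumes vertexai.init(...) was called.
from vertexai.generative_models import GenerativeModel, Part

def doc_part(path: str) -> Part:
    # Load a local image/document as an inline part (assumes PNG here).
    with open(path, "rb") as f:
        return Part.from_data(data=f.read(), mime_type="image/png")

instructions = "Extract the total amount from each receipt."
prompt = [
    instructions,
    doc_part("ex_doc1.png"), "Total: $12.40",  # worked example 1
    doc_part("ex_doc2.png"), "Total: $7.15",   # worked example 2
    doc_part("input_doc.png"),                 # the new document to label
]

response = GenerativeModel("gemini-1.5-flash-001").generate_content(prompt)
print(response.text)
```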
> It'd probably make sense to add support beyond "vision" models and support full multi-modality
Totally agree. I'm not currently working on this PR because @arnavsinghvi11 said a vision LM backend needs to be supported first, which is out of scope for this PR. I'm waiting for that work to be finished.