Pydantic vision module to build off of and enable multimodal capabilities for all providers.
Hi all,
This library is amazing! I added something similar to the Pydantic Image class we use internally to abstract away all the messy conversions you otherwise need to do.
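A minimal sketch of the idea (illustrative field and method names, not necessarily the exact vision.py interface):

```python
# Minimal sketch only: wraps an image loaded from disk and normalizes it to a
# base64 data URI, which is what most chat APIs accept for inline images.
import base64
from pathlib import Path

from pydantic import BaseModel


class Image(BaseModel):
    data: str                      # base64-encoded image payload
    media_type: str = "image/png"

    @classmethod
    def from_file(cls, path: str) -> "Image":
        raw = Path(path).read_bytes()
        return cls(data=base64.b64encode(raw).decode("utf-8"))

    def to_data_uri(self) -> str:
        return f"data:{self.media_type};base64,{self.data}"
```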
Question: why Poetry? I'm team Hatch all the way and I'm not sure what to do with the poetry.lock file.
The problem with multimodality, however, is that joblib won't cache correctly unless we encode to PNG (which is easy to do with the vision.py module, but a better approach might be "close enough" similarity, maybe via CLIP embeddings?).
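Rough sketch of the workaround (hypothetical helper names, untested here): joblib builds cache keys by hashing arguments, and an in-memory PIL image doesn't hash deterministically, so serializing to PNG bytes first gives a stable key.

```python
import io

from joblib import Memory
from PIL import Image as PILImage

memory = Memory(".joblib_cache", verbose=0)


def to_png_bytes(img: PILImage.Image) -> bytes:
    # Deterministic serialization so joblib can hash the argument reliably.
    buf = io.BytesIO()
    img.save(buf, format="PNG")
    return buf.getvalue()


@memory.cache
def answer_question(png_bytes: bytes, question: str) -> str:
    ...  # call the vision model here; results cached on (png_bytes, question)
```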
If you all like, I can integrate more of our backend API, such as a Message class that has to_openai(), to_anthropic(), etc. methods to abstract away all of those API specifics and allow for interwoven image and text inputs.
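For example (sketch, using OpenAI's chat vision content format), to_openai() would render interleaved text and image parts as something like:

```python
# Illustrative output of a hypothetical to_openai() call: one message whose
# content interleaves text blocks and base64 image parts.
message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "What is happening in this image?"},
        {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}},
    ],
}
```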
Cheers!
After testing against the main branch, this also fixes some circular import bugs that aren't currently covered by tests.
Really interesting to see vision here. It would be very exciting to see it used in context, for example with an evaluation function and an example.
You need only wait until the end of the day; the application is in-context learning for robotics :)
This duplicates PR #682. I have no personal preference for either one, and it seems like yours is designed with more modalities in mind.
Tagging @KCaverly @CyrusOfEden to see if we should delay this until the backend refactor.
I believe the LiteLLM backend should interface with GPT-4V, but I imagine we will need to accommodate this with a distinct backend, so more work would be required. We should be able to leverage this work to speed that up, though.
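Something like this should work once it's wired up (a sketch, assuming LiteLLM's OpenAI-compatible completion interface; the model name and message shape may differ):

```python
import litellm

# Sketch only: vision-capable models accept the same interleaved content format
# as the OpenAI chat API, so the backend mostly needs to pass it through.
response = litellm.completion(
    model="gpt-4-vision-preview",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image_url", "image_url": {"url": "https://example.com/cat.png"}},
        ],
    }],
)
print(response.choices[0].message.content)
```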
Added examples/multimodal/visual_question_answering.ipynb and tested the API calls.
@isaacbmiller Could you provide more details about the plans for the backend refactor?
To get multimodal Claude and HF support, I wanted to add the following Pydantic models with to_* methods (to_openai(), to_anthropic(), etc.):

Message
    role: str
    content: list[Prompt]

Prompt
    text: str
    image: Image

We do this internally, but there are probably better design approaches.
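Roughly, a runnable sketch might look like this (field names and conversion details are still up for discussion; I've made the image optional here so text-only prompts work):

```python
from typing import Optional

from pydantic import BaseModel


class Image(BaseModel):
    data: str                      # base64-encoded payload
    media_type: str = "image/png"


class Prompt(BaseModel):
    text: str
    image: Optional[Image] = None  # optional so text-only prompts are valid


class Message(BaseModel):
    role: str
    content: list[Prompt]

    def to_openai(self) -> dict:
        # Flatten each Prompt into OpenAI-style text / image_url parts.
        parts: list[dict] = []
        for p in self.content:
            parts.append({"type": "text", "text": p.text})
            if p.image is not None:
                parts.append({
                    "type": "image_url",
                    "image_url": {"url": f"data:{p.image.media_type};base64,{p.image.data}"},
                })
        return {"role": self.role, "content": parts}
```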
@sebbyjp @arnavsinghvi11 I'm eager to test this. Is it possible to compile datasets with image inputs at this stage, to perform prompt optimisation?